How to organize fields in dataframe by repetition and drop duplicates

Question:

I have this

id phone1 phone2 
1  300    301
1  303    300
1  300    303
2  400    401

Want this

id phone1 phone2 phone3
1  300    303    301
2  400    401

I have tried grouping by id and phone1, applying a count aggregation, then iterating over the result and appending to a list, checking whether the (id, phone) pair is already there and summing the count column; then I do the same with phone2 into the same list.

After that I rebuild the dataframe by iterating over the list, but this is far too slow for the millions of rows I have to process.

dataframe1 = dataframe.groupby(['id', 'phone1']).count().reset_index()
dataframe2 = dataframe.groupby(['id', 'phone2']).count().reset_index()

Results to add to the list:

id phone1 phone2
1  300    2    
1  303    1
2  401    1

id phone1 phone2
1  300    1   
1  301    1
1  303    1
2  400    1

Answers:

Iterating over a dataframe is very slow and not recommended.

You can group the phones and collect them into a list for each id, then order them by repetition count, drop duplicates, and split the result into new columns.
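A minimal sketch of that idea (not the answerer's code; it assumes a frame `df` with the columns shown in the question):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "phone1": [300, 303, 300, 400],
    "phone2": [301, 300, 303, 401],
})

# Stack both phone columns into one long frame of (id, phone) pairs
phones = df.melt("id", value_name="phone")

# Count repetitions per (id, phone), then order most frequent first.
# A stable sort keeps ties in a deterministic order.
counts = (phones.groupby(["id", "phone"]).size()
                .sort_values(ascending=False, kind="stable")
                .reset_index(name="n"))

# Rank phones within each id and spread them into phone1..phoneN columns
counts["col"] = counts.groupby("id").cumcount().add(1)
out = (counts.pivot(index="id", columns="col", values="phone")
             .add_prefix("phone")
             .rename_axis(columns=None)
             .reset_index())
print(out)
```

Deduplication falls out of the `groupby` for free, since each (id, phone) pair collapses to a single counted row.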

Answered By: Rosenty

You can melt to reshape the phone columns to rows, then remove the duplicates per group. Finally, pivot to reshape back to wide format.

import pandas as pd  # needed for pd.NA

out = (df
   .melt('id')
   .drop_duplicates(['id', 'value'])
   .assign(col=lambda d: d.groupby('id').cumcount().add(1))
   .pivot_table(index='id', columns='col', values='value', fill_value=pd.NA)
   .astype('Int64')  # optional: nullable integers instead of floats
   .add_prefix('phone')
   .rename_axis(columns=None).reset_index()
)

output:

   id  phone1  phone2  phone3
0   1     300     303     301
1   2     400     401    <NA>
Answered By: mozway

You can achieve this via grouping and sorting groups by count.

First, collect phones from each phone column:

import itertools
import pandas as pd

phone_columns = df.columns[1:]
df["phone_list"] = df[phone_columns].apply(list, axis=1)
df = df.groupby("id").agg(phone_list_agg=("phone_list", lambda x: list(itertools.chain.from_iterable(x))))

                                        phone_list_agg
id                                                   
1   [300, 301, 303, 300, 300, 303, 800, 800, 800, 800]
2                                           [400, 401]

Then, group phones and sort by count:

df["phone_tuples"] = df.apply(lambda x: [(k,len(list(g))) for k,g in itertools.groupby(sorted(x["phone_list_agg"]))], axis=1)
df = df.drop("phone_list_agg", axis=1)
df["phone_tuples"] = df.apply(lambda x: sorted(x["phone_tuples"], key=lambda y:y[1], reverse=True), axis=1)

                                phone_tuples
id                                          
1   [(800, 4), (300, 3), (303, 2), (301, 1)]
2                       [(400, 1), (401, 1)]

Finally, unpack tuples into separate columns:

df = pd.DataFrame(data=[[y[0] for y in x] for x in df["phone_tuples"]], index=df.index)
df.columns = [f"phone{i}" for i in range(1, len(df.columns) + 1)]
df = df.reset_index()

   id  phone1  phone2  phone3  phone4
0   1     800     300   303.0   301.0
1   2     400     401     NaN     NaN

You can fill the NaN values above with a sentinel value (for example, -1) and convert the floats to int as follows:

df = df.fillna(-1).astype(int)
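If you would rather keep missing entries as missing instead of using a sentinel, pandas' nullable `Int64` dtype is an alternative (a sketch; the column values mirror the output shown above):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2],
                   "phone1": [800, 400],
                   "phone2": [300, 401],
                   "phone3": [303.0, float("nan")]})

# Nullable Int64 stores whole numbers and <NA> instead of forcing -1
df["phone3"] = df["phone3"].astype("Int64")
```

Values stay integers where present, and the missing cell prints as `<NA>`.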

Sample dataset used:

data=[
(1,300,301),
(1,303,300),
(1,300,303),
(2,400,401),
(1,800,800),
(1,800,800),
]

columns = ["id", "phone1", "phone2"]

df = pd.DataFrame(data=data, columns=columns)
Answered By: Azhar Khan