pandas – Merge nearly duplicate rows based on column value
Question:
I have a pandas dataframe with several rows that are near duplicates of each other, except for one value. My goal is to merge or “coalesce” these rows into a single row, without summing the numerical values.
Here is an example of what I’m working with:
Name Sid Use_Case Revenue
A xx01 Voice $10.00
A xx01 SMS $10.00
B xx02 Voice $5.00
C xx03 Voice $15.00
C xx03 SMS $15.00
C xx03 Video $15.00
And here is what I would like:
Name Sid Use_Case Revenue
A xx01 Voice, SMS $10.00
B xx02 Voice $5.00
C xx03 Voice, SMS, Video $15.00
The reason I don’t want to sum the “Revenue” column is because my table is the result of doing a pivot over several time periods where “Revenue” simply ends up getting listed multiple times instead of having a different value per “Use_Case”.
What would be the best way to tackle this issue? I’ve looked into the groupby() function but I still don’t understand it very well.
Answers:
I think you can use groupby with aggregate, combining 'first' for the columns that repeat with the custom function ', '.join for Use_Case:
df = df.groupby('Name').agg({'Sid': 'first',
                             'Use_Case': ', '.join,
                             'Revenue': 'first'}).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS $10.00
1 B xx02 Voice $5.00
2 C xx03 Voice, SMS, Video $15.00
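On a modern pandas (0.25 or later), the same aggregation can be written with named aggregation, which fixes the output column order without a separate reindexing step — a sketch using the question’s sample data:

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({'Name': ['A', 'A', 'B', 'C', 'C', 'C'],
                   'Sid': ['xx01', 'xx01', 'xx02', 'xx03', 'xx03', 'xx03'],
                   'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice', 'SMS', 'Video'],
                   'Revenue': ['$10.00', '$10.00', '$5.00',
                               '$15.00', '$15.00', '$15.00']})

# Named aggregation (pandas >= 0.25): each keyword names an output column
out = (df.groupby('Name', as_index=False)
         .agg(Sid=('Sid', 'first'),
              Use_Case=('Use_Case', ', '.join),
              Revenue=('Revenue', 'first')))
print(out)
```

Within each group the original row order is preserved, so A ends up as "Voice, SMS" exactly as in the desired output.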
A nice idea from the comments, thanks Goyo — group on every column that should stay constant and join only Use_Case:
df = df.groupby(['Name', 'Sid', 'Revenue'])['Use_Case'].apply(', '.join).reset_index()
# change column order
print(df[['Name', 'Sid', 'Use_Case', 'Revenue']])
Name Sid Use_Case Revenue
0 A xx01 Voice, SMS $10.00
1 B xx02 Voice $5.00
2 C xx03 Voice, SMS, Video $15.00
You can groupby and apply the list function:
>>> df['Use_Case'].groupby([df.Name, df.Sid, df.Revenue]).apply(list).reset_index()
Name Sid Revenue 0
0 A xx01 $10.00 [Voice, SMS]
1 B xx02 $5.00 [Voice]
2 C xx03 $15.00 [Voice, SMS, Video]
(In case you are concerned about duplicates, use set instead of list.)
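Note that set loses the original order. If you want to drop duplicates while keeping first appearance, pd.unique works as the applied function — a small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B'],
                   'Use_Case': ['Voice', 'SMS', 'Voice', 'Voice']})

# pd.unique drops duplicates while preserving first-seen order, unlike set
out = df.groupby('Name')['Use_Case'].apply(lambda s: ', '.join(pd.unique(s)))
print(out)
```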
I was using some code that I didn’t think was optimal and eventually found jezrael’s answer. But after using it and running a timeit test, I actually went back to what I was doing, which was:
cmnts = {}
for i, row in df.iterrows():
    while True:
        try:
            if row['Use_Case']:
                cmnts[row['Name']].append(row['Use_Case'])
            else:
                cmnts[row['Name']].append('n/a')
            break
        except KeyError:
            cmnts[row['Name']] = []

df.drop_duplicates('Name', inplace=True)
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]
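The while/try/except dance exists only to create the list the first time a key is seen; collections.defaultdict does that automatically. A sketch of the same idea on mock data:

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B'],
                   'Use_Case': ['Voice', 'SMS', 'Voice']})

# defaultdict(list) creates the empty list on first access,
# so no KeyError handling or retry loop is needed
cmnts = defaultdict(list)
for _, row in df.iterrows():
    cmnts[row['Name']].append(row['Use_Case'] or 'n/a')

df = df.drop_duplicates('Name').copy()
df['Use_Case'] = ['; '.join(v) for v in cmnts.values()]
print(df)
```

Like the original, this relies on dict insertion order matching the order drop_duplicates keeps (guaranteed on Python 3.7+).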
According to my 100-run timeit test, the iterate-and-replace method is an order of magnitude faster than the groupby method.
import pandas as pd
from my_stuff import time_something

df = pd.DataFrame({'a': [i / (i % 4 + 1) for i in range(1, 10001)],
                   'b': [i for i in range(1, 10001)]})

runs = 100

interim_dict = ('txt = {}\n'
                'for i, row in df.iterrows():\n'
                '    try:\n'
                "        txt[row['a']].append(row['b'])\n\n"
                '    except KeyError:\n'
                "        txt[row['a']] = []\n"
                "df.drop_duplicates('a', inplace=True)\n"
                "df['b'] = ['; '.join(v) for v in txt.values()]")

grouping = "new_df = df.groupby('a')['b'].apply(str).apply('; '.join).reset_index()"

print(time_something(interim_dict, runs, beg_string='Interim Dict', glbls=globals()))
print(time_something(grouping, runs, beg_string='Group By', glbls=globals()))
yields:
Interim Dict
Total: 59.1164s
Avg: 591163748.5887ns
Group By
Total: 430.6203s
Avg: 4306203366.1827ns
where time_something is a function that times a snippet with timeit and returns the result in the above format.
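my_stuff is the answerer’s own helper module, so the benchmark is not runnable as-is. A rough stand-in built on the standard library’s timeit, matching the call signature used above (the exact report format is a guess):

```python
import timeit

def time_something(snippet, runs, beg_string='', glbls=None):
    # Run `snippet` `runs` times under timeit and report the total in
    # seconds plus the average per run in nanoseconds
    total = timeit.timeit(snippet, number=runs, globals=glbls)
    avg_ns = total / runs * 1e9
    return '{}\nTotal: {:.4f}s\nAvg: {:.4f}ns'.format(beg_string, total, avg_ns)

print(time_something('sum(range(100))', 1000, beg_string='Sum'))
```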
Following the answers by @jezrael and @leoschet, I would like to provide a more general example for the case where the dataframe has many more columns, something I had to do recently.
Specifically, my dataframe had a total of 184 columns.
The column REF is the one that should be used as the reference for the groupby. Of the remaining 182, only one column, called IDS, differed between the near-duplicate rows, and I wanted to collapse its elements into a list: id1, id2, id3, …
So:
# Build {column_name: 'first'} for every column, except ', '.join for IDS
# Skip the REF column itself (it comes back via reset_index)
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join for c in my_df.columns.tolist() if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()
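On a small mock frame (the real one had 184 columns; the names here are invented), the dictionary comprehension expands like this:

```python
import pandas as pd

my_df = pd.DataFrame({'REF': ['r1', 'r1', 'r2'],
                      'IDS': ['id1', 'id2', 'id3'],
                      'other': [7, 7, 9]})

# 'first' for every ordinary column, ', '.join only for IDS;
# REF is excluded because it becomes the group key
columns_collapse = {c: 'first' if c != 'IDS' else ', '.join
                    for c in my_df.columns if c != 'REF'}
my_df = my_df.groupby('REF').agg(columns_collapse).reset_index()
print(my_df)
```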
I hope this is also useful to someone!
Regards!