Replace and merge rows in pandas according to condition
Question:
I have a dataframe:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
And a reference dictionary:
replacement_dict = {
'X1' : ['x1', 'x2', 'x3', 'x4'],
'Y1' : ['y1', 'y2'],
'Z1' : ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of the values in replacement_dict['X1'] with ‘X1’, and then merge the rows together. For example, any instance of ‘x1’, ‘x2’, ‘x3’ or ‘x4’ will be replaced by ‘X1’, etc.
I can do this by selecting the rows that contain any of these strings and replacing them with ‘X1’:
keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
lft rel rgt num
0 t3 r3 Z1 3
1 t1 r3 X1 9
2 X1 r3 t2 8
3 X1 r1 t2 4
4 t1 r1 Z1 1
5 X1 r1 t2 2
6 X1 r2 t4 4
7 Z1 r2 t4 5
8 t4 r3 X1 4
9 Z1 r2 t3 4
Now, if I select all the rows containing ‘X1’ and merge them, I should end up with:
lft rel rgt num
0 X1 r3 t2 8
1 X1 r1 t2 6
2 X1 r2 t4 4
3 t1 r3 X1 9
4 t4 r3 X1 4
So the three columns [‘lft’, ‘rel’, ‘rgt’] are unique while the ‘num’ column is added up for each of these rows. The row 1 above : [‘X1’ ‘r1’ ‘t2’ 6] is the sum of two rows [‘X1’ ‘r1’ ‘t2’ 4] and [‘X1’ ‘r1’ ‘t2’ 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. A simple row-wise extraction and replacement is taking forever.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
Answers:
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
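For anyone who wants to run this end to end, here is a minimal sketch on the sample data from the question. The DataFrame construction and the name merged are my additions for illustration; as_index=False is only there to keep lft, rel and rgt as ordinary columns instead of a MultiIndex.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "lft": ["t3", "t1", "x2", "x4", "t1", "x1", "x2", "z3", "t4", "z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["z2", "x1", "t2", "t2", "z3", "t2", "t4", "t4", "x3", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    "X1": ["x1", "x2", "x3", "x4"],
    "Y1": ["y1", "y2"],
    "Z1": ["z1", "z2", "z3"],
}

# Flip the dict: each listed value points back to its key
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}

cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)

# One row per unique (lft, rel, rgt), with num summed within each group
merged = df.groupby(cols, as_index=False)["num"].sum()
print(merged)
```

Note that this merges every group, not only the rows containing 'X1'; filter before grouping if you only want those.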
Pandas has a built-in replace() function that is faster than going through the whole dataframe with .loc.
You can also pass it a list of values to replace, which makes your dictionary a good fit for it:
keys = replacement_dict.keys()
# Loop through every value in our dictionary and get the replacements
for key in keys:
    DF = DF.replace(to_replace=replacement_dict[key], value=key)
Here’s a way to do what your question asks:
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
.groupby(['lft','rel','rgt']).sum().reset_index() )
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Explanation:
- replace() uses a reversed version of the dictionary to replace items from lists in the original dict with the corresponding keys in the relevant df columns lft and rgt
- after filtering for rows with 'X1' found in either lft or rgt, use groupby(), sum() and reset_index() to sum the num column for unique lft, rel, rgt group keys and restore the group components from index levels to columns.
As an alternative, we can use query() to select only rows containing 'X1':
df[['lft','rgt']] = ( df[['lft','rgt']]
.replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
.groupby(['lft','rel','rgt']).sum().reset_index() )
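If the key to filter on is not always 'X1', query() can reference a Python variable with the @ prefix. Below is a minimal sketch, assuming the replacement step has already been applied; the helper name select_and_merge and its target parameter are hypothetical, and the input is the "after replacement" table from the question.

```python
import pandas as pd

def select_and_merge(df, target):
    """Keep rows where lft or rgt equals `target`, then sum num per group."""
    # @target refers to the local Python variable of that name
    selected = df.query("lft == @target or rgt == @target")
    return selected.groupby(["lft", "rel", "rgt"], as_index=False)["num"].sum()

# Data as it looks after the replacement step (copied from the question)
replaced = pd.DataFrame({
    "lft": ["t3", "t1", "X1", "X1", "t1", "X1", "X1", "Z1", "t4", "Z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["Z1", "X1", "t2", "t2", "Z1", "t2", "t4", "t4", "X1", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})

out = select_and_merge(replaced, "X1")
print(out)
```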
Reverse the replacement_dict mapping and map() this new mapping to each of lft and rgt columns to substitute certain values (e.g. x1->X1, y2->Y1 etc.). As some values in lft and rgt columns don’t exist in the mapping (e.g. t1, t2 etc.), call fillna() to fill in these values. [1]
You may also stack() the columns whose values need to be replaced (lft and rgt), call map+fillna and unstack() back, but because there are only 2 columns, it may not be worth the trouble for this particular case.
The second part of the question may be answered by summing num values after grouping by lft, rel and rgt columns; so groupby().sum() should do the trick.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# substitute values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# substitute values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
result = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
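For completeness, the stack()/unstack() variant mentioned above could look roughly like the sketch below. It is only an illustration on the question's sample data; the names stacked and replaced are mine, and the explicit column selection after unstack() is just a precaution to keep the column order fixed.

```python
import pandas as pd

df = pd.DataFrame({
    "lft": ["t3", "t1", "x2", "x4", "t1", "x1", "x2", "z3", "t4", "z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["z2", "x1", "t2", "t2", "z3", "t2", "t4", "t4", "x3", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    "X1": ["x1", "x2", "x3", "x4"],
    "Y1": ["y1", "y2"],
    "Z1": ["z1", "z2", "z3"],
}
reverse_map = {v: k for k, li in replacement_dict.items() for v in li}

target_cols = ["lft", "rgt"]
# Stack both columns into one long Series indexed by (row label, column name)
stacked = df[target_cols].stack()
# Map once over the long Series; unmapped values come back as NaN,
# so fill them with the original values
replaced = stacked.map(reverse_map).fillna(stacked)
# Unstack back into two columns and write them into the frame
df[target_cols] = replaced.unstack()[target_cols]

result = df.groupby(["lft", "rel", "rgt"], as_index=False)["num"].sum()
print(result)
```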
1: map() + fillna() may perform better for your use case than replace() because under the hood, map() implements a Cython optimized take_nd() method that performs particularly well if there are a lot of values to replace, while replace() implements a replace_list() method which uses a Python loop. So if replacement_dict is particularly large (which it is in your case), the difference in performance will be huge, but if replacement_dict is small, replace() may outperform map().
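If you want to check this on your own machine, a rough timing sketch follows. Everything in it is synthetic and assumed for illustration (the key names, the sizes, the random seed), and the sizes are far smaller than the 6 million rows and 60,000 keys in the question, so treat the numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_keys, n_rows = 500, 20_000  # deliberately small; scale up to taste

# Synthetic dictionary: each key K<i> owns three values k<i>_0 .. k<i>_2
replacement_dict = {
    f"K{i}": [f"k{i}_{j}" for j in range(3)] for i in range(n_keys)
}
reverse_map = {v: k for k, values in replacement_dict.items() for v in values}

# Column mixing values that are in the map with values that are not
pool = np.array(list(reverse_map) + [f"t{i}" for i in range(100)])
s = pd.Series(rng.choice(pool, size=n_rows))

start = time.perf_counter()
via_map = s.map(reverse_map).fillna(s)
map_seconds = time.perf_counter() - start

start = time.perf_counter()
via_replace = s.replace(reverse_map)
replace_seconds = time.perf_counter() - start

print(f"map + fillna: {map_seconds:.4f} s")
print(f"replace:      {replace_seconds:.4f} s")
```

Both approaches should produce the same values; only the running time differs, and it will vary with the pandas version and the hardware.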
Try this; I commented the steps:
# reverse dict to dissolve the lists as values
reversed_dict = {v: k for k, val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where X1 is anywhere in the columns
df_filtered = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Lots of great answers. I avoid the need for the dict and use apply() on the lft column like this to generate new data.
import io
import pandas as pd
# create the data
x = '''
lft rel rgt num
t3 r3 z2 3
t1 r3 x1 9
x2 r3 t2 8
x4 r1 t2 4
t1 r1 z3 1
x1 r1 t2 2
x2 r2 t4 4
z3 r2 t4 5
t4 r3 x3 4
z1 r2 t3 4
'''
data = io.StringIO(x)
df = pd.read_csv(data, sep=' ')
print(df)
replacement_dict = {
'X1' : ['x1', 'x2', 'x3', 'x4'],
'Y1' : ['y1', 'y2'],
'Z1' : ['z1', 'z2', 'z3']
}
def replace(x):
    # which key to check
    key_check = x[0] + '1'
    key_check = key_check.upper()
    return key_check
df['new'] = df['lft'].apply(replace)
df
This returns:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
lft rel rgt num new
0 t3 r3 z2 3 T1
1 t1 r3 x1 9 T1
2 x2 r3 t2 8 X1
3 x4 r1 t2 4 X1
4 t1 r1 z3 1 T1
5 x1 r1 t2 2 X1
6 x2 r2 t4 4 X1
7 z3 r2 t4 5 Z1
8 t4 r3 x3 4 T1
9 z1 r2 t3 4 Z1
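Note that the prefix rule also turns t3 into T1, which is not a key in replacement_dict. Here is a minimal sketch of a vectorized variant that leaves such values alone. It assumes every value listed in the dict follows the "first letter, upper-cased, plus 1" naming pattern, which holds for the sample data but may not hold for real data; the names candidate and known are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "lft": ["t3", "t1", "x2", "x4", "t1", "x1", "x2", "z3", "t4", "z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["z2", "x1", "t2", "t2", "z3", "t2", "t4", "t4", "x3", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    "X1": ["x1", "x2", "x3", "x4"],
    "Y1": ["y1", "y2"],
    "Z1": ["z1", "z2", "z3"],
}

# Candidate key built from the first character, e.g. 'x2' -> 'X1'
candidate = df["lft"].str[0].str.upper() + "1"

# Every value that appears in some list of the dict
known = {v for values in replacement_dict.values() for v in values}

# Use the candidate only for values the dict lists; keep the rest unchanged
df["new"] = candidate.where(df["lft"].isin(known), df["lft"])
print(df)
```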