Replace and merge rows in pandas according to condition

Question:

I have a dataframe:

   lft rel rgt num
0   t3  r3  z2  3
1   t1  r3  x1  9
2   x2  r3  t2  8
3   x4  r1  t2  4
4   t1  r1  z3  1
5   x1  r1  t2  2
6   x2  r2  t4  4
7   z3  r2  t4  5
8   t4  r3  x3  4
9   z1  r2  t3  4

And a reference dictionary:

replacement_dict = {
    'X1' : ['x1', 'x2', 'x3', 'x4'],
    'Y1' : ['y1', 'y2'],
    'Z1' : ['z1', 'z2', 'z3']
}

My goal is to replace all occurrences of replacement_dict['X1'] with ‘X1’, and then merge the rows together. For example, any instance of ‘x1’, ‘x2’, ‘x3’ or ‘x4’ will be replaced by ‘X1’, etc.

I can do this by selecting the rows that contain any of these strings and replacing them with ‘X1’:

keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key

giving:

    lft rel rgt num
0   t3  r3  Z1  3
1   t1  r3  X1  9
2   X1  r3  t2  8
3   X1  r1  t2  4
4   t1  r1  Z1  1
5   X1  r1  t2  2
6   X1  r2  t4  4
7   Z1  r2  t4  5
8   t4  r3  X1  4
9   Z1  r2  t3  4

Now, if I select all the rows containing ‘X1’ and merge them, I should end up with:

    lft rel rgt num
0   X1  r3  t2  8
1   X1  r1  t2  6
2   X1  r2  t4  4
3   t1  r3  X1  9
4   t4  r3  X1  4

So the three columns [‘lft’, ‘rel’, ‘rgt’] are unique while the ‘num’ column is added up for each of these rows. Row 1 above, [‘X1’ ‘r1’ ‘t2’ 6], is the sum of the two rows [‘X1’ ‘r1’ ‘t2’ 4] and [‘X1’ ‘r1’ ‘t2’ 2].

I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys. This is taking forever using a simple row-wise extraction and replacement.

How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?

Asked By: vineeth venugopal


Answers:

If you flip the keys and values of your replacement_dict, things become a lot easier:

new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}

cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
Answered By: Code Different

Pandas has a built-in replace function that is faster than going through the whole dataframe with .loc.

You can also pass it a list, which makes our dictionary a good fit for it.

keys = replacement_dict.keys()

# Loop through every value in our dictionary and get the replacements

for key in keys:
  DF = DF.replace(to_replace=replacement_dict[key], value=key)
Answered By: Jimpsoni

Here’s a way to do what your question asks:

df[['lft','rgt']] = ( df[['lft','rgt']]
    .replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df[(df.lft == 'X1') | (df.rgt == 'X1')]
    .groupby(['lft','rel','rgt']).sum().reset_index() )

Output:

  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4

Explanation:

  • replace() uses a reversed version of the dictionary to replace items from lists in the original dict with the corresponding keys in the relevant df columns lft and rgt
  • after filtering for rows with 'X1' found in either lft or rgt, use groupby(), sum() and reset_index() to sum the num column for unique lft, rel, rgt group keys and restore the group components from index levels to columns.

As an alternative, we can use query() to select only rows containing 'X1':

df[['lft','rgt']] = ( df[['lft','rgt']]
    .replace({it:k for k, v in replacement_dict.items() for it in v}) )
df = ( df.query("lft=='X1' or rgt=='X1'")
    .groupby(['lft','rel','rgt']).sum().reset_index() )
Answered By: constantstranger

Reverse the replacement_dict mapping and map() this new mapping to each of the lft and rgt columns to substitute certain values (e.g. x1 -> X1, y2 -> Y1 etc.). As some values in the lft and rgt columns don’t exist in the mapping (e.g. t1, t2 etc.), call fillna() to fill in these values [1].

You may also stack() the columns whose values need to be replaced (lft and rgt), call map+fillna and unstack() back but because there are only 2 columns, it may not be worth the trouble for this particular case.
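The stack() variant mentioned above has no code in this answer, so here is a minimal sketch of how it might look. The sample data is copied from the question; the names stacked and mapped are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "lft": ["t3", "t1", "x2", "x4", "t1", "x1", "x2", "z3", "t4", "z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["z2", "x1", "t2", "t2", "z3", "t2", "t4", "t4", "x3", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})
replacement_dict = {
    "X1": ["x1", "x2", "x3", "x4"],
    "Y1": ["y1", "y2"],
    "Z1": ["z1", "z2", "z3"],
}
reverse_map = {v: k for k, li in replacement_dict.items() for v in li}

cols = ["lft", "rgt"]
# one long Series indexed by (row label, column name)
stacked = df[cols].stack()
# values missing from reverse_map become NaN and are filled with the original label
mapped = stacked.map(reverse_map).fillna(stacked)
# back to one column per original column; select cols explicitly to fix the order
df[cols] = mapped.unstack()[cols]

result = df.groupby(["lft", "rel", "rgt"], as_index=False)["num"].sum()
print(result)
```

With only two columns this saves one map() call at the cost of a reshape, so it is probably only worth considering when many columns share the same mapping.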

The second part of the question may be answered by summing num values after grouping by lft, rel and rgt columns; so groupby().sum() should do the trick.

# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}

# substitute values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# substitute values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])

# sum values in num column by groups
result = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()

[1]: map() + fillna() may perform better for your use case than replace() because, under the hood, map() uses a Cython-optimized take_nd() method that performs particularly well when there are a lot of values to replace, while replace() uses a replace_list() method that runs a Python loop. So if replacement_dict is particularly large (which it is in your case), the difference in performance will be huge, but if replacement_dict is small, replace() may outperform map().
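The performance note above is worth checking on your own setup rather than taking on trust. The following is a rough sketch on synthetic data: the sizes (n_keys, n_rows) and the label scheme are arbitrary assumptions, not the asker's data. It first confirms that both approaches produce the same values and then times each one.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data, sized arbitrarily for illustration
n_keys, n_rows = 200, 5_000
replacement_dict = {f"K{i}": [f"k{i}_{j}" for j in range(3)] for i in range(n_keys)}
reverse_map = {v: k for k, vals in replacement_dict.items() for v in vals}

rng = np.random.default_rng(0)
unmapped = [f"t{i}" for i in range(100)]  # labels that are not in the mapping
pool = np.array(list(reverse_map) + unmapped)
s = pd.Series(rng.choice(pool, size=n_rows))

via_map = s.map(reverse_map).fillna(s)
via_replace = s.replace(reverse_map)

# both approaches should yield the same values
same = via_map.tolist() == via_replace.tolist()
print("identical:", same)

# timings depend on the pandas version, machine and data, so none are quoted here
print("map+fillna:", timeit.timeit(lambda: s.map(reverse_map).fillna(s), number=3))
print("replace:   ", timeit.timeit(lambda: s.replace(reverse_map), number=3))
```

Increasing n_keys toward the 60,000 keys mentioned in the question should show how each method scales with the size of the mapping.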

Try this, I commented the steps

#reverse dict to dissolve the lists as values
reversed_dict = {v:k for k,val in replacement_dict.items() for v in val}

# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)

# filter rows where X1 is anywhere in the columns
df_filtered = df[df.eq('X1').any(axis=1)]

# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)

Output:

  lft rel rgt  num
0  X1  r1  t2    6
1  X1  r2  t4    4
2  X1  r3  t2    8
3  t1  r3  X1    9
4  t4  r3  X1    4
Answered By: Rabinzel

Lots of great answers. I avoid the need for the dict and use apply() on the column like this to generate new data.

import io
import pandas as pd


# create the data
x = '''
lft rel rgt num
t3 r3 z2 3
t1 r3 x1 9
x2 r3 t2 8
x4 r1 t2 4
t1 r1 z3 1
x1 r1 t2 2
x2 r2 t4 4
z3 r2 t4 5
t4 r3 x3 4
z1 r2 t3 4
'''


data = io.StringIO(x)
df = pd.read_csv(data, sep=' ')
print(df)

replacement_dict = {
    'X1' : ['x1', 'x2', 'x3', 'x4'],
    'Y1' : ['y1', 'y2'],
    'Z1' : ['z1', 'z2', 'z3']
}


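def replace(x):
    # derive the key from the first character, e.g. 'x2' -> 'X1'
    # note: labels that are not in replacement_dict are mapped too, e.g. 't3' -> 'T1'
    key_check = x[0] + '1'
    key_check = key_check.upper()

    return key_check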


df['new'] = df['lft'].apply(replace)
df

This returns:

  lft rel rgt  num
0  t3  r3  z2    3
1  t1  r3  x1    9
2  x2  r3  t2    8
3  x4  r1  t2    4
4  t1  r1  z3    1
5  x1  r1  t2    2
6  x2  r2  t4    4
7  z3  r2  t4    5
8  t4  r3  x3    4
9  z1  r2  t3    4
  lft rel rgt  num new
0  t3  r3  z2    3  T1
1  t1  r3  x1    9  T1
2  x2  r3  t2    8  X1
3  x4  r1  t2    4  X1
4  t1  r1  z3    1  T1
5  x1  r1  t2    2  X1
6  x2  r2  t4    4  X1
7  z3  r2  t4    5  Z1
8  t4  r3  x3    4  T1
9  z1  r2  t3    4  Z1
Answered By: D.L
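The apply() approach above stops at a new column for lft and also relabels values such as t3 that are not in the dictionary. Below is a hedged sketch of how the same idea might be carried through to the requested result. It assumes the naming convention really holds (first letter upper-cased plus '1'), which the question does not guarantee, and the names known_keys and candidate are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "lft": ["t3", "t1", "x2", "x4", "t1", "x1", "x2", "z3", "t4", "z1"],
    "rel": ["r3", "r3", "r3", "r1", "r1", "r1", "r2", "r2", "r3", "r2"],
    "rgt": ["z2", "x1", "t2", "t2", "z3", "t2", "t4", "t4", "x3", "t3"],
    "num": [3, 9, 8, 4, 1, 2, 4, 5, 4, 4],
})

# keys of the question's replacement_dict; only these are allowed as replacements
known_keys = {"X1", "Y1", "Z1"}

for col in ["lft", "rgt"]:
    # vectorized form of the per-row function: 'x4' -> 'X1', 't3' -> 'T1'
    candidate = df[col].str[0].str.upper() + "1"
    # keep the original label unless the derived key is a known one
    df[col] = candidate.where(candidate.isin(list(known_keys)), df[col])

result = df.groupby(["lft", "rel", "rgt"], as_index=False)["num"].sum()
print(result)
```

If the labels do not follow such a convention, the reversed-dictionary approaches in the other answers are the safer choice.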