change iterrows() to .loc for large dataframes

Question:

I have 2 data frames, df1 and df2.

For every row in df1 where day_of_week == 7, I need to find the rows in df2 that have the same statWeek and statMonth, and replace as_cost_perf in those df2 rows with cost_eu from the df1 row. Everywhere else, as_cost_perf should stay as it is.

Below is my code, which uses iterrows().

With a huge DataFrame this will be very slow. Can someone please help me optimize this snippet?

import pandas as pd

# create df1
data1 = {'day_of_week': [7, 7, 6],
         'statWeek': [1, 2, 3],
         'statMonth': [1, 1, 1],
         'cost_eu': [957940.0, 942553.0, 1177088.0]}
df1 = pd.DataFrame(data1)

# create df2
data2 = {'statWeek': [1, 2, 3, 4, 1, 2, 3],
         'statMonth': [1, 1, 1, 1, 2, 2, 2],
         'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0, 124660.0, 124660.0, 763660.0]}
df2 = pd.DataFrame(data2)

# identify rows in df1 where day_of_week == 7
mask = df1['day_of_week'] == 7

# update df2 with cost_eu from df1 where there is a match
for i, row in df1[mask].iterrows():
    matching_rows = df2[(df2['statWeek'] == row['statWeek']) & (df2['statMonth'] == row['statMonth'])]
    if not matching_rows.empty:
        df2.loc[matching_rows.index, 'as_cost_perf'] = row['cost_eu']

# print the updated df2
df2

Thanks in advance!

Asked By: sdave

Answers:

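You can use merge or update, but first you need to filter df1, since you only care about rows where day_of_week == 7. df1.loc[df1['day_of_week'].eq(7), 'statWeek':] keeps those rows and, through the label slice 'statWeek':, the columns from statWeek onward (statWeek, statMonth, cost_eu).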

merge

df2.merge(df1.loc[df1['day_of_week'].eq(7), 'statWeek':],
          on=['statWeek', 'statMonth'], how='left')

   statWeek  statMonth  as_cost_perf   cost_eu
0         1          1      344560.0  957940.0
1         2          1      334580.0  942553.0
2         3          1      334523.0       NaN
3         4          1      556760.0       NaN
4         1          2      124660.0       NaN
5         2          2      124660.0       NaN
6         3          2      763660.0       NaN
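
The merged frame above still carries both columns, so one more step is needed to get the result the question asks for. A minimal sketch of that step, assuming the sample data from the question; the names keys, merged and result are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 2, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [957940.0, 942553.0, 1177088.0]})
df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

keys = ['statWeek', 'statMonth']

# Left-merge only the day-7 rows, selecting the needed columns by name
merged = df2.merge(df1.loc[df1['day_of_week'].eq(7), keys + ['cost_eu']],
                   on=keys, how='left')

# Take cost_eu where a match exists, otherwise keep the old as_cost_perf.
# Assumption: cost_eu is never NaN in df1; a NaN there would be treated
# as "no match" here, unlike in the original loop.
merged['as_cost_perf'] = merged['cost_eu'].fillna(merged['as_cost_perf'])
result = merged.drop(columns='cost_eu')
print(result)
```

Unlike update, this leaves df2 untouched and returns a new frame.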

update

# we need to set the index if we use update
df2 = df2.set_index(['statWeek', 'statMonth'])
# we set the index for df1.loc[...] and rename the cost_eu column to match df2
df2.update(df1.loc[df1['day_of_week'].eq(7), 'statWeek':]
           .set_index(['statWeek', 'statMonth']).rename(columns={'cost_eu': 'as_cost_perf'}))

print(df2.reset_index())

   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
Answered By: It_is_Chris

Instead of a for loop, you can use df.merge followed by a single reassignment:

mask = df1['day_of_week'] == 7
df2 = df2.merge(df1[mask], on=['statWeek', 'statMonth'], how='left')
matched = ~df2['cost_eu'].isna()
df2.loc[matched, 'as_cost_perf'] = df2.loc[matched, 'cost_eu']
df2.drop(['day_of_week', 'cost_eu'], axis=1, inplace=True)

   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
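
One thing to watch with a merge-based approach: if df1 has several day_of_week == 7 rows for the same (statWeek, statMonth), a left merge duplicates the matching df2 rows, whereas the original loop just lets the last row win. A rough sketch of guarding against that with merge's validate argument; the data and the names day7 and deduped are made up for illustration:

```python
import pandas as pd
from pandas.errors import MergeError

keys = ['statWeek', 'statMonth']

# Hypothetical df1 with two day-7 rows sharing the key (1, 1)
df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 1, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [100.0, 200.0, 300.0]})
df2 = pd.DataFrame({'statWeek': [1, 2],
                    'statMonth': [1, 1],
                    'as_cost_perf': [10.0, 20.0]})

day7 = df1.loc[df1['day_of_week'].eq(7), keys + ['cost_eu']]

# validate='many_to_one' raises if the keys are not unique on the right side
try:
    df2.merge(day7, on=keys, how='left', validate='many_to_one')
    raised = False
except MergeError:
    raised = True

# Keeping the last duplicate mirrors the loop's last-write-wins behaviour
deduped = day7.drop_duplicates(keys, keep='last')
merged = df2.merge(deduped, on=keys, how='left', validate='many_to_one')
```

Whether keeping the last duplicate is the right rule depends on the data; the point is only that the merge then returns exactly one row per df2 row.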
Answered By: RomanPerekhrest

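You can reshape df1 to look like df2 and concatenate the two, then drop duplicates. upd goes first in the concat so that drop_duplicates, which keeps the first occurrence by default, keeps the updated rows: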

upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')
out = pd.concat([upd, df2], axis=0).drop_duplicates(['statWeek', 'statMonth'])

To avoid drop_duplicates, you can instead remove the matching rows from df2 before concatenating:

upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')

cols = ['statWeek', 'statMonth']
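# compare (statWeek, statMonth) pairs by value; DataFrame.isin(DataFrame) would also require matching index labels
m = ~df2.set_index(cols).index.isin(upd.set_index(cols).index)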
out = pd.concat([upd, df2.loc[m]], axis=0)

Output:

>>> out
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0

>>> upd
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
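
A note on selecting the rows of df2 whose keys appear in upd: when DataFrame.isin is given another DataFrame, it compares cells only where both the index and the column labels line up, so it is not a row-wise key lookup. The small sketch below, using a hypothetical upd whose index labels (5 and 6) do not match the positions in df2, shows the difference compared with testing membership through a MultiIndex:

```python
import pandas as pd

cols = ['statWeek', 'statMonth']

df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

# Hypothetical update frame: same keys as in the thread, but index labels 5 and 6
upd = pd.DataFrame({'statWeek': [1, 2],
                    'statMonth': [1, 1],
                    'as_cost_perf': [957940.0, 942553.0]},
                   index=[5, 6])

# Index-aligned comparison: row 5 of df2 is compared with row 5 of upd, and so on
aligned = df2[cols].isin(upd[cols]).all(axis=1)

# Key-based comparison: is each (statWeek, statMonth) pair present anywhere in upd?
by_key = df2.set_index(cols).index.isin(upd.set_index(cols).index)

print(aligned.tolist())
print(by_key.tolist())
```

With the sample data in the question the two agree only because the filtered df1 rows happen to carry the index labels 0 and 1.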
Answered By: Corralien