change iterrows() to .loc for large dataframes
Question:
I have two data frames, df1 and df2. For the rows in df1 where day_of_week == 7, I need to match two other columns (statWeek and statMonth) against df2. Where both match, as_cost_perf in df2 should be replaced with cost_eu from df1; everywhere else as_cost_perf stays as it is.
Below is my code using iterrows(). On a huge dataframe this will be very slow. Can someone please help me optimize this snippet?
import pandas as pd

# create df1
data1 = {'day_of_week': [7, 7, 6],
         'statWeek': [1, 2, 3],
         'statMonth': [1, 1, 1],
         'cost_eu': [957940.0, 942553.0, 1177088.0]}
df1 = pd.DataFrame(data1)

# create df2
data2 = {'statWeek': [1, 2, 3, 4, 1, 2, 3],
         'statMonth': [1, 1, 1, 1, 2, 2, 2],
         'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0, 124660.0, 124660.0, 763660.0]}
df2 = pd.DataFrame(data2)

# identify rows in df1 where day_of_week == 7
mask = df1['day_of_week'] == 7

# update df2 with cost_eu from df1 where there is a match
for i, row in df1[mask].iterrows():
    matching_rows = df2[(df2['statWeek'] == row['statWeek']) & (df2['statMonth'] == row['statMonth'])]
    if not matching_rows.empty:
        df2.loc[matching_rows.index, 'as_cost_perf'] = row['cost_eu']

# print the updated df2
print(df2)
Thanks in advance!
Answers:
You can use merge or update, but first you need to filter df1, since you only care about rows where day_of_week == 7. The expression df1.loc[df1['day_of_week'].eq(7), 'statWeek':] keeps those rows and selects the columns from statWeek onward.
merge
df2.merge(df1.loc[df1['day_of_week'].eq(7), 'statWeek':],
          on=['statWeek', 'statMonth'], how='left')
   statWeek  statMonth  as_cost_perf   cost_eu
0         1          1      344560.0  957940.0
1         2          1      334580.0  942553.0
2         3          1      334523.0       NaN
3         4          1      556760.0       NaN
4         1          2      124660.0       NaN
5         2          2      124660.0       NaN
6         3          2      763660.0       NaN
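The merge above only places cost_eu next to as_cost_perf; it does not yet overwrite anything. One way to finish the job, sketched here on the question's sample data, is to prefer cost_eu where a match was found and fall back to the existing value otherwise. The names updates, merged and result are illustrative and not from the original answer. Note one assumption: a matched row whose cost_eu is itself NaN would keep its old as_cost_perf here, whereas the original loop would assign NaN.

```python
import pandas as pd

df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 2, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [957940.0, 942553.0, 1177088.0]})
df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

# Keep only the day_of_week == 7 rows and the columns needed for the join.
updates = df1.loc[df1['day_of_week'].eq(7), ['statWeek', 'statMonth', 'cost_eu']]

# Left join so every row of df2 survives; unmatched rows get NaN in cost_eu.
merged = df2.merge(updates, on=['statWeek', 'statMonth'], how='left')

# Prefer cost_eu where a match was found, otherwise keep as_cost_perf.
merged['as_cost_perf'] = merged['cost_eu'].fillna(merged['as_cost_perf'])
result = merged.drop(columns='cost_eu')
print(result)
```

Because the left join keeps df2's row order, result lines up row for row with the original df2.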
update
# we need to set the index if we use update
df2 = df2.set_index(['statWeek', 'statMonth'])
# we set the index for df1.loc[...] and rename the cost_eu column to match df2
df2.update(df1.loc[df1['day_of_week'].eq(7), 'statWeek':]
           .set_index(['statWeek', 'statMonth']).rename(columns={'cost_eu': 'as_cost_perf'}))
print(df2.reset_index())
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
Instead of a for loop, you can use df.merge followed by a single assignment:
mask = df1['day_of_week'] == 7
df2 = df2.merge(df1[mask], on=['statWeek', 'statMonth'], how='left')
matched = ~df2['cost_eu'].isna()
df2.loc[matched, 'as_cost_perf'] = df2.loc[matched, 'cost_eu']
df2.drop(['day_of_week', 'cost_eu'], axis=1, inplace=True)
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
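One caveat with the merge-based approaches: if the filtered df1 contains the same (statWeek, statMonth) pair more than once, a left merge duplicates the matching df2 rows, whereas the original loop simply lets the last match win. A minimal sketch of guarding against that uses merge's validate argument; the dup_updates frame below is made-up data for illustration only.

```python
import pandas as pd

df2 = pd.DataFrame({'statWeek': [1, 2],
                    'statMonth': [1, 1],
                    'as_cost_perf': [344560.0, 334580.0]})

# Hypothetical update rows with a repeated (statWeek, statMonth) key.
dup_updates = pd.DataFrame({'statWeek': [1, 1],
                            'statMonth': [1, 1],
                            'cost_eu': [957940.0, 111.0]})

try:
    # many_to_one requires the keys to be unique in the right-hand frame.
    df2.merge(dup_updates, on=['statWeek', 'statMonth'],
              how='left', validate='many_to_one')
    duplicates_detected = False
except pd.errors.MergeError:
    duplicates_detected = True

print(duplicates_detected)
```

If duplicates are expected, deciding up front which row should win (for example with drop_duplicates on the key columns of the filtered df1) keeps the merge from changing the row count of df2.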
You can reformat df1 and concatenate it with df2, then drop duplicates:
upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')
out = pd.concat([upd, df2], axis=0).drop_duplicates(['statWeek', 'statMonth'])
To avoid drop_duplicates, you can simply remove the matching rows from df2 before concatenating:
upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')
cols = ['statWeek', 'statMonth']
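# Compare the key pairs themselves. DataFrame.isin with a DataFrame argument
# also aligns on the index, so it only works when matching rows share labels.
m = ~df2.set_index(cols).index.isin(upd.set_index(cols).index)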
out = pd.concat([upd, df2.loc[m]], axis=0)
Output:
>>> out
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
>>> upd
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
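Before swapping the loop for a vectorized version on real data, it can help to confirm that both give the same result on a small sample. The sketch below wraps the question's loop and the update-based approach in two functions and compares them; loop_version and update_version are hypothetical helper names, and both work on copies so the input frames are left untouched.

```python
import pandas as pd

KEYS = ['statWeek', 'statMonth']

def loop_version(df1, df2):
    """Row-by-row replacement, as in the question."""
    out = df2.copy()
    for _, row in df1[df1['day_of_week'] == 7].iterrows():
        match = (out['statWeek'] == row['statWeek']) & (out['statMonth'] == row['statMonth'])
        out.loc[match, 'as_cost_perf'] = row['cost_eu']
    return out

def update_version(df1, df2):
    """Index-aligned replacement with DataFrame.update."""
    out = df2.set_index(KEYS)  # new frame, df2 itself is not modified
    upd = (df1.loc[df1['day_of_week'].eq(7), KEYS + ['cost_eu']]
              .set_index(KEYS)
              .rename(columns={'cost_eu': 'as_cost_perf'}))
    out.update(upd)
    return out.reset_index()

df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 2, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [957940.0, 942553.0, 1177088.0]})
df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

expected = loop_version(df1, df2)
actual = update_version(df1, df2)

# Raises AssertionError if the two frames differ.
pd.testing.assert_frame_equal(expected, actual)
```

The same harness can be pointed at any of the other approaches in this thread by adding another function with the same signature.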