How to compare two dataframes' columns and change the existing columns accordingly in Python
Question:
I have 2 dataframes df1, df2 as shown below and the required output is also df1 as shown in df1_output.
Here, only df1’s change_date columns need to be changed.
In my real use case, I have around 10 indicator columns to compare, but both df1 and df2 have only around 500 rows (small dataframes).
import pandas as pd
dict_1 = {'customer_id': [1,2,3,4,5,6],
'service_id_ind': ['n','y','n','y','n','y'],
'service_ind_change_date':['1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100'],
'nar_id_ind':['n','n','n','n','n','n'],
'nar_id_ind_change_date':['1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100']}
df1 = pd.DataFrame(dict_1, columns = ['customer_id','service_id_ind','service_ind_change_date','nar_id_ind','nar_id_ind_change_date'])
df1
dict_2 = {'customer_id': [1,2,3,4,5,6],
'service_id_ind': ['n','y','y','y','n','n'],
'nar_id_ind':['n','y','y','y','y','y']}
df2 = pd.DataFrame(dict_2, columns = ['customer_id','service_id_ind','nar_id_ind'])
df2
If an id_ind value for a customer_id in df2 differs from the value in df1, then the respective change_date entry in df1 for that customer_id should change to today_date.
dict_output = {'customer_id': [1,2,3,4,5,6],
'service_id_ind': ['n','y','n','y','n','y'],
'service_ind_change_date':['1/1/2100','1/1/2100','today_date','1/1/2100','1/1/2100','today_date'],
'nar_id_ind':['n','n','n','n','n','n'],
'nar_id_ind_change_date':['1/1/2100','today_date','today_date','today_date','today_date','today_date']}
df1_output = pd.DataFrame(dict_output, columns = ['customer_id','service_id_ind','service_ind_change_date','nar_id_ind','nar_id_ind_change_date'])
df1_output
Please suggest an optimized way to code this.
Answers:
EDIT after request for different number of rows:
import datetime as dt
df1 = df1.rename(columns={'service_ind_change_date': 'service_id_ind_change_date'}) # change column name to make logic automatic
check_cols = df1.columns.intersection(df2.columns).delete(0) # Index(['service_id_ind', 'nar_id_ind'], dtype='object')
keep_cols = df1.columns
df1 = df1.merge(df2.add_suffix('_2'), left_on=['customer_id'], right_on=['customer_id_2'], how='left')
for column in check_cols:
    # update only the rows where df2 has a value and it differs from df1's
    df1.loc[(df1[column] != df1[f'{column}_2']) & (~df1[f'{column}_2'].isna()), f'{column}_change_date'] = dt.datetime.strftime(dt.datetime.today(), "%d/%m/%Y")
df1 = df1[keep_cols] # cut back to df1's original columns
This should work if df2 has a different number of customers. It depends on customers having the same customer_id in both dataframes, obviously. In both cases (whichever dataframe has more customers), customers missing from either one won’t be updated.
The changes are:
keep_cols + df1.merge combine the two tables and then cut the result back to df1's columns at the end. I’m adding the '_2' suffix because I don’t like the _x and _y suffixes that are added automatically.
The condition now includes ~df1[f'{column}_2'].isna(), which keeps only the rows where that column is not NaN.
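To make the missing-customer behaviour concrete, here is a minimal sketch on made-up toy data (the three-row frames and the names merged, changed and result are my own, not from the question). It assumes the date column has already been renamed to the '<indicator>_change_date' form.

```python
import datetime as dt

import pandas as pd

# Toy data (hypothetical): df2 lacks customer 3 and has an extra customer 4.
df1 = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'service_id_ind': ['n', 'y', 'n'],
    'service_id_ind_change_date': ['1/1/2100'] * 3,
})
df2 = pd.DataFrame({
    'customer_id': [1, 2, 4],
    'service_id_ind': ['y', 'y', 'n'],
})

today = dt.datetime.today().strftime('%d/%m/%Y')
keep_cols = df1.columns

# A left merge keeps every df1 row; customers absent from df2 get NaN in the *_2 columns.
merged = df1.merge(df2.add_suffix('_2'), left_on='customer_id',
                   right_on='customer_id_2', how='left')

# Flag a row only when df2 actually has a value and that value differs.
changed = (merged['service_id_ind'] != merged['service_id_ind_2']) \
    & merged['service_id_ind_2'].notna()
merged.loc[changed, 'service_id_ind_change_date'] = today

result = merged[keep_cols]
print(result)
```

Only customer 1 gets today's date here: customer 2 is unchanged, customer 3 has no row in df2, and customer 4 never enters the result because of the left merge.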
EDIT after additional comments.
If the logic is to reset the date to today when the corresponding value changes, then this should be the clearest way forward.
import datetime as dt
df1 = df1.rename(columns={'service_ind_change_date': 'service_id_ind_change_date'}) # change column name to make logic automatic
check_cols = df1.columns.intersection(df2.columns).delete(0) # Index(['service_id_ind', 'nar_id_ind'], dtype='object')
for column in check_cols:
    # compares row by row, so df1 and df2 need the same index and row order
    df1.loc[df1[column] != df2[column], f'{column}_change_date'] = dt.datetime.strftime(dt.datetime.today(), "%d/%m/%Y")
.intersection gets the columns that appear in both dataframes, and .delete(0) then removes customer_id.
.loc selects only the rows in df1 where the df1 value is not the same as the df2 value, and then updates them with today's date. Of course, you can format the date however you want; this example follows the format used in the data.
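One caveat with the direct comparison is that it matches rows by index label, not by customer_id. If the two frames list the same customers in a different order, a rough way around it is to index both by customer_id first. This is only a sketch on made-up toy data; the names left, right and changed are my own.

```python
import datetime as dt

import pandas as pd

# Toy data (hypothetical): same customers, but df2's rows are in a different order.
df1 = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'nar_id_ind': ['n', 'n', 'n'],
    'nar_id_ind_change_date': ['1/1/2100'] * 3,
})
df2 = pd.DataFrame({
    'customer_id': [3, 1, 2],
    'nar_id_ind': ['y', 'n', 'n'],
})

today = dt.datetime.today().strftime('%d/%m/%Y')

# Index both frames by customer_id, then reorder df2 to follow df1.
left = df1.set_index('customer_id')
right = df2.set_index('customer_id').reindex(left.index)

# Both Series now share the same labels, so the comparison is per customer.
changed = left['nar_id_ind'] != right['nar_id_ind']
left.loc[changed, 'nar_id_ind_change_date'] = today

result = left.reset_index()
print(result)
```

Here only customer 3 is stamped. A purely positional comparison would have flagged customer 1 instead, because df2's first row belongs to customer 3. If df2 could be missing customers, reindex would introduce NaN for them and the notna check from the merge version would be needed as well.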
My understanding of the problem: update service_ind_change_date (and the other variables similarly) in df1 to today's date if the corresponding service_id_ind (and the other variables similarly) in df2 is 'y'.
This would probably be improved if you can guarantee that they have the same indices.
I chose to use np.where, which takes the form np.where(condition, value if true, value if false):
- It gets a list of the customer_ids in df2 where the id_ind is 'y': list(df2[df2.service_id_ind == 'y'].customer_id)
- Then it checks whether the customer_id in df1 is in this list: df1.customer_id.isin()
- If true, fill in todays_date
- If false, keep the current value, df1.service_ind_change_date
from datetime import date
import numpy as np
todays_date = date.today().strftime("%m/%d/%y")
# set the date where df2 has 'y' for that customer, otherwise keep the current date
df1['service_ind_change_date'] = np.where(df1.customer_id.isin(list(df2[df2.service_id_ind == 'y'].customer_id)), todays_date, df1.service_ind_change_date)
df1['service_id_ind'] = np.where(df1.service_ind_change_date == todays_date, 'y', 'n')
df1['nar_id_ind_change_date'] = np.where(df1.customer_id.isin(list(df2[df2.nar_id_ind == 'y'].customer_id)), todays_date, df1.nar_id_ind_change_date)
df1['nar_id_ind'] = np.where(df1.nar_id_ind_change_date == todays_date, 'y', 'n')
Update, following your request to change the date when the ind column changes, not when it is y or n:
If your column names are standard, you can do this without writing each one out.
Imagine they all take the form {var}_id_ind_change_date & {var}_id_ind, similar to nar_id_ind & nar_id_ind_change_date.
# make standard col names (np and todays_date come from the snippet above)
df1.rename(columns = {'service_ind_change_date': 'service_id_ind_change_date'}, inplace = True)
cols_to_use = list(df1.columns.difference(df2.columns))
cols_to_use.append('customer_id')
# after the merge, df2's indicator columns end in _x and df1's end in _y
updated_df = df2.merge(df1, on = 'customer_id')
cols_var = list(df1.columns.difference(df2.columns))
cols_ind = [i.replace('_change_date', '') for i in cols_var]
for i in np.arange(len(cols_var)):
    updated_df[f'{cols_var[i]}'] = np.where(updated_df[f'{cols_ind[i]}_x'] != updated_df[f'{cols_ind[i]}_y'], todays_date, updated_df[f'{cols_var[i]}'])
If you want to keep df1's ind values like you do in the example, drop the other ind column and rename like so (again, the columns need to follow the standard form described above):
updated_df.drop(columns = [i+'_x' for i in cols_ind], inplace = True)
updated_df.rename(columns = {i+'_y': i for i in cols_ind}, inplace = True)
This should match the exact output you gave.
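For around 10 indicator columns, the same idea can be wrapped in a small helper. The sketch below is not taken from either answer: stamp_changes and its parameters are hypothetical names, and it assumes every indicator column in df2 has a matching '<indicator>_change_date' column in df1 (so service_ind_change_date is already renamed). It looks up df2's values by customer_id with Series.map, so row order and row count do not have to match, as long as customer_id is unique in df2.

```python
import datetime as dt

import pandas as pd


def stamp_changes(df1, df2, key='customer_id', suffix='_change_date', stamp=None):
    """Return a copy of df1 with change-date columns stamped where df2 differs.

    Hypothetical helper: assumes each indicator column `col` in df2 has a
    matching `col + suffix` column in df1, and that `key` is unique in df2.
    """
    if stamp is None:
        stamp = dt.date.today().strftime('%d/%m/%Y')
    out = df1.copy()
    lookup = df2.set_index(key)
    for col in lookup.columns:
        # df2's value for each df1 customer (NaN when the customer is not in df2)
        other = out[key].map(lookup[col])
        changed = other.notna() & (out[col] != other)
        out.loc[changed, col + suffix] = stamp
    return out


# The question's data, with the date column renamed to the standard form.
df1 = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'service_id_ind': ['n', 'y', 'n', 'y', 'n', 'y'],
    'service_id_ind_change_date': ['1/1/2100'] * 6,
    'nar_id_ind': ['n'] * 6,
    'nar_id_ind_change_date': ['1/1/2100'] * 6,
})
df2 = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5, 6],
    'service_id_ind': ['n', 'y', 'y', 'y', 'n', 'n'],
    'nar_id_ind': ['n', 'y', 'y', 'y', 'y', 'y'],
})

# 'today_date' is the question's placeholder; omit stamp to use the real date.
result = stamp_changes(df1, df2, stamp='today_date')
print(result)
```

The indicator columns in the result stay as they were in df1; only the change-date columns are touched, and df1 itself is left unmodified because the helper works on a copy.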