How to compare two DataFrames' columns and change the existing columns accordingly in Python

Question:

I have two DataFrames, df1 and df2, shown below. The required output is an updated df1, shown as df1_output.

Only df1's change_date columns need to change.

In my real use case I have around 10 indicator columns to compare, but both df1 and df2 are small, with around 500 rows each.

import pandas as pd

dict_1 = {'customer_id': [1,2,3,4,5,6],
          'service_id_ind': ['n','y','n','y','n','y'],
          'service_ind_change_date': ['1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100'],
          'nar_id_ind': ['n','n','n','n','n','n'],
          'nar_id_ind_change_date': ['1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100','1/1/2100']}
df1 = pd.DataFrame(dict_1, columns = ['customer_id','service_id_ind','service_ind_change_date','nar_id_ind','nar_id_ind_change_date'])
df1

dict_2 = {'customer_id': [1,2,3,4,5,6],
          'service_id_ind': ['n','y','y','y','n','n'],
          'nar_id_ind': ['n','y','y','y','y','y']}
df2 = pd.DataFrame(dict_2, columns = ['customer_id','service_id_ind','nar_id_ind'])
df2

If an id_ind value for a customer_id in df2 differs from the value in df1, the corresponding change_date column in df1 for that customer_id should be set to today's date (today_date below).

dict_output = {'customer_id': [1,2,3,4,5,6],
          'service_id_ind': ['n','y','n','y','n','y'],
          'service_ind_change_date':['1/1/2100','1/1/2100','today_date','1/1/2100','1/1/2100','today_date'], 
          'nar_id_ind':['n','n','n','n','n','n'],
         'nar_id_ind_change_date':['1/1/2100','today_date','today_date','today_date','today_date','today_date']}
df1_output = pd.DataFrame(dict_output, columns = ['customer_id','service_id_ind','service_ind_change_date','nar_id_ind','nar_id_ind_change_date'])
df1_output

Please suggest an optimized way to code this.

Asked By: rabasa97


Answers:

EDIT after request for different number of rows:

import datetime as dt

# Rename so every date column follows the '<indicator>_change_date' pattern
df1 = df1.rename(columns={'service_ind_change_date': 'service_id_ind_change_date'})
# Shared columns minus the key: Index(['service_id_ind', 'nar_id_ind'], dtype='object')
check_cols = df1.columns.intersection(df2.columns).delete(0)
today = dt.datetime.today().strftime("%d/%m/%Y")

keep_cols = df1.columns
df1 = df1.merge(df2.add_suffix('_2'), left_on=['customer_id'], right_on=['customer_id_2'], how='left')

for column in check_cols:
    # Update only rows where df2 has a value and that value differs from df1's
    changed = (df1[column] != df1[f'{column}_2']) & (~df1[f'{column}_2'].isna())
    df1.loc[changed, f'{column}_change_date'] = today

# Cut back to df1's original columns
df1 = df1[keep_cols]

This should work if df2 has a different number of customers than df1. It relies on customers having the same customer_id in both frames. Whichever frame has more customers, customers missing from either one won't be updated.

The changes are:
keep_cols and df1.merge combine the two tables and then cut the result back to df1's columns at the end. I add the _2 suffix explicitly because I don't like the _x and _y suffixes that merge adds automatically.
The condition now also includes ~df1[f'{column}_2'].isna(), which keeps only the rows where the df2 value is not NaN, i.e. customers that were found in df2.
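
To see how the left merge treats customers that exist in only one frame, here is a minimal sketch on made-up data. The three-row frames, the single nar_id_ind column and the names merged, changed and result are assumptions for illustration, not part of the original question.

```python
import datetime as dt
import pandas as pd

# Toy data (assumed): df2 lacks customer 3 and has an extra customer 7
df1 = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "nar_id_ind": ["n", "n", "n"],
    "nar_id_ind_change_date": ["1/1/2100"] * 3,
})
df2 = pd.DataFrame({
    "customer_id": [1, 2, 7],
    "nar_id_ind": ["n", "y", "y"],
})

today = dt.datetime.today().strftime("%d/%m/%Y")
keep_cols = df1.columns

# Left merge keeps every df1 row; customer 3 gets NaN in the *_2 columns,
# and customer 7 (only in df2) never enters the result
merged = df1.merge(df2.add_suffix("_2"), left_on="customer_id",
                   right_on="customer_id_2", how="left")

# Changed = df2 has a value for this customer and it differs from df1's
changed = (merged["nar_id_ind"] != merged["nar_id_ind_2"]) & merged["nar_id_ind_2"].notna()
merged.loc[changed, "nar_id_ind_change_date"] = today

result = merged[keep_cols]
print(result)
```

Only customer 2 gets today's date here: customer 1 is unchanged, customer 3 has no match in df2, and customer 7 is dropped by the left join.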


EDIT after additional comments.

If the logic is to reset the date to today whenever the corresponding value changes, then this should be the clearest way forward.

import datetime as dt
df1 = df1.rename(columns={'service_ind_change_date': 'service_id_ind_change_date'})  # change column name to make logic automatic
check_cols = df1.columns.intersection(df2.columns).delete(0)  # Index(['service_id_ind', 'nar_id_ind'], dtype='object')
for column in check_cols:
    df1.loc[df1[column] != df2[column], f'{column}_change_date'] = dt.datetime.strftime(dt.datetime.today(), "%d/%m/%Y")

.intersection gets the columns that appear in both frames; .delete(0) then removes customer_id, which is the first of them.
.loc selects only the rows in df1 where the df1 value differs from the df2 value, and updates the date column for those rows. You can format the date however you want; the format here follows the sample data.
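
As a check, the sketch below runs this approach on the question's sample data. It assumes the date column has already been renamed to service_id_ind_change_date and that both frames hold the same customers in the same row order; as far as I know, comparing two Series whose index labels differ raises a ValueError in pandas. It uses .drop("customer_id") instead of .delete(0), a small variation that does not depend on column position.

```python
import datetime as dt
import pandas as pd

df1 = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "service_id_ind": ["n", "y", "n", "y", "n", "y"],
    "service_id_ind_change_date": ["1/1/2100"] * 6,  # assumed already renamed
    "nar_id_ind": ["n"] * 6,
    "nar_id_ind_change_date": ["1/1/2100"] * 6,
})
df2 = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "service_id_ind": ["n", "y", "y", "y", "n", "n"],
    "nar_id_ind": ["n", "y", "y", "y", "y", "y"],
})

today = dt.datetime.today().strftime("%d/%m/%Y")

# Columns present in both frames, minus the key column
check_cols = df1.columns.intersection(df2.columns).drop("customer_id")

for column in check_cols:
    # Row-by-row comparison; relies on df1 and df2 sharing the same index
    changed = df1[column] != df2[column]
    df1.loc[changed, f"{column}_change_date"] = today

print(df1)
```

On this data the service date changes for customers 3 and 6 and the nar date changes for customers 2 to 6, which is the pattern shown in df1_output.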

Answered By: thevoiddancer

My understanding of the problem: update service_ind_change_date (and the other date columns similarly) in df1 to today's date if the corresponding service_id_ind (and the other indicator columns similarly) in df2 is 'y'.

This could probably be improved if you can guarantee that both frames have the same indices.

I chose to use np.where, which takes the form np.where(condition, value if true, value if false):

  • It gets a list of the customer_ids in df2 where the id_ind is 'y': list(df2[df2.service_id_ind == 'y'].customer_id)
  • Then it checks whether the customer_id in df1 is in this list: df1.customer_id.isin()
  • If true, fill in todays_date
  • If false, keep the current value, df1.service_ind_change_date

import numpy as np
from datetime import date

todays_date = date.today().strftime("%m/%d/%y")

df1['service_ind_change_date'] = np.where(df1.customer_id.isin(list(df2[df2.service_id_ind == 'y'].customer_id)), todays_date, df1.service_ind_change_date)
df1['service_id_ind'] = np.where(df1.service_ind_change_date == todays_date, 'y', 'n')

df1['nar_id_ind_change_date'] = np.where(df1.customer_id.isin(list(df2[df2.nar_id_ind == 'y'].customer_id)), todays_date, df1.nar_id_ind_change_date)
df1['nar_id_ind'] = np.where(df1.nar_id_ind_change_date == todays_date, 'y', 'n')
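
The np.where and isin steps described above can be tried in isolation on a toy frame. This is only a sketch: the column names change_date and ind, the name flagged_ids and the placeholder string "today" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"customer_id": [1, 2, 3],
                    "change_date": ["1/1/2100"] * 3})
df2 = pd.DataFrame({"customer_id": [1, 2, 3],
                    "ind": ["n", "y", "y"]})

todays_date = "today"  # placeholder so the result is easy to read

# customer_ids whose indicator is 'y' in df2
flagged_ids = df2.loc[df2["ind"] == "y", "customer_id"]

# np.where(condition, value if true, value if false)
df1["change_date"] = np.where(df1["customer_id"].isin(flagged_ids),
                              todays_date,
                              df1["change_date"])
print(df1)
```

Because the lookup goes through customer_id rather than row position, this version does not need the two frames to share an index.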

Update following your request to set the date when the ind column changes, not when it is 'y':

If your column names follow a standard pattern, you can do this without writing each one out.

Imagine they all take the form {var}_id_ind and {var}_id_ind_change_date, like nar_id_ind and nar_id_ind_change_date.
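
One way to check whether the columns really follow this pattern is to strip the _change_date suffix and see whether the remaining name exists as a column. The sketch below does that for df1's original column names; the variable names suffix, date_cols, pairs and unmatched are made up for illustration.

```python
import pandas as pd

# df1's column names as given in the question (before any renaming)
df1_columns = pd.Index(["customer_id", "service_id_ind", "service_ind_change_date",
                        "nar_id_ind", "nar_id_ind_change_date"])

suffix = "_change_date"

# Every date column, mapped to the indicator name the pattern implies
date_cols = [c for c in df1_columns if c.endswith(suffix)]
pairs = {c: c[:-len(suffix)] for c in date_cols}

# Date columns whose implied indicator column does not exist
unmatched = [d for d, ind in pairs.items() if ind not in df1_columns]
print(unmatched)
```

Here service_ind_change_date is reported as unmatched because it implies an indicator named service_ind, while the actual column is service_id_ind. That mismatch is why the rename is needed first.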

# Make the column names standard (np and todays_date come from the code above)
df1.rename(columns = {'service_ind_change_date': 'service_id_ind_change_date'}, inplace = True)

# After the merge, df2's indicator columns get the _x suffix and df1's get _y
updated_df = df2.merge(df1, on = 'customer_id')

# Columns that exist only in df1 are the change-date columns;
# stripping the suffix gives the matching indicator names
cols_var = list(df1.columns.difference(df2.columns))
cols_ind = [i.replace('_change_date', '') for i in cols_var]

for var, ind in zip(cols_var, cols_ind):
    updated_df[var] = np.where(updated_df[f'{ind}_x'] != updated_df[f'{ind}_y'], todays_date, updated_df[var])

If you want to keep df1's ind values, as in your example, drop the other ind columns and rename like so (again, the column names need to follow the standard form described above):

updated_df.drop(columns = [i+'_x' for i in cols_ind], inplace = True)
updated_df.rename(columns = {i+'_y': i for i in cols_ind}, inplace = True)

This should match the output you gave.

Answered By: 34jbonz