A function sets every value in a DataFrame column to the same value, ignoring that the conditions differ from row to row, and .apply() raises an error

Question:

Please advise me how to use .apply() correctly to get the right result.

I have 2 Pandas dataframes with 'a', 'b', 'c' columns, and I want to change some of the 'c' column data of the second dataframe, df_2. The 'c' data of df_2 needs to change only in those rows where 'a' is equal to 1. A 0 in those 'c' rows has to be replaced with the median computed over the 'c' values of the rows of the first dataframe, df_1, where 'a' is 1. I wrote a function to do it. It is applied to df_2 and uses df_1 data.

The problem is:

  1. If the function is applied like this: df_2['c'] = set_c(df_1, df_2), the whole 'c' column of df_2 gets the new value, whether or not 'a' == 1. This is incorrect.

  2. If the function is applied like this: df_2['c'] = df_2.apply(set_c(df_2, df_1)), an error occurs: 'AssertionError:' with no additional message.

Code is:

import pandas as pd

df_1 = pd.DataFrame({'a': [1,2,1], 'b': [4,5,6], 'c': [7,100,9]}) # From C
df_2 = pd.DataFrame({'a': [1,2,3], 'b': [4,50,6], 'c': [0,0,0]}) # To C

display('df_1', df_1)
display('df_2', df_2)

def set_c(df1, df2):
    
    mask = ( df1.loc[:, 'a'] == df2.loc[0, 'a'] )
    final_c = df1[mask]['c'].median()
    
    display("df2.loc[0, 'a']", df2.loc[0, 'a'])
    display('df1[mask]', df1[mask])
    print('final_c median', final_c)
    
    return final_c

df_2['c'] = set_c(df_1, df_2)

display(df_2)

df_2 and df_1 are the global dataframes outside the function; df2 and df1 are the function's parameters, i.e. the dataframes as seen inside the function.

My function displays every step of its work. For variant 1 it shows the following:

'df_1'
    a   b   c
0   1   4   7
1   2   5   100
2   1   6   9

'df_2'
    a   b   c
0   1   4   0
1   2   50  0
2   3   6   0

"df2.loc[0, 'a']" # 'a'=1 is a mask base for counting 'c'-median on df_1 data
1
 
'df1[mask]' # Rows of df_1 with 'a'=1 were found!
    a   b   c
0   1   4   7
2   1   6   9

final_c median 8.0 # It is median between 7 and 9 of df1

'df_2 result'
    a   b   c
0   1   4   8.0
1   2   50  8.0
2   3   6   8.0

Could you please show me the correct way to apply this function so that df_2 gets the new 'c' = 8.0 in row [0], 100 in row [1] ('a' = 2), and 0 in row [2] (there is no 'a' = 3 in df_1):

'df_2 result'
    a   b   c
0   1   4   8.0
1   2   50  100.0
2   3   6   0.0

Thank you very much!

Asked By: newnewer

||

Answers:

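If I understood you correctly, I suggest using the apply method with axis=1 and passing each row's index label to the function, then using that label as the row index for loc. Your original set_c returns a single scalar, so assigning it to df_2['c'] writes that one value into every row.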

import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 5, 6], 'c': [7, 100, 9]})  # From C
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 50, 6], 'c': [0, 0, 0]})  # To C


def set_c(df1, df2, i):
    mask = (df1.loc[:, 'a'] == df2.loc[i, 'a'])
    final_c = df1[mask]['c'].median()

    return final_c


# Iterate over the rows of df_2 (the frame being updated); x.name is the row's index label.
df_2['c'] = df_2.apply(lambda x: set_c(df_1, df_2, x.name), axis=1).fillna(0)
print(df_2)

OUTPUT:

   a   b      c
0  1   4    8.0
1  2  50  100.0
2  3   6    0.0
Answered By: Ze'ev Ben-Tsvi
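
The same row-wise idea can probably also be written without a lambda, by letting apply pass each row to a named function directly. This is only a sketch under assumptions: the helper name median_for_row and its source parameter are made up for illustration, and keeping the row's current 'c' when df_1 has no matching 'a' is my reading of the desired result.

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 5, 6], 'c': [7, 100, 9]})  # From C
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 50, 6], 'c': [0, 0, 0]})  # To C


def median_for_row(row, source):
    """Hypothetical helper: median of source['c'] over rows whose 'a' equals row['a']."""
    matches = source.loc[source['a'] == row['a'], 'c']
    # Assumption: keep the row's current 'c' when source has no matching 'a'.
    return row['c'] if matches.empty else matches.median()


# With axis=1, apply calls the function once per row of df_2;
# args supplies the extra positional argument (the lookup frame).
df_2['c'] = df_2.apply(median_for_row, axis=1, args=(df_1,))
print(df_2)
```

Because the function receives the row itself, it does not need the row's index label or a second reference to df_2.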

Dear friends, if you know how to do the same without a lambda, please show me. It will help me and other newcomers to understand the difference between several approaches, and how to match up rows when working with two different dataframes and computing a column of the second dataframe from the data of the first one.
Thank you very much!

Answered By: newnewer
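
One possible way to do this with neither a lambda nor a row-by-row apply is to compute the medians once per 'a' value and then look them up. The sketch below assumes that rows of df_2 whose 'a' does not occur in df_1 should keep their current 'c'; the variable name medians is only for illustration.

```python
import pandas as pd

df_1 = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 5, 6], 'c': [7, 100, 9]})  # From C
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 50, 6], 'c': [0, 0, 0]})  # To C

# Median of 'c' for every distinct 'a' in df_1 (a Series indexed by 'a').
medians = df_1.groupby('a')['c'].median()

# Look up each df_2 row's 'a' in that Series; values of 'a' missing from df_1 give NaN,
# which is then filled from the existing 'c' column (aligned by row index).
df_2['c'] = df_2['a'].map(medians).fillna(df_2['c'])
print(df_2)
```

Here the correspondence between rows of the two dataframes comes from the value of 'a' rather than from row position, so the two dataframes may have different lengths and indexes.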