Finding value in a column and delete the value if not found in pandas

Question:

Comparing the columns df1_user and df2_user. If multiple values of df2_user is found in df1_user then merge those values in a new column with a delimiter.
I have a dataframe (df1) look like this:

df1:

df1_ID  df1_Tag df1_Info    df1_user
   1    Test1   Pass        100,200,300
   2    Test2   Pass        400,500,600,700
   3    Test3   Fail        800,900,1000,1100,1200
   4    Test4   Pass        1300
   5    Test5   Pass        1400,1500

df2:

df2_user
100
300
500
600
700
1100
1200
1300
1400
1600

The result dataframe should look like this:

df3:

df1_ID  df1_Tag df1_Info    df3_user
1        Test1  Pass         100,300
2        Test2  Pass         500,600,700
3        Test3  Fail         1100,1200
4        Test4  Pass         1300
5        Test5  Pass         1400

The code looks like this:

for name in df2['df2_user'].to_list():
    df1.loc[ df1['df1_user'].str.contains(name), 'df3_user' ] = name

I am having issues merging the multiple values with a delimiter.

Asked By: soosa

||

Answers:

Use custom lambda function for test if match splitted values in set from column df2['df2_user']:

s = set(df2['df2_user'].astype(str))

f = lambda x: (','.join(y for y in x.split(',') if y in s))
df1['df3_user'] = df1['df1_user'].apply(f)
print (df1)
   df1_ID df1_Tag df1_Info                df1_user     df3_user
0       1   Test1     Pass             100,200,300      100,300
1       2   Test2     Pass         400,500,600,700  500,600,700
2       3   Test3     Fail  800,900,1000,1100,1200    1100,1200
3       4   Test4     Pass                    1300         1300
4       5   Test5     Pass               1400,1500         1400

Or use DataFrame.explode with filter rows by Series.isin, last join back by GroupBy.agg with join:

df1['df3_user'] = (df1.assign(df1_user = df1['df1_user'].str.split(','))
                      .explode('df1_user')
                      .loc[lambda x: x['df1_user'].isin(df2['df2_user'].astype(str))]
                      .groupby(level=0)['df1_user']
                      .agg(','.join))
print (df1)
   df1_ID df1_Tag df1_Info                df1_user     df3_user
0       1   Test1     Pass             100,200,300      100,300
1       2   Test2     Pass         400,500,600,700  500,600,700
2       3   Test3     Fail  800,900,1000,1100,1200    1100,1200
3       4   Test4     Pass                    1300         1300
4       5   Test5     Pass               1400,1500         1400
Answered By: jezrael
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.