Check if comma-separated values in a dataframe contain values from another dataframe in Python and add the corresponding values

Question:

I have two dataframes that look like this (my original dataset is huge):

df1:

       gene_callers_id
0      4717766,4743899,11597717,12116240
1      4717766,4743899,12116240,7719716,4022000
2      4717766,4743899,12116240,7248697,4743898

df2:

        gene_callers_id  sample_1  sample_2  
0               4743899  0.345000  0.176000  
1               4717766  0.000000  2.500000  
2               4743898  0.000000  0.684982 

I'm trying to check whether the comma-separated values in each row of df1 match the first column of df2 and, if they do, add the corresponding values for each sample from df2 to df1 and then calculate their average. My apologies if this sounds confusing, but here is the output I'm looking for:

+------------------------------------------+-----------------+--------------+---------------------------+------------+
| gene_callers_id                          | sample_1        | sample_1_avg |sample_2                   |sample_2_avg|
+------------------------------------------+-----------------+--------------+---------------------------+------------+
| 4717766,4743899,11597717,12116240        | 0,0.345000,0,0  | 0.08635      |2.500,0.684982,0,0         |0.79        |
+------------------------------------------+-----------------+--------------+---------------------------+------------+
| 4717766,4743899,12116240,7719716,4022000 | 0,0,0,0,0       | 0            |2.500,0.684982,0,0,0       |0.64        |
+------------------------------------------+-----------------+--------------+---------------------------+------------+
| 4717766,4743899,12116240,7248697,4743898 | 0,0,0,0,0.345000| 0.06900      |2.500,0.1760,0,0,0.684982  |0.67        |
+------------------------------------------+-----------------+--------------+---------------------------+------------+

I'd appreciate some help. Thank you.

Answers:

Use:

# split the comma-separated values into separate columns
df = df1['gene_callers_id'].str.split(',', expand=True)

# reshape to long format with melt, left-merge df2 (duplicate ids removed), then aggregate per original row
f = lambda x: ','.join(x.astype(str))
df = (df1.join(df.melt(ignore_index=False, value_name='gene_callers_id')
                 .reset_index()
                 .dropna(subset=['gene_callers_id'])
                 .merge(df2.assign(gene_callers_id=df2['gene_callers_id'].astype(str))
                           .drop_duplicates(['gene_callers_id']), how='left')
                 .fillna(0)
                 .groupby('index').agg(sample_1=('sample_1', f),
                                       sample_1_avg=('sample_1','mean'),
                                       sample_2=('sample_2', f),
                                       sample_2_avg=('sample_2','mean')))
        )
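
To see what each step of the chain above produces, the intermediate frames can be inspected one at a time. This is only a sketch: the sample frames are rebuilt by hand from the question (row 2 of df1 follows the expected-output table, ending in 4743898), and the names wide, long_ids, lookup and merged are illustrative and not part of the answer.

```python
import pandas as pd

# Sample frames rebuilt by hand from the question (assumption: row 2 ends in 4743898).
df1 = pd.DataFrame({"gene_callers_id": [
    "4717766,4743899,11597717,12116240",
    "4717766,4743899,12116240,7719716,4022000",
    "4717766,4743899,12116240,7248697,4743898",
]})
df2 = pd.DataFrame({
    "gene_callers_id": [4743899, 4717766, 4743898],
    "sample_1": [0.345, 0.0, 0.0],
    "sample_2": [0.176, 2.5, 0.684982],
})

# Step 1: one column per position; shorter rows are padded with missing values.
wide = df1["gene_callers_id"].str.split(",", expand=True)

# Step 2: one row per (original row, position); the padding is dropped.
long_ids = (wide.melt(ignore_index=False, value_name="gene_callers_id")
                .reset_index()
                .dropna(subset=["gene_callers_id"]))

# Step 3: left merge keeps every id from df1; ids missing from df2 become 0.
lookup = (df2.assign(gene_callers_id=df2["gene_callers_id"].astype(str))
             .drop_duplicates("gene_callers_id"))
merged = long_ids.merge(lookup, how="left").fillna(0)

# Step 4: aggregate back to one value per original row of df1.
sample_1_avg = merged.groupby("index")["sample_1"].mean()
print(wide)
print(merged)
print(sample_1_avg)
```

The ids in df2 are cast to strings because the split produces strings; without the cast the merge keys would have different types.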

Another solution for multiple sample columns:

df = df1['gene_callers_id'].str.split(',', expand=True)

f = lambda x: ','.join(x.astype(str))
df = (df1.join(df.melt(ignore_index=False, value_name='gene_callers_id')
                 .reset_index()
                 .dropna(subset=['gene_callers_id'])
                 .merge(df2.assign(gene_callers_id=df2['gene_callers_id'].astype(str))
                           .drop_duplicates(['gene_callers_id']), how='left')
                 .fillna(0)
                 # keep only 'index' and the sample columns, so every remaining column is aggregated
                 .drop(['variable', 'gene_callers_id'], axis=1)
                 .groupby('index').agg([('',f), ('_avg','mean')])
                 # flatten the MultiIndex columns: ('sample_1', '_avg') -> 'sample_1_avg'
                 .pipe(lambda d: d.set_axis([''.join(col) for col in d.columns], axis=1)))
        )
print (df)
                            gene_callers_id               sample_1  
0         4717766,4743899,11597717,12116240      0.0,0.345,0.0,0.0   
1  4717766,4743899,12116240,7719716,4022000  0.0,0.345,0.0,0.0,0.0   
2  4717766,4743899,12116240,7248697,4743898  0.0,0.345,0.0,0.0,0.0   

   sample_1_avg                    sample_2  sample_2_avg  
0       0.08625           2.5,0.176,0.0,0.0      0.669000  
1       0.06900       2.5,0.176,0.0,0.0,0.0      0.535200  
2       0.06900  2.5,0.176,0.0,0.0,0.684982      0.672196  
Answered By: jezrael
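
A possible alternative, not part of the answer above: assuming a pandas version that provides Series.explode, the split/melt step can be replaced by explode and the merge by a Series.map lookup, looping over whatever sample columns df2 has. This is a sketch under those assumptions; the sample frames are rebuilt by hand from the question, the names sample_cols, lookup, ids and out are illustrative, and ids are assumed to carry no surrounding whitespace.

```python
import pandas as pd

# Sample frames rebuilt by hand from the question (assumption: row 2 ends in 4743898).
df1 = pd.DataFrame({"gene_callers_id": [
    "4717766,4743899,11597717,12116240",
    "4717766,4743899,12116240,7719716,4022000",
    "4717766,4743899,12116240,7248697,4743898",
]})
df2 = pd.DataFrame({
    "gene_callers_id": [4743899, 4717766, 4743898],
    "sample_1": [0.345, 0.0, 0.0],
    "sample_2": [0.176, 2.5, 0.684982],
})

sample_cols = [c for c in df2.columns if c != "gene_callers_id"]

# Lookup table keyed by the id as a string; only the first row per id is kept.
lookup = (df2.assign(gene_callers_id=df2["gene_callers_id"].astype(str))
             .drop_duplicates("gene_callers_id")
             .set_index("gene_callers_id"))

# One id per row; the index still holds the original row label of df1.
ids = df1["gene_callers_id"].str.split(",").explode()

out = df1.copy()
for col in sample_cols:
    # Map each id to its sample value; ids missing from df2 become 0.
    per_id = ids.map(lookup[col]).fillna(0)
    grouped = per_id.groupby(level=0)
    out[col] = grouped.agg(lambda s: ",".join(s.astype(str)))
    out[f"{col}_avg"] = grouped.mean()

print(out)
```

Because the loop works one column at a time, no column names are hard-coded, and the values in each joined string keep the order of the ids in df1.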