Check whether comma-separated values in a dataframe contain values from another dataframe in Python and add the corresponding values
Question:
I have 2 dataframes that look like this (my original dataset is huge):
df1:
gene_callers_id
0 4717766,4743899,11597717,12116240
1 4717766,4743899,12116240,7719716,4022000
2 4717766,4743899,12116240,7248697,7719716
df2:
gene_callers_id sample_1 sample_2
0 4743899 0.345000 0.176000
1 4717766 0.000000 2.500000
2 4743898 0.000000 0.684982
I'm trying to check whether the comma-separated values in each row of df1 match the first column of df2 and, if they do, add the corresponding values for each sample from df2 to df1 and then calculate their average. My apologies if I sound confusing, but here is the output I'm looking for:
+------------------------------------------+-----------------+--------------+---------------------------+-----------+
| gene_callers_id | sample_1 | sample_1_avg |sample_2 |sample2_avg|
+------------------------------------------+-----------------+--------------+---------------------------+-----------+
| 4717766,4743899,11597717,12116240 | 0,0.345000,0,0 | 0.08635 |2.500,0.684982,0,0 |0.79 |
+------------------------------------------+-----------------+--------------+---------------------------+-----------+
| 4717766,4743899,12116240,7719716,4022000 | 0,0,0,0,0 | 0 |2.500,0.684982,0,0,0 |0.64 |
+------------------------------------------+-----------------+--------------+---------------------------+-----------+
| 4717766,4743899,12116240,7248697,4743898 | 0,0,0,0,0.345000| 0.06900 |2.500,0.1760,0,0,0.684982 |0.67 |
+------------------------------------------+-----------------+--------------+---------------------------+-----------+
I'd appreciate some help. Thank you.
Answers:
Use:
# split the comma-separated values into separate columns
df = df1['gene_callers_id'].str.split(',', expand=True)
# reshape to long format with melt, left-merge df2 (duplicate IDs removed), then aggregate per original row
f = lambda x: ','.join(x.astype(str))
df = (df1.join(df.melt(ignore_index=False, value_name='gene_callers_id')
                 .reset_index()
                 .dropna(subset=['gene_callers_id'])
                 .merge(df2.assign(gene_callers_id=df2['gene_callers_id'].astype(str))
                           .drop_duplicates(['gene_callers_id']), how='left')
                 .fillna(0)
                 .groupby('index').agg(sample_1=('sample_1', f),
                                       sample_1_avg=('sample_1', 'mean'),
                                       sample_2=('sample_2', f),
                                       sample_2_avg=('sample_2', 'mean')))
     )
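To see what each step produces, the chain above can be unrolled into named intermediates. This is only a sketch on the question's sample data: the names wide, long_df, lookup, merged and join_values are illustrative, the last ID of row 2 is assumed to be 4743898 (as in the expected output), and the zero-fill is restricted to the sample columns as a precaution.

```python
import pandas as pd

# Sample data from the question; the last ID of row 2 is assumed to be 4743898.
df1 = pd.DataFrame({"gene_callers_id": [
    "4717766,4743899,11597717,12116240",
    "4717766,4743899,12116240,7719716,4022000",
    "4717766,4743899,12116240,7248697,4743898",
]})
df2 = pd.DataFrame({
    "gene_callers_id": [4743899, 4717766, 4743898],
    "sample_1": [0.345, 0.0, 0.0],
    "sample_2": [0.176, 2.5, 0.684982],
})
sample_cols = ["sample_1", "sample_2"]

# Step 1: one column per comma-separated position; shorter rows are padded with missing values.
wide = df1["gene_callers_id"].str.split(",", expand=True)

# Step 2: one row per (original row, ID); the original row label ends up in the "index" column.
long_df = (wide.melt(ignore_index=False, value_name="gene_callers_id")
               .reset_index()
               .dropna(subset=["gene_callers_id"]))

# Step 3: a left merge keeps IDs that are absent from df2; their sample values become 0.
lookup = (df2.assign(gene_callers_id=df2["gene_callers_id"].astype(str))
             .drop_duplicates("gene_callers_id"))
merged = long_df.merge(lookup, on="gene_callers_id", how="left")
merged[sample_cols] = merged[sample_cols].fillna(0)

# Step 4: per original row, join the values into one string and take their mean.
join_values = lambda s: ",".join(s.astype(str))
result = df1.join(merged.groupby("index").agg(
    sample_1=("sample_1", join_values),
    sample_1_avg=("sample_1", "mean"),
    sample_2=("sample_2", join_values),
    sample_2_avg=("sample_2", "mean"),
))
print(result)
```

Because melt emits all first-position IDs, then all second-position IDs, and so on, and groupby keeps the row order inside each group, the joined strings follow the same order as the IDs in df1.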
Another solution for multiple sample columns:
df = df1['gene_callers_id'].str.split(',', expand=True)
f = lambda x: ','.join(x.astype(str))
df = (df1.join(df.melt(ignore_index=False, value_name='gene_callers_id')
                 .reset_index()
                 .dropna(subset=['gene_callers_id'])
                 .merge(df2.assign(gene_callers_id=df2['gene_callers_id'].astype(str))
                           .drop_duplicates(['gene_callers_id']), how='left')
                 .fillna(0)
                 # drop the helper columns so only the sample columns are aggregated
                 .drop(['variable', 'gene_callers_id'], axis=1)
                 .groupby('index').agg([('', f), ('_avg', 'mean')])
                 # flatten the MultiIndex columns: ('sample_1', '_avg') -> 'sample_1_avg'
                 .pipe(lambda d: d.set_axis([''.join(c) for c in d.columns], axis=1)))
     )
print(df)
gene_callers_id sample_1
0 4717766,4743899,11597717,12116240 0.0,0.345,0.0,0.0
1 4717766,4743899,12116240,7719716,4022000 0.0,0.345,0.0,0.0,0.0
2 4717766,4743899,12116240,7248697,4743898 0.0,0.345,0.0,0.0,0.0
sample_1_avg sample_2 sample_2_avg
0 0.08625 2.5,0.176,0.0,0.0 0.669000
1 0.06900 2.5,0.176,0.0,0.0,0.0 0.535200
2 0.06900 2.5,0.176,0.0,0.0,0.684982 0.672196
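A different route to the same result, not the melt/merge chain above, is Series.explode followed by a reindex lookup. The sketch below is an assumption-laden illustration: add_sample_values is a hypothetical helper name, every column of df2 other than the key is treated as a sample column, and the last ID of row 2 is again assumed to be 4743898.

```python
import pandas as pd

def add_sample_values(df1, df2, key="gene_callers_id"):
    """Hypothetical helper: add joined values and means for every sample column of df2."""
    sample_cols = [c for c in df2.columns if c != key]
    # Lookup table indexed by ID (as strings), with duplicate IDs removed.
    lookup = (df2.assign(**{key: df2[key].astype(str)})
                 .drop_duplicates(key)
                 .set_index(key)[sample_cols])
    # One element per ID; explode repeats the original row label and keeps the order.
    ids = df1[key].str.split(",").explode().str.strip()
    # Look up every ID; IDs missing from df2 give NaN rows, which are replaced by 0.
    values = lookup.reindex(ids.to_numpy()).fillna(0).set_axis(ids.index, axis=0)
    grouped = values.groupby(level=0)
    out = df1.copy()
    for col in sample_cols:
        out[col] = grouped[col].agg(lambda s: ",".join(s.astype(str)))
        out[f"{col}_avg"] = grouped[col].mean()
    return out

# Sample data from the question; the last ID of row 2 is assumed to be 4743898.
df1 = pd.DataFrame({"gene_callers_id": [
    "4717766,4743899,11597717,12116240",
    "4717766,4743899,12116240,7719716,4022000",
    "4717766,4743899,12116240,7248697,4743898",
]})
df2 = pd.DataFrame({
    "gene_callers_id": [4743899, 4717766, 4743898],
    "sample_1": [0.345, 0.0, 0.0],
    "sample_2": [0.176, 2.5, 0.684982],
})
result = add_sample_values(df1, df2)
print(result)
```

This avoids padding every row out to the length of the longest ID list, which may matter when a few rows hold far more IDs than the rest.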