Comparison between dataframes without loops

Question:

I have two data frames structured like this:

Df1:

gene ids  Go terms
ID1       GO1
ID1       GO2
ID2       GO1
ID2       GO3
ID3       GO1
ID4       GO1

Df2:

MP terms  MP names  gene ids
MP1       Name1     ID1, ID2, ID4
MP2       Name2     ID1,ID3
MP3       Name3     ID2

Now I would like to create a third Data Frame combining the previous two in this way:

Df3:

gene ids  MP terms  Go terms
ID1       MP1,MP2   GO1,GO2
ID2       MP1,MP3   GO1,GO3
ID3       MP2       GO1
ID4       MP1       GO1

So the first column holds the gene ids from Df1 (without repetitions), the second column holds the MP terms from Df2 associated with each id, and the third column holds the GO terms from Df1 associated with each id.
I could do this with nested for loops, but I know that would be very inefficient. How can I do it without loops?
Thank you very much for the help.

Asked By: Andrea


Answers:

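Split the comma-separated gene ids in df2 into lists, expand them to one row per id with DataFrame.explode, aggregate each frame per gene id with GroupBy.agg and join, then combine the two results with concat: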

# r',\s*' splits on a comma followed by optional whitespace
df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids')['MP terms'].agg(', '.join),
                df1.groupby('gene ids')['Go terms'].agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms  Go terms
0      ID1  MP1, MP2  GO1, GO2
1      ID2  MP1, MP3  GO1, GO3
2      ID3       MP2       GO1
3      ID4       MP1       GO1
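
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The frames are rebuilt from the tables in the question, and the intermediate names (long_df2, mp, go, df3) are only illustrative. Passing regex=True explicitly is an assumption that pandas 1.4 or newer is installed.

```python
import pandas as pd

# Sample data rebuilt from the question
df1 = pd.DataFrame({
    'gene ids': ['ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4'],
    'Go terms': ['GO1', 'GO2', 'GO1', 'GO3', 'GO1', 'GO1'],
})
df2 = pd.DataFrame({
    'MP terms': ['MP1', 'MP2', 'MP3'],
    'MP names': ['Name1', 'Name2', 'Name3'],
    'gene ids': ['ID1, ID2, ID4', 'ID1,ID3', 'ID2'],
})

# One row per (MP term, gene id) pair
split_ids = df2['gene ids'].str.split(r',\s*', regex=True)
long_df2 = df2.assign(**{'gene ids': split_ids}).explode('gene ids')

# One joined string per gene id, for each frame
mp = long_df2.groupby('gene ids')['MP terms'].agg(', '.join)
go = df1.groupby('gene ids')['Go terms'].agg(', '.join)

# Align the two Series on their shared 'gene ids' index
df3 = pd.concat([mp, go], axis=1).reset_index()
print(df3)
```

Splitting the chain into named steps makes it easier to inspect each intermediate result if the output is not what you expect.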

If you need to aggregate all columns with join, use:

df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids').agg(', '.join),
                df1.groupby('gene ids').agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms      MP names  Go terms
0      ID1  MP1, MP2  Name1, Name2  GO1, GO2
1      ID2  MP1, MP3  Name1, Name3  GO1, GO3
2      ID3       MP2         Name2       GO1
3      ID4       MP1         Name1       GO1
Answered By: jezrael

You can use groupby.agg to join the rows with a common ID as string, and split+explode to expand to multiple rows. Finally merge the two parts to align your output:

out = (
    df1.groupby('gene ids', as_index=False).agg(','.join)
       .merge(
           df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
              .explode('gene ids')
              .groupby('gene ids', as_index=False).agg(', '.join),
           how='left',
       )
)

Output:

  gene ids Go terms  MP terms      MP names
0      ID1  GO1,GO2  MP1, MP2  Name1, Name2
1      ID2  GO1,GO3  MP1, MP3  Name1, Name3
2      ID3      GO1       MP2         Name2
3      ID4      GO1       MP1         Name1
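
To see what the split + explode step does on its own, here is a small sketch using only df2 from the question. The names split_ids, exploded and loose_ids are illustrative, and regex=True assumes pandas 1.4 or newer. It also shows why the backslash in `\s` matters: without it the pattern means a comma followed by any number of the letter "s", so the spaces after the commas stay attached to the ids.

```python
import pandas as pd

df2 = pd.DataFrame({
    'MP terms': ['MP1', 'MP2', 'MP3'],
    'MP names': ['Name1', 'Name2', 'Name3'],
    'gene ids': ['ID1, ID2, ID4', 'ID1,ID3', 'ID2'],
})

# Split on a comma plus optional whitespace: each cell becomes a list of ids
split_ids = df2['gene ids'].str.split(r',\s*', regex=True)

# explode() gives every list element its own row and repeats the other columns
exploded = df2.assign(**{'gene ids': split_ids}).explode('gene ids')
print(exploded)

# Same split without the backslash: leading spaces survive in the ids
loose_ids = df2['gene ids'].str.split(',s*', regex=True)
print(list(loose_ids.iloc[0]))
```

Ids with stray spaces would end up in separate groups and would not match the ids in df1, so it is worth checking the exploded column before grouping.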

If you’re not interested in the "MP names" column, slice in the second groupby.agg:

out = (
    df1.groupby('gene ids', as_index=False).agg(','.join)
       .merge(
           df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
              .explode('gene ids')
              .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join),
           how='left',
       )
)

Output:

  gene ids Go terms  MP terms
0      ID1  GO1,GO2  MP1, MP2
1      ID2  GO1,GO3  MP1, MP3
2      ID3      GO1       MP2
3      ID4      GO1       MP1
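
A note on how='left': it keeps every gene id found in df1 and leaves the MP columns empty (NaN) when df2 has no match, while ids that occur only in df2 are dropped. The sketch below illustrates this with made-up data (ID5 and ID6 are hypothetical ids, not from the question), and regex=True assumes pandas 1.4 or newer.

```python
import pandas as pd

# Hypothetical data: ID5 appears only in df1, ID6 only in df2
df1 = pd.DataFrame({
    'gene ids': ['ID1', 'ID1', 'ID5'],
    'Go terms': ['GO1', 'GO2', 'GO1'],
})
df2 = pd.DataFrame({
    'MP terms': ['MP1'],
    'gene ids': ['ID1, ID6'],
})

go = df1.groupby('gene ids', as_index=False).agg(','.join)
mp = (
    df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*', regex=True)})
       .explode('gene ids')
       .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join)
)

# Left merge: rows follow df1, so ID5 is kept with a missing MP value
left = go.merge(mp, how='left')

# Outer merge: ids from either frame are kept
outer = go.merge(mp, how='outer')

print(left)
print(outer)
```

If ids that exist only in df2 should also appear in the result, how='outer' is the variant to reach for.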
Answered By: mozway