Comparison between dataframes without loops

Question

I have two DataFrames structured like this:

Df1:

gene ids	Go terms
ID1	GO1
ID1	GO2
ID2	GO1
ID2	GO3
ID3	GO1
ID4	GO1

Df2:

MP terms	MP names	gene ids
MP1	Name1	ID1, ID2, ID4
MP2	Name2	ID1,ID3
MP3	Name3	ID2

Now I would like to create a third Data Frame combining the previous two in this way:

Df3:

gene ids	Mp terms	GO terms
ID1	MP1,MP2	GO1,GO2
ID2	MP1,MP3	GO1,GO3
ID3	MP2	GO1
ID4	MP1	GO1

So the first column holds the gene ids from Df1 (without repetitions), the second column holds the MP terms from Df2 associated with each id, and the third column holds the GO terms from Df1 associated with each gene id.
I could do this with nested for loops, but I know that would be very inefficient. How can I do it without loops?
Thank you very much for the help.

Asked By: Andrea

Answer 1

Split the comma-separated gene ids in df2 and expand them into rows with DataFrame.explode, aggregate each frame with GroupBy.agg and join, then combine the two results with concat:

# split on a comma followed by optional whitespace, then one row per gene id
df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids')['MP terms'].agg(', '.join),
                df1.groupby('gene ids')['Go terms'].agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms  Go terms
0      ID1  MP1, MP2  GO1, GO2
1      ID2  MP1, MP3  GO1, GO3
2      ID3       MP2       GO1
3      ID4       MP1       GO1
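
To run this end to end, here is a minimal sketch that rebuilds the two sample frames from the question and breaks the one-liner into named steps. The DataFrame construction and the intermediate names `long_df2`, `mp` and `go` are my own assumptions for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question (construction is an assumption)
df1 = pd.DataFrame({
    'gene ids': ['ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4'],
    'Go terms': ['GO1', 'GO2', 'GO1', 'GO3', 'GO1', 'GO1'],
})
df2 = pd.DataFrame({
    'MP terms': ['MP1', 'MP2', 'MP3'],
    'MP names': ['Name1', 'Name2', 'Name3'],
    'gene ids': ['ID1, ID2, ID4', 'ID1,ID3', 'ID2'],
})

# Step 1: split each string into a list, then give every gene id its own row
long_df2 = (df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
               .explode('gene ids'))

# Step 2: collapse each frame to one row per gene id
mp = long_df2.groupby('gene ids')['MP terms'].agg(', '.join)
go = df1.groupby('gene ids')['Go terms'].agg(', '.join)

# Step 3: align the two Series on their shared 'gene ids' index
df = pd.concat([mp, go], axis=1).reset_index()
print(df)
```

Note that the pattern has to be `r',\s*'` so that the space after a comma is consumed as well; otherwise ids such as " ID2" keep a leading space and end up in their own groups.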

If you need to aggregate all columns with join, use:

df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids').agg(', '.join),
                df1.groupby('gene ids').agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms      MP names  Go terms
0      ID1  MP1, MP2  Name1, Name2  GO1, GO2
1      ID2  MP1, MP3  Name1, Name3  GO1, GO3
2      ID3       MP2         Name2       GO1
3      ID4       MP1         Name1       GO1

Answered By: jezrael

Answer 2

You can use groupby.agg to join the rows with a common ID as string, and split+explode to expand to multiple rows. Finally merge the two parts to align your output:

out = (
 df1.groupby('gene ids', as_index=False).agg(','.join)
    .merge((df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')}).explode('gene ids')
               .groupby('gene ids', as_index=False).agg(', '.join)
            ), how='left')
)

Output:

  gene ids Go terms  MP terms      MP names
0      ID1  GO1,GO2  MP1, MP2  Name1, Name2
1      ID2  GO1,GO3  MP1, MP3  Name1, Name3
2      ID3      GO1       MP2         Name2
3      ID4      GO1       MP1         Name1

If you’re not interested in the "MP names" column, slice in the second groupby.agg:

out = (
 df1.groupby('gene ids', as_index=False).agg(','.join)
    .merge((df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')}).explode('gene ids')
               .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join)
            ), how='left')
)

Output:

  gene ids Go terms  MP terms
0      ID1  GO1,GO2  MP1, MP2
1      ID2  GO1,GO3  MP1, MP3
2      ID3      GO1       MP2
3      ID4      GO1       MP1
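
A point that may be worth checking is what `how='left'` does when the two frames do not cover the same genes. The sketch below uses made-up data (the ids `ID5` and `ID6` are hypothetical and not in the question) to show that genes present only in df1 are kept with a missing MP value, while genes present only in df2 are dropped.

```python
import pandas as pd

# Hypothetical data: ID5 has no MP term, ID6 has no GO term
df1 = pd.DataFrame({'gene ids': ['ID1', 'ID1', 'ID5'],
                    'Go terms': ['GO1', 'GO2', 'GO4']})
df2 = pd.DataFrame({'MP terms': ['MP1', 'MP2'],
                    'MP names': ['Name1', 'Name2'],
                    'gene ids': ['ID1, ID6', 'ID1']})

# One row per gene id on each side
go = df1.groupby('gene ids', as_index=False).agg(','.join)
mp = (df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
         .explode('gene ids')
         .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join))

# Left merge on the shared 'gene ids' column: df1's genes define the rows
out = go.merge(mp, how='left')
print(out)
```

If you also want the genes that appear only in df2, `how='outer'` would be the option to try instead.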

Answered By: mozway
