Comparison between dataframes without loops
Question:
I have two DataFrames structured like this:
Df1:
gene ids | Go terms |
---|---|
ID1 | GO1 |
ID1 | GO2 |
ID2 | GO1 |
ID2 | GO3 |
ID3 | GO1 |
ID4 | GO1 |
Df2:
MP terms | MP names | gene ids |
---|---|---|
MP1 | Name1 | ID1, ID2, ID4 |
MP2 | Name2 | ID1,ID3 |
MP3 | Name3 | ID2 |
Now I would like to create a third Data Frame combining the previous two in this way:
Df3:
gene ids | Mp terms | GO terms |
---|---|---|
ID1 | MP1,MP2 | GO1,GO2 |
ID2 | MP1,MP3 | GO1,GO3 |
ID3 | MP2 | GO1 |
ID4 | MP1 | GO1 |
So the first column holds the ids from Df1 (without repetitions), the second column holds the MP terms from Df2 associated with each id, and the third column holds the GO terms from Df1 associated with each gene id.
I could do this with nested for loops, but I know that would be very inefficient. How can I do it without loops?
Thank you very much for the help.
Answers:
Use `concat` with aggregation by `GroupBy.agg` and `join`, after applying `DataFrame.explode` to the values in `df2` split on `,`:
    import pandas as pd

    # r',\s*' splits on a comma plus any whitespace that follows it
    df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                       .explode('gene ids')
                       .groupby('gene ids')['MP terms'].agg(', '.join),
                    df1.groupby('gene ids')['Go terms'].agg(', '.join)], axis=1).reset_index()
    print(df)
      gene ids  MP terms  Go terms
    0      ID1  MP1, MP2  GO1, GO2
    1      ID2  MP1, MP3  GO1, GO3
    2      ID3       MP2       GO1
    3      ID4       MP1       GO1
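For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the two frames from the question and applies the same `concat` approach. The intermediate names `exploded`, `mp`, `go` and `df3` are only illustrative, and the frames are typed in by hand from the tables in the question.

```python
import pandas as pd

# Sample data copied from the tables in the question.
df1 = pd.DataFrame({
    "gene ids": ["ID1", "ID1", "ID2", "ID2", "ID3", "ID4"],
    "Go terms": ["GO1", "GO2", "GO1", "GO3", "GO1", "GO1"],
})
df2 = pd.DataFrame({
    "MP terms": ["MP1", "MP2", "MP3"],
    "MP names": ["Name1", "Name2", "Name3"],
    "gene ids": ["ID1, ID2, ID4", "ID1,ID3", "ID2"],
})

# Split each cell into a list of ids, then give every id its own row.
exploded = (
    df2.assign(**{"gene ids": df2["gene ids"].str.split(r",\s*")})
       .explode("gene ids")
)

# Join the terms per gene id; both results are Series indexed by gene id.
mp = exploded.groupby("gene ids")["MP terms"].agg(", ".join)
go = df1.groupby("gene ids")["Go terms"].agg(", ".join)

# Align the two Series on their shared index and turn it back into a column.
df3 = pd.concat([mp, go], axis=1).reset_index()
print(df3)
```

The aggregation still runs a Python-level `join` per group, so it avoids explicit loops in your code rather than all looping internally.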
If you need to aggregate all columns with `join`, use:
    df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                       .explode('gene ids')
                       .groupby('gene ids').agg(', '.join),
                    df1.groupby('gene ids').agg(', '.join)], axis=1).reset_index()
    print(df)
      gene ids  MP terms      MP names  Go terms
    0      ID1  MP1, MP2  Name1, Name2  GO1, GO2
    1      ID2  MP1, MP3  Name1, Name3  GO1, GO3
    2      ID3       MP2         Name2       GO1
    3      ID4       MP1         Name1       GO1
You can use `groupby.agg` to join the rows that share an ID into one string, and `split` + `explode` to expand the comma-separated ids into multiple rows. Finally, `merge` the two parts to align your output:
    out = (
        df1.groupby('gene ids', as_index=False).agg(','.join)
           # r',\s*' splits on a comma plus any whitespace that follows it
           .merge((df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
                      .explode('gene ids')
                      .groupby('gene ids', as_index=False).agg(', '.join)
                   ), how='left')
    )
Output:
      gene ids  Go terms  MP terms      MP names
    0      ID1   GO1,GO2  MP1, MP2  Name1, Name2
    1      ID2   GO1,GO3  MP1, MP3  Name1, Name3
    2      ID3       GO1       MP2         Name2
    3      ID4       GO1       MP1         Name1
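To see what the `split` + `explode` step does on its own, here is a small sketch using the `df2` from the question (typed in by hand; `long_df2` is just an illustrative name). Each comma-separated cell becomes a list, and `explode` then repeats the other columns once per list element.

```python
import pandas as pd

# df2 as given in the question; note the inconsistent spacing after commas.
df2 = pd.DataFrame({
    "MP terms": ["MP1", "MP2", "MP3"],
    "MP names": ["Name1", "Name2", "Name3"],
    "gene ids": ["ID1, ID2, ID4", "ID1,ID3", "ID2"],
})

long_df2 = (
    # The regex consumes optional whitespace after each comma, so no id
    # ends up with a leading space.
    df2.assign(**{"gene ids": df2["gene ids"].str.split(r",\s*")})
       # One row per (MP term, gene id) pair.
       .explode("gene ids")
       # explode repeats the original row labels; renumber them for clarity.
       .reset_index(drop=True)
)
print(long_df2)
```

Splitting on a plain `','` instead would leave values such as `' ID2'`, which would then form their own groups separate from `'ID2'`.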
If you're not interested in the "MP names" column, select only `MP terms` in the second `groupby.agg`:
    out = (
        df1.groupby('gene ids', as_index=False).agg(','.join)
           .merge((df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
                      .explode('gene ids')
                      .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join)
                   ), how='left')
    )
Output:
      gene ids  Go terms  MP terms
    0      ID1   GO1,GO2  MP1, MP2
    1      ID2   GO1,GO3  MP1, MP3
    2      ID3       GO1       MP2
    3      ID4       GO1       MP1
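One practical difference between the `merge(..., how='left')` and `concat(..., axis=1)` approaches shows up when df2 mentions an id that df1 does not contain. The sketch below uses a hypothetical `ID5` that is not in the question's data: the left merge keeps only the ids from df1 (which is what the question asks for), while `concat` aligns on the union of both indexes.

```python
import pandas as pd

df1 = pd.DataFrame({
    "gene ids": ["ID1", "ID1", "ID2"],
    "Go terms": ["GO1", "GO2", "GO1"],
})
# "ID5" is a hypothetical gene id that appears only in df2.
df2 = pd.DataFrame({
    "MP terms": ["MP1", "MP2"],
    "gene ids": ["ID1, ID2", "ID1, ID5"],
})

go = df1.groupby("gene ids")["Go terms"].agg(", ".join).reset_index()
mp = (
    df2.assign(**{"gene ids": df2["gene ids"].str.split(r",\s*")})
       .explode("gene ids")
       .groupby("gene ids")["MP terms"].agg(", ".join)
       .reset_index()
)

# Left merge: only ids present in df1 survive.
left = go.merge(mp, on="gene ids", how="left")

# concat on the index: union of ids, with NaN where a frame has no entry.
outer = (
    pd.concat([mp.set_index("gene ids"), go.set_index("gene ids")], axis=1)
      .rename_axis("gene ids")
      .reset_index()
)

print(left)
print(outer)
```

With the data in the question every id occurs in both frames, so the two approaches give the same rows there.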