Agregate two column values from a dataframe together in one column of another dataframe in Python

Question:

I have a dataframe with unique gene IDs and multiple columns like this

Table 1

geneID column 1 column 2 column n…
gene1
gene2
gene3
gene4

And another dataframe with unique GO IDs with more than one gene IDs

Table 2

GO_ID Description geneID
GO:58479 one biological description gene1/gene2/gene3
GO:58014 another description gene1
GO:37402 description 3 gene1/gene4/gene6
GO:28390 one more gene2

I don’t know if it’s possible but I would like to add a new column to the table 1 with all the GO IDs and descriptions for each gene. I want just a space bar separation between ID and description but another separation like "/" or ";" between different GO IDs like this:

Table 3

geneID GO column 1 column n…
gene1 GO:58479 one biological description; GO:58014 another description; GO:37402 description 3;…
gene2 GO:58479 one biological description; GO:28390 one more; …
gene3
gene4

Any idea how to solve it?

Asked By: Antuan Rage

||

Answers:

Here is a proposition with pandas.DataFrame.merge and pandas.DataFrame.explode :

out = (
        df1
          .merge(df2
                  .assign(geneID= df2["geneID"]
                                    .replace("gen", "gene ", regex=True) #read NB below
                                    .str.split("/"),
                          GO= df2[["GO_ID", "Description"]].agg(" ".join, axis=1))
                  .explode("geneID")
                  .groupby("geneID", as_index=False, sort=False)["GO"].agg("; ".join),
                 on="geneID", how="left")
          [["geneID", "GO"] + df1.columns[1:].tolist()]
      )

# Output :

print(out)

   geneID                                                                                         GO column 1 column 2  column n...
0  gene 1  GO:58479 one biological description; GO:58014 another description; GO:37402 description 3      ...      ...          NaN
1  gene 2                                     GO:58479 one biological description; GO:28390 one more      ...      ...          NaN
2  gene 3                                                        GO:58479 one biological description      ...      ...          NaN
3  gene 4                                                                     GO:37402 description 3      ...      ...          NaN

NB :
In your table2, you write (gen1) instead of (gene 1). So I’m not sure if it’s a typo or not but if it’s the case, remove the string replacement line.

Answered By: abokey

Example

we need miniaml and reproducible example code for answer. let’s make

data1 = [['id1', 1], ['id2', 2], ['id3', 3]]
data2 = [['go1', 'des1', 'id1/id2/id3'], ['go2', 'des2', 'id1']]
df1 = pd.DataFrame(data1, columns=['id', 'col1'])
df2 = pd.DataFrame(data2, columns=['go_id', 'des', 'id']) 

df1

    id  col1
0   id1 1
1   id2 2
2   id3 3

df2

    go_id   des     id
0   go1     des1    id1/id2/id3
1   go2     des2    id1

Code

1st step. split df2 and explode

df3 = df2.assign(id=df2['id'].str.split('/')).explode('id')

df3

    go_id   des     id
0   go1     des1    id1
0   go1     des1    id2
0   go1     des1    id3
1   go2     des2    id1

2nd step. join and merge

out = (df3['go_id'].str.cat(df3['des'], ' ').groupby(df3['id']).agg(';'.join)
       .reset_index(name='go').merge(df1, how='right'))

out

    id  go                  col1
0   id1 go1 des1;go2 des2   1
1   id2 go1 des1            2
2   id3 go1 des1            3
Answered By: Panda Kim
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.