Agregate two column values from a dataframe together in one column of another dataframe in Python

Question

I have a dataframe with unique gene IDs and multiple columns like this

Table 1

geneID	column 1	column 2
gene1	…	…
gene2	…	…
gene3	…	…
gene4	…	…
…	…	…

And another dataframe with unique GO IDs with more than one gene IDs

Table 2

GO_ID	Description	geneID
GO:58479	one biological description	gene1/gene2/gene3
GO:58014	another description	gene1
GO:37402	description 3	gene1/gene4/gene6
GO:28390	one more	gene2
…	…	…

I don’t know if it’s possible but I would like to add a new column to the table 1 with all the GO IDs and descriptions for each gene. I want just a space bar separation between ID and description but another separation like "/" or ";" between different GO IDs like this:

Table 3

geneID	GO	column 1
gene1	GO:58479 one biological description; GO:58014 another description; GO:37402 description 3;…	…
gene2	GO:58479 one biological description; GO:28390 one more; …	…
gene3	…	…
gene4	…	…
…	…	…

Any idea how to solve it?

Asked By: Antuan Rage

||

Source

Answer 1

Here is a proposition with pandas.DataFrame.merge and pandas.DataFrame.explode :

out = (
        df1
          .merge(df2
                  .assign(geneID= df2["geneID"]
                                    .replace("gen", "gene ", regex=True) #read NB below
                                    .str.split("/"),
                          GO= df2[["GO_ID", "Description"]].agg(" ".join, axis=1))
                  .explode("geneID")
                  .groupby("geneID", as_index=False, sort=False)["GO"].agg("; ".join),
                 on="geneID", how="left")
          [["geneID", "GO"] + df1.columns[1:].tolist()]
      )

# Output :

print(out)

   geneID                                                                                         GO column 1 column 2  column n...
0  gene 1  GO:58479 one biological description; GO:58014 another description; GO:37402 description 3      ...      ...          NaN
1  gene 2                                     GO:58479 one biological description; GO:28390 one more      ...      ...          NaN
2  gene 3                                                        GO:58479 one biological description      ...      ...          NaN
3  gene 4                                                                     GO:37402 description 3      ...      ...          NaN

NB :
In your table2, you write (gen1) instead of (gene 1). So I’m not sure if it’s a typo or not but if it’s the case, remove the string replacement line.

Answered By: abokey

Answer 2

Example

we need miniaml and reproducible example code for answer. let’s make

data1 = [['id1', 1], ['id2', 2], ['id3', 3]]
data2 = [['go1', 'des1', 'id1/id2/id3'], ['go2', 'des2', 'id1']]
df1 = pd.DataFrame(data1, columns=['id', 'col1'])
df2 = pd.DataFrame(data2, columns=['go_id', 'des', 'id'])

df1

    id  col1
0   id1 1
1   id2 2
2   id3 3

df2

    go_id   des     id
0   go1     des1    id1/id2/id3
1   go2     des2    id1

Code

1st step. split df2 and explode

df3 = df2.assign(id=df2['id'].str.split('/')).explode('id')

df3

    go_id   des     id
0   go1     des1    id1
0   go1     des1    id2
0   go1     des1    id3
1   go2     des2    id1

2nd step. join and merge

out = (df3['go_id'].str.cat(df3['des'], ' ').groupby(df3['id']).agg(';'.join)
       .reset_index(name='go').merge(df1, how='right'))

out

    id  go                  col1
0   id1 go1 des1;go2 des2   1
1   id2 go1 des1            2
2   id3 go1 des1            3

Answered By: Panda Kim

Agregate two column values from a dataframe together in one column of another dataframe in Python

Question:

Answers:

# Output :