Pandas: from rows to multiple columns, grouped by

Question:

I’ve been stuck on something for a good while now. I’ve looked at existing threads first, but couldn’t find anything that mixed get_dummies() and aggregation.

I have a dataset that looks like this:

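import pandas as pd
from nltk import pos_tag, word_tokenize
from string import punctuation  # assumed source of `punctuation`

def getTags(serie):
    tokens = word_tokenize(serie)
    tags = pos_tag(tokens)
    # keep only the POS tag of each (token, tag) pair, skipping punctuation tokens
    tags = [t[1] for t in tags if t[0] not in punctuation]
    result = " | ".join(tags)
    return result

df = pd.DataFrame({"Text": sentences})
df["Tags"] = df["Text"].apply(getTags)
df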

The "Tags" Series can hold either a list or a " | "-separated string, depending on whether the getTags() function returns tags or result.

I’m trying to give each POS tag in the "Tags" Series its own column, holding the number of times that tag occurs in the row. For instance:

Col1 | Col2
blue | A B A C
red  | B A C C C C D

would become:

Col1 | A | B | C | D
blue | 2 | 1 | 1 | 0
red  | 1 | 1 | 4 | 1 

Back to my dataframe, I’ve tried:

(
df["Tags"]
.str
.get_dummies()
)

This splits the tags, but creates what look like duplicate columns (which makes sense, since I didn’t aggregate anything).

I then thought of transposing that, doing a groupby() and a sum() and transposing back again:

(
df["Tags"]
.str
.get_dummies()
.T
.reset_index()
.groupby("index", as_index=False)
.sum()
)

Unfortunately, that doesn’t seem to work: my tag rows (the "index" column) still show duplicate values.

Questions:

  • is this the right approach?
  • if yes, what am I doing wrong after transposing the dummies results?
  • if no, how would you do that?

Thanks!

Asked By: Louloumonkey

Answers:

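import pandas as pd

df = pd.DataFrame({
    'col1': ['blue', 'red'],
    'col2': ['A|B|A|C', 'B|A|C|C|C|C|D'],
})

# Turn each delimited string into a list of tags
df['col2'] = df['col2'].str.split('|')

# One row per (col1, tag) occurrence
df = df.explode(column='col2').reset_index(drop=True)

# Running count of each tag within its col1 group (1, 2, 3, ...)
df['cumcount'] = df.groupby(['col1', 'col2']).cumcount() + 1

# Keep only the last (highest) running count per pair, then pivot tags into columns
df = (
    df.drop_duplicates(subset=['col1', 'col2'], keep='last')
      .pivot(index='col1', columns='col2')
      .fillna(0)
      .droplevel(level=0, axis=1)
      .astype(int)
)

df.columns.name = ''

print(df)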
      A  B  C  D
col1            
blue  2  1  1  0
red   1  1  4  1
Answered By: Laurent B.
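
A shorter variant of the same explode-then-count idea is sketched below. It is not part of the answer above. It assumes the tags are stored as a " | "-separated string (as produced by getTags() when it returns result), and it swaps the cumcount/pivot steps for pd.crosstab. The names long and counts are only illustrative.

```python
import pandas as pd

# Toy data from the question, using the " | " separator that getTags() produces
df = pd.DataFrame({
    "col1": ["blue", "red"],
    "col2": ["A | B | A | C", "B | A | C | C | C | C | D"],
})

# Split on "|" together with any surrounding whitespace, then create
# one row per (col1, tag) occurrence.
long = (
    df.assign(col2=df["col2"].str.split(r"\s*\|\s*"))
      .explode("col2")
      .reset_index(drop=True)
)

# Cross-tabulate: rows are col1 values, columns are tags, cells are counts.
counts = pd.crosstab(long["col1"], long["col2"])
print(counts)
```

If the column already holds lists (getTags() returning tags), the split step can presumably be dropped and explode applied directly.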
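
As for what went wrong in the original attempt: the screenshots are not available, so this is only a likely explanation. str.get_dummies() splits on "|" by default, so with a " | " separator the surrounding spaces stay attached to the labels, and "NN ", " NN" and " NN " become distinct columns that look like duplicates. Grouping the transposed frame on those labels cannot merge them, because the strings really are different. Note also that get_dummies() returns 0/1 indicators, not occurrence counts. The sketch below uses made-up tag strings to show both effects.

```python
import pandas as pd

# Hypothetical tag strings in the " | "-separated form built by getTags()
tags = pd.Series(["NN | VB | NN", "VB | VB"])

# Default separator "|": the spaces around each tag stay in the labels,
# so the same tag shows up under several differently padded names.
padded = tags.str.get_dummies()
print(list(padded.columns))

# Removing the padding first gives one column per tag, but each cell is
# only a 0/1 indicator: the two "NN" in the first row still yield 1.
clean = tags.str.replace(" | ", "|", regex=False).str.get_dummies()
print(clean)
```

So stripping the whitespace fixes the duplicate columns, but counting occurrences still needs an explode-and-count step rather than get_dummies().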