Pandas: from rows to multiple columns, grouped by

Question

I’ve been stuck on something for a good while now. I’ve looked at existing threads first, but couldn’t find anything that mixed get_dummies() and aggregation.

I have a dataset that looks like this:

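# Imports presumably used by the snippet (not shown in the original post)
import pandas as pd
from string import punctuation
from nltk import pos_tag, word_tokenize

def getTags(serie):
    tokens = word_tokenize(serie)
    tags = pos_tag(tokens)
    tags = [t[1] for t in tags if t[0] not in punctuation]
    result = " | ".join(tags)
    return result

# sentences is defined elsewhere (not shown)
df = pd.DataFrame({"Text": sentences})
df["Tags"] = df["Text"].apply(getTags)
df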

The "Tags" column can hold either a list or a " | "-separated string, depending on whether getTags() returns tags or result.

I’m trying to give each POS tag in the "Tags" column its own dedicated column, holding the number of times that tag occurs in the row. For instance:

Col1 | Col2
blue | A B A C
red  | B A C C C C D

would become:

Col1 | A | B | C | D
blue | 2 | 1 | 1 | 0
red  | 1 | 1 | 4 | 1

Back to my dataframe, I’ve tried:

(
df["Tags"]
.str
.get_dummies()
)

This splits the tags correctly, but creates duplicate columns (which makes sense, since I didn’t aggregate anything).

I then thought of transposing that, doing a groupby() and a sum() and transposing back again:

(
df["Tags"]
.str
.get_dummies()
.T
.reset_index()
.groupby("index", as_index=False)
.sum()
)

Unfortunately, that doesn’t seem to work: the tag rows (the "index" column) still show duplicate values.

Questions:

Is this the right approach?
If yes, what am I doing wrong after transposing the dummies results?
If no, how would you do that?

Thanks!

Asked By: Louloumonkey

Answer 1

import pandas as pd

df = pd.DataFrame({
'col1': ['blue', 'red'],
'col2': ['A|B|A|C', 'B|A|C|C|C|C|D'],
})

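# Turn each '|'-separated string into a list of tags
df['col2'] = df['col2'].str.split('|')

# One row per (col1, tag) occurrence
df = (df.explode(column='col2')
        .reset_index(drop=True)
     )

# Running count of each tag within each col1 group (1, 2, 3, ...)
df['cumcount'] = df.groupby(['col1', 'col2']).cumcount() + 1

# Keep only the last occurrence (its running count is the total),
# then pivot the tags into columns and fill absent tags with 0
df = (df.drop_duplicates(subset=['col1', 'col2'], keep='last')
        .pivot(index='col1', columns='col2')
        .fillna(0)
        .droplevel(level=0, axis=1)
        .astype(int)
     )

# Clear the leftover 'col2' label on the column axis
df.columns.name = ''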

print(df)

      A  B  C  D
col1            
blue  2  1  1  0
red   1  1  4  1
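
An alternative sketch of the same reshaping, on the same toy data, uses pd.crosstab to do the counting in one step instead of cumcount, drop_duplicates and pivot. The variable names long and counts are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["blue", "red"],
    "col2": ["A|B|A|C", "B|A|C|C|C|C|D"],
})

# One row per (col1, tag) occurrence; ignore_index gives a fresh 0..n-1 index.
long = (
    df.assign(col2=df["col2"].str.split("|"))
      .explode("col2", ignore_index=True)
)

# Count how often each tag occurs for each col1 value; absent tags become 0.
counts = pd.crosstab(long["col1"], long["col2"])
print(counts)
```

This should give the same counts as the table above (blue: 2, 1, 1, 0 and red: 1, 1, 4, 1), with the column axis labelled col2.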

Answered By: Laurent B.
