Pandas: from rows to multiple columns, grouped by
Question:
I’ve been stuck on something for a good while now. I’ve looked at existing threads first, but couldn’t find anything that mixed get_dummies() and aggregation.
I have a dataset that looks like this:
def getTags(serie):
    tokens = word_tokenize(serie)
    tags = pos_tag(tokens)
    tags = [t[1] for t in tags if t[0] not in punctuation]
    result = " | ".join(tags)
    return result

df = pd.DataFrame({"Text": sentences})
df["Tags"] = df["Text"].apply(lambda x: getTags(x))
df
Where the "Tags" column can be either a list or a " | "-separated string, depending on whether the getTags() function returns tags or result.
I’m trying to give each POS tag within the "Tags" column its own dedicated column, holding the number of times that tag occurs in the row. For instance:
Col1 | Col2
blue | A B A C
red | B A C C C C D
would become:
Col1 | A | B | C | D
blue | 2 | 1 | 1 | 0
red | 1 | 1 | 4 | 1
Back to my dataframe, I’ve tried:
(
    df["Tags"]
    .str
    .get_dummies()
)
Which correctly splits the tags, but creates duplicate columns (makes sense, I didn’t aggregate anything):
I then thought of transposing that, doing a groupby() and a sum(), and transposing back again:
(
    df["Tags"]
    .str
    .get_dummies()
    .T
    .reset_index()
    .groupby("index", as_index=False)
    .sum()
)
Unfortunately, that doesn’t seem to work: my tag rows ("index" column) still show duplicate values:
Questions:
- is this the right approach?
- if yes, what am I doing wrong after transposing the dummies results?
- if no, how would you do that?
Thanks!
Answers:
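import pandas as pd

df = pd.DataFrame({
    'col1': ['blue', 'red'],
    'col2': ['A|B|A|C', 'B|A|C|C|C|C|D'],
})

# Turn each string into a list of tags, then give every tag its own row.
df['col2'] = df['col2'].str.split('|')
df = (
    df.explode(column='col2')
      .reset_index(drop=True)
)

# Running count of each tag within each col1 group (1, 2, 3, ...).
df['cumcount'] = df.groupby(['col1', 'col2']).cumcount() + 1

# Keep only the last (highest) running count per (col1, col2) pair,
# then spread the tags out into columns.
df = (
    df.drop_duplicates(subset=['col1', 'col2'], keep='last')
      .pivot(index='col1', columns='col2')
      .fillna(0)
      .droplevel(level=0, axis=1)
      .astype(int)
)
df.columns.name = ''
print(df)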
      A  B  C  D
col1
blue  2  1  1  0
red   1  1  4  1
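As for what goes wrong in the original attempt: str.get_dummies() splits on "|" by default, so with a " | " separator the spaces around each tag stay in the column labels. Labels such as "A " and " A " then count as different columns, and grouping on the transposed index still treats them as distinct. get_dummies() also only produces 0/1 indicators, not counts. Below is a minimal sketch of an alternative, assuming the tags are stored as " | "-separated strings as in the question; the column names Col1 and Tags and the sample rows are taken from the question's toy example, and pd.crosstab is swapped in for the cumcount/pivot steps above.

```python
import pandas as pd

# Toy data shaped like the question's example, using its " | " separator.
df = pd.DataFrame({
    "Col1": ["blue", "red"],
    "Tags": ["A | B | A | C", "B | A | C | C | C | C | D"],
})

# Diagnostic: the default separator of str.get_dummies() is "|", so the
# surrounding spaces remain part of the labels and look like duplicates.
dummies = df["Tags"].str.get_dummies()
print(list(dummies.columns))

# One row per tag: split on "|", explode the lists, strip the spaces.
long = (
    df.assign(Tags=df["Tags"].str.split("|"))
      .explode("Tags")
      .reset_index(drop=True)
)
long["Tags"] = long["Tags"].str.strip()

# Count how often each tag occurs for each Col1 value.
counts = pd.crosstab(long["Col1"], long["Tags"])
print(counts)
```

If getTags() returns the list instead of the joined string, the split and strip steps can be dropped and explode() applied to the column directly.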