aggregate pivot in pandas with multiple repeated fields
Question:
I have a dataframe that looks like this:
id Field_name Field_value
1 consent yes
1 _REACTION TIME_ 5547
1 age 24
1 gender X
1 _REACTION TIME_ 45396
1 education uni
1 language EN
1 _REACTION TIME_ 105187
2 consent yes
2 _REACTION TIME_ 3547
2 age 25
2 gender F
2 _REACTION TIME_ 42396
2 education uni
2 language EU
2 _REACTION TIME_ 115427
and I would like to have it as a row per id, with every _REACTION TIME_
row being a different column, such as:
id consent _REACTION TIME_1 age gender _REACTION TIME_2 education language _REACTION TIME_3
1 yes 5547 24 X 45396 uni EN 105187
2 yes 3547 25 F 42396 uni EU 115427
I have been looking for an answer to this all over SO but I can’t find it for this particular issue when only some of the entries are repeated, but they are repeated multiple times.
Thanks in advance!
Answers:
Use GroupBy.cumcount
only for rows with duplicates by DataFrame.duplicated
, so possible pivoting by DataFrame.pivot
, last for original order of columns add DataFrame.reindex
:
m = df.duplicated(['id','Field_name'], keep=False)
df.loc[m, 'Field_name'] += df[m].groupby(['id','Field_name']).cumcount().add(1).astype(str)
cols = df['Field_name'].unique()
df = df.pivot(index='id', columns='Field_name', values='Field_value').reindex(cols, axis=1)
print (df)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
Solution avoiding overwrite original DataFrame is similar:
m = df.duplicated(['id','Field_name'], keep=False)
s = df['Field_name'].add(df.groupby(['id','Field_name']).cumcount().add(1)
.astype(str)).where(m, df['Field_name'])
df1 = (df.assign(Field_name=s)
.pivot(index='id', columns='Field_name', values='Field_value')
.reindex(s.unique(), axis=1))
print (df1)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
If you want to remain _REACTION TIME_
instead of renaming it as _REACTION TIME_1
in column header, you can do groupby.apply
out = (df.groupby('id').apply(lambda g: g.drop('id', axis=1).set_index('Field_name').T)
.reset_index(level=0).reset_index(drop=True)
.rename_axis('', axis=1))
print(out)
id consent _REACTION_TIME_ age gender _REACTION_TIME_ education language _REACTION_TIME_
0 1 yes 5547 24 X 45396 uni EN 105187
1 2 yes 3547 25 F 42396 uni EU 115427
I have a dataframe that looks like this:
id Field_name Field_value
1 consent yes
1 _REACTION TIME_ 5547
1 age 24
1 gender X
1 _REACTION TIME_ 45396
1 education uni
1 language EN
1 _REACTION TIME_ 105187
2 consent yes
2 _REACTION TIME_ 3547
2 age 25
2 gender F
2 _REACTION TIME_ 42396
2 education uni
2 language EU
2 _REACTION TIME_ 115427
and I would like to have it as a row per id, with every _REACTION TIME_
row being a different column, such as:
id consent _REACTION TIME_1 age gender _REACTION TIME_2 education language _REACTION TIME_3
1 yes 5547 24 X 45396 uni EN 105187
2 yes 3547 25 F 42396 uni EU 115427
I have been looking for an answer to this all over SO but I can’t find it for this particular issue when only some of the entries are repeated, but they are repeated multiple times.
Thanks in advance!
Use GroupBy.cumcount
only for rows with duplicates by DataFrame.duplicated
, so possible pivoting by DataFrame.pivot
, last for original order of columns add DataFrame.reindex
:
m = df.duplicated(['id','Field_name'], keep=False)
df.loc[m, 'Field_name'] += df[m].groupby(['id','Field_name']).cumcount().add(1).astype(str)
cols = df['Field_name'].unique()
df = df.pivot(index='id', columns='Field_name', values='Field_value').reindex(cols, axis=1)
print (df)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
Solution avoiding overwrite original DataFrame is similar:
m = df.duplicated(['id','Field_name'], keep=False)
s = df['Field_name'].add(df.groupby(['id','Field_name']).cumcount().add(1)
.astype(str)).where(m, df['Field_name'])
df1 = (df.assign(Field_name=s)
.pivot(index='id', columns='Field_name', values='Field_value')
.reindex(s.unique(), axis=1))
print (df1)
Field_name consent _REACTION TIME_1 age gender _REACTION TIME_2 education
id
1 yes 5547 24 X 45396 uni
2 yes 3547 25 F 42396 uni
Field_name language _REACTION TIME_3
id
1 EN 105187
2 EU 115427
If you want to remain _REACTION TIME_
instead of renaming it as _REACTION TIME_1
in column header, you can do groupby.apply
out = (df.groupby('id').apply(lambda g: g.drop('id', axis=1).set_index('Field_name').T)
.reset_index(level=0).reset_index(drop=True)
.rename_axis('', axis=1))
print(out)
id consent _REACTION_TIME_ age gender _REACTION_TIME_ education language _REACTION_TIME_
0 1 yes 5547 24 X 45396 uni EN 105187
1 2 yes 3547 25 F 42396 uni EU 115427