reshape pandas data frame: duplicated rows to columns, with textual data
Question:
I have a dataframe like this:
INDEX_COL col1
A Random Text
B Some more random text
C more stuff
A Blah
B Blah, Blah
C Yet more stuff
A erm
B yup
C whatever
What I need is it reformed into new columns and stacked/grouped by values in col_1. So something like this:
A B C
Random Text Some more random text more stuff
Blah Blah, Blah Yet more stuff
erm yup whatever
I’ve reviewed How can I pivot a dataframe? but all of the examples work with numerical data and this is a use case that involves textual data, so aggregation appears to be not an option (but it was – see accepted answer below)
I’ve tried the following:
Pivot – but all the examples I’ve seen involve numerical values with aggregate functions. This is reshaping non-numerical data
I get that index=INDEX COL, and columns= ‘col1’, but values? add a numerical column, pivot and then drop the numberical columns created? Feels like trying for forced pivot to do something it was never meant to do.
Unstack – but this seems to convert the df into a new df with a single value index of ‘b’
unstack(level=0)
I’ve even considered slicing the data frame by index into separate dataframes and the concatinating them, but the mismatched indexes result in NaN appearing like a checkerboard. Also this feels like an fugly solution.
I’ve tried dropping the index_col, with Col1=[‘A,B,C’] and col2= the random text, but the new integer index comes along and spoils the fun.
Any suggestions or thoughts in which direction I should go with this?
Answers:
You can use agg(list)
and then explode
the whole dataframe:
output = df.groupby('INDEX_COL').agg(list).T.apply(pd.Series.explode)
output:
INDEX_COL A B C
col1 Random Text Some more random text more stuff
col1 Blah Blah, Blah Yet more stuff
col1 erm yup whatever
Try this if ‘INDEX_COL’ is in the dataframe index:
df.set_index(df.groupby(level=0).cumcount(), append=True)['col1'].unstack(0)
Output:
INDEX_COL A B C
0 Random Text Some more random text more stuff
1 Blah Blah, Blah Yet more stuff
2 erm yup whatever
Otherwise, df = df.set_index('INDEX_COL')
first.
Another possible solution, using pandas.pivot_table
:
(df.pivot_table(columns='INDEX_COL', values='col1', aggfunc=list)
.pipe(lambda d: d.explode(d.columns.tolist()))
.reset_index(drop=True))
Output:
INDEX_COL A B C
0 Random Text Some more random text more stuff
1 Blah Blah, Blah Yet more stuff
2 erm yup whatever
I have a dataframe like this:
INDEX_COL col1
A Random Text
B Some more random text
C more stuff
A Blah
B Blah, Blah
C Yet more stuff
A erm
B yup
C whatever
What I need is it reformed into new columns and stacked/grouped by values in col_1. So something like this:
A B C
Random Text Some more random text more stuff
Blah Blah, Blah Yet more stuff
erm yup whatever
I’ve reviewed How can I pivot a dataframe? but all of the examples work with numerical data and this is a use case that involves textual data, so aggregation appears to be not an option (but it was – see accepted answer below)
I’ve tried the following:
Pivot – but all the examples I’ve seen involve numerical values with aggregate functions. This is reshaping non-numerical data
I get that index=INDEX COL, and columns= ‘col1’, but values? add a numerical column, pivot and then drop the numberical columns created? Feels like trying for forced pivot to do something it was never meant to do.
Unstack – but this seems to convert the df into a new df with a single value index of ‘b’
unstack(level=0)
I’ve even considered slicing the data frame by index into separate dataframes and the concatinating them, but the mismatched indexes result in NaN appearing like a checkerboard. Also this feels like an fugly solution.
I’ve tried dropping the index_col, with Col1=[‘A,B,C’] and col2= the random text, but the new integer index comes along and spoils the fun.
Any suggestions or thoughts in which direction I should go with this?
You can use agg(list)
and then explode
the whole dataframe:
output = df.groupby('INDEX_COL').agg(list).T.apply(pd.Series.explode)
output:
INDEX_COL A B C
col1 Random Text Some more random text more stuff
col1 Blah Blah, Blah Yet more stuff
col1 erm yup whatever
Try this if ‘INDEX_COL’ is in the dataframe index:
df.set_index(df.groupby(level=0).cumcount(), append=True)['col1'].unstack(0)
Output:
INDEX_COL A B C
0 Random Text Some more random text more stuff
1 Blah Blah, Blah Yet more stuff
2 erm yup whatever
Otherwise, df = df.set_index('INDEX_COL')
first.
Another possible solution, using pandas.pivot_table
:
(df.pivot_table(columns='INDEX_COL', values='col1', aggfunc=list)
.pipe(lambda d: d.explode(d.columns.tolist()))
.reset_index(drop=True))
Output:
INDEX_COL A B C
0 Random Text Some more random text more stuff
1 Blah Blah, Blah Yet more stuff
2 erm yup whatever