reshape pandas data frame: duplicated rows to columns, with textual data

Question:

I have a dataframe like this:

INDEX_COL                col1
A                        Random Text 
B                        Some more random text
C                        more stuff
A                        Blah
B                        Blah, Blah
C                        Yet more stuff
A                        erm
B                        yup
C                        whatever
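
For reproducibility, the frame above can be built roughly like this. This is a minimal sketch that assumes INDEX_COL is an ordinary column rather than the index; the variable name `df` is an illustrative choice.

```python
import pandas as pd

# Assumption: INDEX_COL is a regular column, not the DataFrame index.
df = pd.DataFrame({
    "INDEX_COL": ["A", "B", "C"] * 3,
    "col1": [
        "Random Text", "Some more random text", "more stuff",
        "Blah", "Blah, Blah", "Yet more stuff",
        "erm", "yup", "whatever",
    ],
})

print(df)
```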

What I need is for it to be reshaped into new columns, grouped by the values in INDEX_COL. So something like this:

A                               B                              C
Random Text                     Some more random text          more stuff
Blah                            Blah, Blah                     Yet more stuff
erm                             yup                            whatever

I’ve reviewed “How can I pivot a dataframe?”, but all of the examples there work with numerical data, and this use case involves textual data, so aggregation appeared not to be an option (but it was; see accepted answer below).

I’ve tried the following:

Pivot – but all the examples I’ve seen involve numerical values with aggregate functions. This is reshaping non-numerical data

I get that index=INDEX_COL and columns=‘col1’, but values? Add a numerical column, pivot, and then drop the numerical columns created? It feels like trying to force pivot to do something it was never meant to do.

Unstack – but this seems to convert the df into a new df with a single value index of ‘b’

unstack(level=0)

I’ve even considered slicing the data frame by index into separate dataframes and then concatenating them, but the mismatched indexes result in NaN appearing like a checkerboard. This also feels like an ugly solution.

I’ve tried dropping the index_col, with Col1=[‘A,B,C’] and col2= the random text, but the new integer index comes along and spoils the fun.

Any suggestions or thoughts in which direction I should go with this?

Asked By: Greg Williams


Answers:

You can use agg(list) and then explode the whole dataframe:

output = df.groupby('INDEX_COL').agg(list).T.apply(pd.Series.explode)

output:

INDEX_COL            A                      B               C
col1       Random Text  Some more random text      more stuff
col1              Blah             Blah, Blah  Yet more stuff
col1               erm                    yup        whatever
Answered By: Nuri Taş

Try this if ‘INDEX_COL’ is in the dataframe index:

df.set_index(df.groupby(level=0).cumcount(), append=True)['col1'].unstack(0)

Output:

INDEX_COL            A                      B               C
0          Random Text  Some more random text      more stuff
1                 Blah             Blah, Blah  Yet more stuff
2                  erm                    yup        whatever

Otherwise, run df = df.set_index('INDEX_COL') first.
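
Broken into steps, the one-liner above might look like the following. This is a self-contained sketch using the sample data from the question; the names `occurrence` and `wide` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question, with INDEX_COL moved into the index.
df = pd.DataFrame({
    "INDEX_COL": ["A", "B", "C"] * 3,
    "col1": [
        "Random Text", "Some more random text", "more stuff",
        "Blah", "Blah, Blah", "Yet more stuff",
        "erm", "yup", "whatever",
    ],
}).set_index("INDEX_COL")

# Number each occurrence of a label: first A -> 0, second A -> 1, and so on.
occurrence = df.groupby(level=0).cumcount()

# Append the counter as a second index level, then move
# INDEX_COL (level 0) into the columns.
wide = df.set_index(occurrence, append=True)["col1"].unstack(0)

print(wide)
```

The counter gives every (label, occurrence) pair a unique key, so unstack has exactly one value per cell and no aggregation is needed.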

Answered By: Scott Boston

Another possible solution, using pandas.pivot_table:

(df.pivot_table(columns='INDEX_COL', values='col1', aggfunc=list)
 .pipe(lambda d: d.explode(d.columns.tolist()))
 .reset_index(drop=True))

Output:

INDEX_COL            A                      B               C
0          Random Text  Some more random text      more stuff
1                 Blah             Blah, Blah  Yet more stuff
2                  erm                    yup        whatever
Answered By: PaulS
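
On the question's worry that pivot needs a numerical column: a helper counter can serve as the row key while the text stays in values, so nothing is aggregated. The sketch below is not taken from any of the answers; it assumes INDEX_COL is an ordinary column, and the helper column name `row` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "INDEX_COL": ["A", "B", "C"] * 3,
    "col1": [
        "Random Text", "Some more random text", "more stuff",
        "Blah", "Blah, Blah", "Yet more stuff",
        "erm", "yup", "whatever",
    ],
})

wide = (
    # "row" is a hypothetical helper: the n-th occurrence of each label.
    df.assign(row=df.groupby("INDEX_COL").cumcount())
      # Each (row, INDEX_COL) pair is unique, so pivot needs no aggregation.
      .pivot(index="row", columns="INDEX_COL", values="col1")
      # Drop the axis names so only A, B, C remain as column headers.
      .rename_axis(index=None, columns=None)
)

print(wide)
```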