Pandas groupby two columns and expand the third

Question:

I have a Pandas dataframe with the following structure:

A       B       C
a       b       1
a       b       2
a       b       3
c       d       7
c       d       8
c       d       5
c       d       6
c       d       3
e       b       4
e       b       3
e       b       2
e       b       1

And I will like to transform it into this:

A       B       C1      C2      C3      C4      C5
a       b       1       2       3       NAN     NAN
c       d       7       8       5       6       3
e       b       4       3       2       1       NAN

In other words, something like groupby A and B and expand C into different columns.

Knowing that the length of each group is different.

C is already ordered

Shorter groups can have NAN or NULL values (empty), it does not matter.

Asked By: mirix

||

Answers:

Use GroupBy.cumcount and pandas.Series.add with 1, to start naming the new columns from 1 onwards, then pass this to DataFrame.pivot, and add DataFrame.add_prefix to rename the columns (C1, C2, C3, etc…). Finally use DataFrame.rename_axis to remove the indexes original name (‘g’) and transform the MultiIndex into columns by using DataFrame.reset_indexcolumns A,B:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = df.pivot(['A','B'], 'g', 'C').add_prefix('C').rename_axis(columns=None).reset_index()
print (df)
   A  B   C1   C2   C3   C4   C5
0  a  b  1.0  2.0  3.0  NaN  NaN
1  c  d  7.0  8.0  5.0  6.0  3.0
2  e  b  4.0  3.0  2.0  1.0  NaN

Because NaN is by default of type float, if you need the columns dtype to be integers add DataFrame.astype with Int64:

df['g'] = df.groupby(['A','B']).cumcount().add(1)

df = (df.pivot(['A','B'], 'g', 'C')
        .add_prefix('C')
        .astype('Int64')
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B  C1  C2  C3    C4    C5
0  a  b   1   2   3  <NA>  <NA>
1  c  d   7   8   5     6     3
2  e  b   4   3   2     1  <NA>

EDIT: If there’s a maximum N new columns to be added, it means that A,B are duplicated. Therefore, it will beneeded to add helper groups g1, g2 with integer and modulo division, adding a new level in index:

N = 4
g  = df.groupby(['A','B']).cumcount()
df['g1'], df['g2'] = g // N, (g % N) + 1
df = (df.pivot(['A','B','g1'], 'g2', 'C')
        .add_prefix('C')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
print (df)
   A  B   C1   C2   C3   C4
0  a  b  1.0  2.0  3.0  NaN
1  c  d  7.0  8.0  5.0  6.0
2  c  d  3.0  NaN  NaN  NaN
3  e  b  4.0  3.0  2.0  1.0 
Answered By: jezrael
df1.astype({'C':str}).groupby([*'AB'])
    .agg(','.join).C.str.split(',',expand=True)
    .add_prefix('C').reset_index()


 A  B C0 C1 C2    C3    C4
0  a  b  1  2  3  None  None
1  c  d  7  8  5     6     3
2  e  b  4  3  2     1  None
Answered By: G.G

The accepted solution but avoiding the deprecation warning:

N = 3
g  = df_grouped.groupby(['A','B']).cumcount()
df_grouped['g1'], df_grouped['g2'] = g // N, (g % N) + 1
df_grouped = (df_grouped.pivot(index=['A','B','g1'], columns='g2', values='C')
        .add_prefix('C_')
        .astype('Int64')
        .droplevel(-1)
        .rename_axis(columns=None)
        .reset_index())
Answered By: mirix
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.