Pandas groupby: How to get a union of strings

Question:

I have a dataframe like this:

   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

Calling

In [10]: print(df.groupby("A")["B"].sum())

will return

A
1    1.615586
2    0.421821
3    0.463468
4    0.643961

Now I would like to do “the same” for column “C”. Because that column contains strings, sum() doesn’t work (although you might think that it would concatenate the strings). What I would really like to see is a list or set of the strings for each group, i.e.

A
1    {This, string}
2    {is, !}
3    {a}
4    {random}

I have been trying to find ways to do this.

Series.unique() (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) doesn’t work, although

df.groupby("A")["B"]

is a

pandas.core.groupby.SeriesGroupBy object

so I was hoping any Series method would work. Any ideas?

Asked By: Anne


Answers:

In [4]: df = read_csv(StringIO(data), sep=r'\s+')

In [5]: df
Out[5]: 
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !

In [6]: df.dtypes
Out[6]: 
A      int64
B    float64
C     object
dtype: object
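
The session above does not show how `data` was defined or which imports were used. A minimal, self-contained sketch of the same setup follows, assuming `data` is simply the question's table pasted into a whitespace-separated string (that string is an assumption, not part of the original answer):

```python
from io import StringIO

import pandas as pd

# Assumption: `data` is the question's table as whitespace-separated text.
data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !
"""

# r"\s+" is a regular expression matching one or more whitespace characters.
df = pd.read_csv(StringIO(data), sep=r"\s+")
print(df.dtypes)
```

The raw-string prefix keeps the backslash in `\s+` from being treated as a string escape.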

When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than applying .sum() directly to the groupby.

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]: 
   A         B           C
A                         
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random

sum concatenates strings by default:

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]: 
A
1    Thisstring
2           is!
3             a
4        random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]: 
A
1    {This, string}
2           {is, !}
3               {a}
4          {random}
dtype: object

To do this on the whole frame, one group at a time, the key is to return a Series:

def f(x):
     return Series(dict(A = x['A'].sum(), 
                        B = x['B'].sum(), 
                        C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]: 
   A         B               C
A                             
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
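
Whether the grouping column is passed to the applied function has changed across pandas releases, so a function that reads x['A'] may not behave the same everywhere. A hedged sketch of the same idea that selects the value columns explicitly is below; the name `summarize` and the explicit `[["B", "C"]]` selection are illustrative choices, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

def summarize(group):
    # Return one Series per group; its labels become the result's columns.
    return pd.Series({
        "B": group["B"].sum(),
        "C": "{%s}" % ", ".join(group["C"]),
    })

# Selecting B and C up front means the function never depends on
# the grouping column being present in each group.
result = df.groupby("A")[["B", "C"]].apply(summarize)
print(result)
```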
Answered By: Jeff

You can use the apply method to apply an arbitrary function to the grouped data. So if you want a set, apply set. If you want a list, apply list.

>>> d
   A       B
0  1    This
1  2      is
2  3       a
3  4  random
4  1  string
5  2       !
>>> d.groupby('A')['B'].apply(list)
A
1    [This, string]
2           [is, !]
3               [a]
4          [random]
dtype: object

If you want something else, just write a function that does what you want and then apply that.
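
For the set variant mentioned above, a minimal sketch could look like this. The frame `d` is rebuilt here from the printed output so the snippet runs on its own:

```python
import pandas as pd

d = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# Each group's values are passed to the callable; the result is one
# Python object (a set or a list) per group.
as_sets = d.groupby("A")["B"].apply(set)
as_lists = d.groupby("A")["B"].apply(list)

print(as_sets)
print(as_lists)
```

Sets drop duplicates and do not keep order, while lists keep both, so the choice depends on what you need downstream.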

Answered By: BrenBarn

You may be able to use the aggregate (or agg) function to concatenate the values. (Untested code)

df.groupby('A')['B'].agg(lambda col: ''.join(col))
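
A runnable sketch of that idea, assuming a frame where column B holds the strings (as in the previous answer's example):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": ["This", "is", "a", "random", "string", "!"],
})

# agg calls the function once per group with that group's values
# and expects a single scalar back, here the concatenated string.
joined = df.groupby("A")["B"].agg(lambda col: "".join(col))
print(joined)
```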
Answered By: voithos

A simple solution would be:

>>> df.groupby(['A','B'])['C'].unique().reset_index()
Answered By: UserYmY

You could try this:

df.groupby('A').agg({'B':'sum','C':'-'.join})
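
As a sketch of what the dictionary form does on the question's data (the frame is rebuilt here so the snippet is self-contained): each key names a column and each value is the aggregation applied to it.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# B is summed numerically; C is joined with "-" between the strings.
result = df.groupby("A").agg({"B": "sum", "C": "-".join})
print(result)
```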
Answered By: user3241146

If you’d like to overwrite column B in the dataframe, this should work:

    df = df.groupby('A', as_index=False).agg(lambda x: '\n'.join(x))
Answered By: Amit

Named aggregations with pandas >= 0.25.0

Since pandas version 0.25.0 we have named aggregations where we can groupby, aggregate and at the same time assign new names to our columns. This way we won’t get the MultiIndex columns, and the column names make more sense given the data they contain:


aggregate and get a list of strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', list)).reset_index()

print(grp)
   A     B_sum               C
0  1  1.615586  [This, string]
1  2  0.421821         [is, !]
2  3  0.463468             [a]
3  4  0.643961        [random]

aggregate and join the strings

grp = df.groupby('A').agg(B_sum=('B','sum'),
                          C=('C', ', '.join)).reset_index()

print(grp)
   A     B_sum             C
0  1  1.615586  This, string
1  2  0.421821         is, !
2  3  0.463468             a
3  4  0.643961        random
Answered By: Erfan

Following @Erfan’s good answer: most of the time, when analyzing aggregated values, you want only the unique string values that occur in each group:

unique_chars = lambda x: ', '.join(x.unique())
(df
 .groupby(['A'])
 .agg({'C': unique_chars}))
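
The question's data has no repeated strings within a group, so the effect of unique() is easier to see on a small made-up frame. The values below are illustrative only:

```python
import pandas as pd

# Hypothetical data with a repeated value ("x") inside group 1.
df = pd.DataFrame({
    "A": [1, 1, 1, 2],
    "C": ["x", "y", "x", "z"],
})

# unique() keeps the first occurrence of each value, in order of appearance.
unique_joined = df.groupby("A")["C"].agg(lambda s: ", ".join(s.unique()))
plain_joined = df.groupby("A")["C"].agg(", ".join)

print(unique_joined)
print(plain_joined)
```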
Answered By: Paul Rougieux