Can pandas groupby aggregate into a list, rather than sum, mean, etc?

Question:

I’ve had success using the groupby function to sum or average a given variable by groups, but is there a way to aggregate into a list of values, rather than to get a single result? (And would this still be called aggregation?)

I am not entirely sure this is the approach I should be taking anyhow, so below is an example of the transformation I’d like to make, with toy data.

That is, if the data look something like this:

    A    B    C  
    1    10   22
    1    12   20
    1    11   8
    1    10   10
    2    11   13
    2    12   10 
    3    14   0

What I am trying to end up with is something like the following. I am not totally sure whether this can be done through groupby aggregating into lists, and am rather lost as to where to go from here.

Hypothetical output:

     A    B    C  New1  New2  New3  New4  New5  New6
    1    10   22  12    20    11    8     10    10
    2    11   13  12    10 
    3    14   0

Perhaps I should be pursuing pivots instead? The order by which the data are put into columns does not matter – all columns B through New6 in this example are equivalent. All suggestions/corrections are much appreciated.

Asked By: M.A.Kline

||

Answers:

my solution is a bit longer than you may expect, I’m sure it could be shortened, but:

g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"])))
k = g.reset_index()
k["i"] = k1.index
k["rn"] = k1.groupby("A")["i"].rank()
k.pivot_table(rows="A", cols="rn", values=0)

# output
# rn   1   2   3   4   5   6
# A                         
# 1   10  12  11  22  20   8
# 2   10  11  10  13 NaN NaN
# 3   14  10 NaN NaN NaN NaN

A bit of explanation. First line, g = df.groupby("A").apply(lambda x: pd.concat((x["B"], x["C"]))). This one group df by A and then put columns B and C into one column:

A   
1  0    10
   1    12
   2    11
   0    22
   1    20
   2     8
2  3    10
   4    11
   3    10
   4    13
3  5    14
   5    10

Then k = g.reset_index(), creating sequential index, result is:

    A  level_1   0
0   1        0  10
1   1        1  12
2   1        2  11
3   1        0  22
4   1        1  20
5   1        2   8
6   2        3  10
7   2        4  11
8   2        3  10
9   2        4  13
10  3        5  14
11  3        5  10

Now I want to move this index into column (I’d like to hear how I can make a sequential column without resetting index), k["i"] = k1.index:

    A  level_1   0   i
0   1        0  10   0
1   1        1  12   1
2   1        2  11   2
3   1        0  22   3
4   1        1  20   4
5   1        2   8   5
6   2        3  10   6
7   2        4  11   7
8   2        3  10   8
9   2        4  13   9
10  3        5  14  10
11  3        5  10  11

Now, k["rn"] = k1.groupby("A")["i"].rank() will add row_number inside each A (like row_number() over(partition by A order by i) in SQL:

    A  level_1   0   i  rn
0   1        0  10   0   1
1   1        1  12   1   2
2   1        2  11   2   3
3   1        0  22   3   4
4   1        1  20   4   5
5   1        2   8   5   6
6   2        3  10   6   1
7   2        4  11   7   2
8   2        3  10   8   3
9   2        4  13   9   4
10  3        5  14  10   1
11  3        5  10  11   2

And finally, just pivoting with k.pivot_table(rows="A", cols="rn", values=0):

rn   1   2   3   4   5   6
A                         
1   10  12  11  22  20   8
2   10  11  10  13 NaN NaN
3   14  10 NaN NaN NaN NaN
Answered By: Roman Pekar

I have been struggling with the exact same issues, and the answer is that yes you can use grouby to obtain lists. I am not 100% sure I am doing this in the most pythonic way, but here for what its worth is my attempt to get at your question. You can create lists of the data contained in the bygroups like this:

import pandas as pd
import numpy as np
from itertools import chain

Data = {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20,     8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')

OutputLists = []

for group in DFGrouped:
    AList = list(group[1].A)
    BList = list(group[1].B)
    CList = list(group[1].C)
    print list(group[1].A)
    print list(group[1].B)
    print list(group[1].C)
    ZIP =  zip(BList, CList)
    print ZIP
    OutputLists.append(list(chain(*ZIP)))

OutputLists

This outputs your data in a list of lists, in the way that I think you want it. You can then make it a data frame. The above print statements are for illustrative purposes only clearly. The most efficient (in terms of code) way to do this using my method is as follows:

import pandas as pd
import numpy as np
from itertools import chain

Data = {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20, 8, 10, 13, 10, 0]}
DF = pd.DataFrame(Data)
DFGrouped = DF.groupby('A')
OutputLists = []
for group in DFGrouped:
    ZIPPED = zip(group[1].B, group[1].C)
    OutputLists.append(list(chain(*ZIPPED)))
OutputLists

The key to getting lists out of grouped data as far as I can tell is to recognise that the data themselves are stored in group[1] for each group in your grouped data.

hope this helps!

Answered By: Woody Pride

I am answering the question as stated in its title and first sentence: the following aggregates values to lists:

df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())

Below this is demonstrated in a simple example:

import pandas as pd

df = pd.DataFrame( {'A' : [1, 1, 1, 1, 2, 2, 3], 'B' : [10, 12, 11, 10, 11, 12, 14], 'C' : [22, 20,     8, 10, 13, 10, 0]})
print df

df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())
print df2
# Old version:
# df2=df.groupby(['A']).apply(lambda tdf: pd.Series(  dict([[vv,tdf[vv].unique().tolist()] for vv in tdf if vv not in ['A']])  )) 

The output is as follows:

In [3]: run tmp
   A   B   C
0  1  10  22
1  1  12  20
2  1  11   8
3  1  10  10
4  2  11  13
5  2  12  10
6  3  14   0

[7 rows x 3 columns]
              B                C
A                               
1  [10, 12, 11]  [22, 20, 8, 10]
2      [11, 12]         [13, 10]
3          [14]              [0]

[3 rows x 2 columns]
Answered By: CPBL

I used the following

grouped = df.groupby('A')

df = grouped.aggregate(lambda x: tuple(x))

df['grouped'] = df['B'] + df['C']
Answered By: user2623954

Here is a one liner

# if list of unique items is desired, use set
df.groupby('A',as_index=False)['B'].aggregate(lambda x: set(x))

# if duplicate items are okay, use list
df.groupby('A',as_index=False)['B'].aggregate(lambda x: list(x))
Answered By: muon

Similar solution, but fairly transparent (I think). you can get full list or unique lists.

df = pd.DataFrame({'A':[1,1,2,2,2,3,3,3,4,5], 
                   'B':[6,7, 8,8,9, 9,9,10,11,12], 
                   'C':['foo']*10})

df
Out[24]: 
   A   B    C
0  1   6  foo
1  1   7  foo
2  2   8  foo
3  2   8  foo
4  2   9  foo
5  3   9  foo
6  3   9  foo
7  3  10  foo
8  4  11  foo
9  5  12  foo

list_agg = df.groupby(by='A').agg({'B':lambda x: list(x), 
                                   'C':lambda x: tuple(x)})

list_agg
Out[26]: 
                 C           B
A                             
1       (foo, foo)      [6, 7]
2  (foo, foo, foo)   [8, 8, 9]
3  (foo, foo, foo)  [9, 9, 10]
4           (foo,)        [11]
5           (foo,)        [12]

unique_list_agg = df.groupby(by='A').agg({'B':lambda x: list(pd.unique(x)), 
                                          'C':lambda x: tuple(pd.unique(x))})

unique_list_agg
Out[28]: 
        C        B
A                 
1  (foo,)   [6, 7]
2  (foo,)   [8, 9]
3  (foo,)  [9, 10]
4  (foo,)     [11]
5  (foo,)     [12]
Answered By: Data-phile
df2 = df.groupby('A').aggregate(lambda tdf: tdf.unique().tolist())

This seems to work perfect, but the resultant dataframe has two layers of columns and df.columns shows only one column in the dataframe.
To correct this, use:

df2_copy=df2.copy()
df2_copy = df2_copy.reset_index(col_level=0)

You can view the column levels using:
df2_copy.columns=df2_copy.columns.get_level_values(0)

the df2_copy() should solve this.

Answered By: SummmerFort
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.