pandas groupby and join lists
Question:
I have a dataframe df with two columns. I want to group by one column and join the lists that belong to the same group, for example:
column_a, column_b
1, [1,2,3]
1, [2,5]
2, [5,6]
after the process:
column_a, column_b
1, [1,2,3,2,5]
2, [5,6]
I want to keep all the duplicates. I have the following questions:
- The dtypes of the dataframe are object. convert_objects() doesn’t convert column_b to lists automatically. How can I do this?
- What does the function in df.groupby(…).apply(lambda x: …) apply to? What is the form of x? A list?
- What is the solution to my main problem?
Thanks in advance.
Answers:
The object dtype is a catch-all dtype that basically means "not int, float, bool, datetime, or timedelta", so pandas is already storing each cell of column_b as a list. convert_objects tries to convert a column to one of those dtypes, which is why it leaves column_b alone. (Note that convert_objects has since been deprecated.)
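A quick check illustrates the point: a column built from Python lists reports object as its dtype, and there is no dedicated "list" dtype to convert it to.

```python
import pandas as pd

# Columns built from Python lists get the catch-all object dtype;
# pandas has no dedicated "list" dtype to convert them to.
df = pd.DataFrame({'column_a': [1, 1, 2],
                   'column_b': [[1, 2, 3], [2, 5], [5, 6]]})
print(df.dtypes)
# column_a     int64
# column_b    object
```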
You want
In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz
In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]
This groups the data frame by the values in column a (read more about groupby). The 'b': 'sum' part does a regular list sum (concatenation), just like [1, 2, 3] + [2, 5], with the result [1, 2, 3, 2, 5].
df.groupby('column_a').agg(sum)

This works because of operator overloading: sum concatenates the lists together. The index of the resulting df will be the values from column_a.
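As a minimal sketch of the operator-overloading point, the builtin sum with an empty-list start value does the same concatenation explicitly, via repeated `+` on the group's lists:

```python
import pandas as pd

df = pd.DataFrame({'column_a': [1, 1, 2],
                   'column_b': [[1, 2, 3], [2, 5], [5, 6]]})

# sum(lists, []) starts from [] and applies list + list for each row in the group
out = df.groupby('column_a')['column_b'].agg(lambda lists: sum(lists, []))
print(out.loc[1])  # [1, 2, 3, 2, 5]
```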
Use numpy and a simple "for" or "map":

import numpy as np
import pandas as pd

u_clm = np.unique(df.column_a.values)
all_lists = []
for clm in u_clm:
    # take only the rows of the current group before concatenating
    df_process = df.query('column_a == @clm')
    list_ = np.concatenate(df_process.column_b.values)
    all_lists.append((clm, list_.tolist()))
df_sum_lists = pd.DataFrame(all_lists, columns=['column_a', 'column_b'])
For huge datasets this is around 350 times faster than the plain "groupby-agg-sum" approach.
The approach proposed above using df.groupby('column_a').agg(sum) definitely works. However, you have to make sure that your lists contain only integers; otherwise the output will not be the same.
If you want to convert all of the list items into integers, you can use:

df['column_b'] = df['column_b'].apply(lambda x: list(map(int, x)))
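To illustrate with assumed example data: if the list items are strings, the group-wise concatenation still "works", but the result holds strings rather than numbers, which is why the conversion above may be needed first.

```python
import pandas as pd

# Hypothetical example: the lists hold strings, so we convert their
# items to int before any aggregation.
df = pd.DataFrame({'column_a': [1, 1],
                   'column_b': [['1', '2'], ['3']]})
df['column_b'] = df['column_b'].apply(lambda x: list(map(int, x)))
print(df['column_b'].tolist())  # [[1, 2], [3]]
```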
The accepted answer suggests using groupby.sum, which works fine with a small number of lists; however, using sum to concatenate lists is quadratic. For a larger number of lists, a much faster option is itertools.chain or a list comprehension:
df = pd.DataFrame({'column_a': ['1', '1', '2'],
'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
itertools.chain:
from itertools import chain
out = (df.groupby('column_a', as_index=False)['column_b']
.agg(lambda x: list(chain.from_iterable(x)))
)
list comprehension:
out = (df.groupby('column_a', as_index=False, sort=False)['column_b']
.agg(lambda x: [e for l in x for e in l])
)
output:
column_a column_b
0 1 [1, 2, 3, 2, 5]
1 2 [5, 6]
Comparison of speed

Using n repeats of the example to show the impact of the number of lists to merge:

test_df = pd.concat([df]*n, ignore_index=True)

NB. also comparing the numpy approach (agg(lambda x: np.concatenate(x.to_numpy()).tolist())).
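The quadratic-vs-linear difference can be reproduced with a small timeit sketch (n = 1000 is an arbitrary choice for illustration):

```python
import timeit
from itertools import chain

import pandas as pd

df = pd.DataFrame({'column_a': ['1', '1', '2'],
                   'column_b': [['1', '2', '3'], ['2', '5'], ['5', '6']]})
# n = 1000 repeats of the example, as in the benchmark setup above
test_df = pd.concat([df] * 1000, ignore_index=True)

# quadratic: every `+` copies the list accumulated so far
t_sum = timeit.timeit(
    lambda: test_df.groupby('column_a')['column_b'].agg(lambda x: sum(x, [])),
    number=5)

# linear: chain.from_iterable walks each input list exactly once
t_chain = timeit.timeit(
    lambda: test_df.groupby('column_a')['column_b'].agg(
        lambda x: list(chain.from_iterable(x))),
    number=5)

print(f'sum: {t_sum:.3f}s  chain: {t_chain:.3f}s')
```

Both aggregations produce the same lists; only the cost of building them differs.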
Thanks, this helped me:

merge.fillna("", inplace=True)
new_merge = merge.groupby(['id']).agg({
    'q1': lambda x: ','.join(x),
    'q2': lambda x: ','.join(x),
    'q2_bookcode': lambda x: ','.join(x),
    'q1_bookcode': lambda x: ','.join(x),
})