Groupby specific to list of concatenated items

Question:

Let it be the following Python Panda DataFrame:

| ID   | time    | vector       | other |
| ---- | ------- | ------------ | ----- |
| 0    | 81219   | [N1, N2, N3] | blue  |
| 0    | 32323   | [N1, N4]     | green |
| 1    | 423     | [N2, N3]     | red   | 
| 1    | 4232    | [N1]         | blue  |
| 1    | 42424   | [N2, N2]     | red   |
| 2    | 42422   | [N3, N5]     | blue  |

I want to create a grouped DataFrame containing for each ID a column vector with the concatenation of the vectors for each element.

Example of the result:

| ID   | vector               |
| ---- | -------------------- |
| 0    | [N1, N2, N3, N1, N4] |
| 1    | [N2, N3, N1, N2, N2] | 
| 2    | [N3, N5]             |
Asked By: Carola

||

Answers:

One option could be to use groupby.sum:

out = df.groupby('ID', as_index=False)['vector'].sum()

However, if you have many lists, a repeated sum will be quadratic.

In this case, prefer to use itertools.chain:

from itertools import chain
out = (df.groupby('ID', as_index=False)['vector']
         .agg(lambda x: list(chain.from_iterable(x)))
       )

output:

   ID                vector
0   0  [N1, N2, N3, N1, N4]
1   1  [N2, N3, N1, N2, N2]
2   2              [N3, N5]

timing

Let’s use 10k repeats of the example and compare sum and itertools.chain:

df = pd.DataFrame({'ID': [0, 0, 1, 1, 1, 2],
                   'vector': [['N1', 'N2', 'N3'], ['N1', 'N4'], ['N2', 'N3'],
                              ['N1'], ['N2', 'N2'], ['N3', 'N5']]})
df = pd.concat([df]*10000, ignore_index=True)

%%timeit
df.groupby('ID', as_index=False)['vector'].sum()
# this is very slow!
# 1.86 s ± 38.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
from itertools import chain
df.groupby('ID', as_index=False)['vector'].agg(lambda x: list(chain.from_iterable(x)))
# much faster!
# 8.02 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answered By: mozway
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.