Pandas column of lists, create a row for each list element

Question:

I have a dataframe where some cells contain lists of multiple values. Rather than storing multiple
values in a cell, I’d like to expand the dataframe so that each item in the list gets its own row (with the same values in all other columns). So if I have:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'trial_num': [1, 2, 3, 1, 2, 3],
     'subject': [1, 1, 1, 2, 2, 2],
     'samples': [list(np.random.randn(3).round(2)) for i in range(6)]
    }
)

df
Out[10]: 
                 samples  subject  trial_num
0    [0.57, -0.83, 1.44]        1          1
1    [-0.01, 1.13, 0.36]        1          2
2   [1.18, -1.46, -0.94]        1          3
3  [-0.08, -4.22, -2.05]        2          1
4     [0.72, 0.79, 0.53]        2          2
5    [0.4, -0.32, -0.13]        2          3

How do I convert to long form, e.g.:

   subject  trial_num  sample  sample_num
0        1          1    0.57           0
1        1          1   -0.83           1
2        1          1    1.44           2
3        1          2   -0.01           0
4        1          2    1.13           1
5        1          2    0.36           2
6        1          3    1.18           0
# etc.

The index is not important, it’s OK to set existing
columns as the index and the final ordering isn’t
important.

Asked By: Marius

||

Answers:

A bit longer than I expected:

>>> df
                samples  subject  trial_num
0  [-0.07, -2.9, -2.44]        1          1
1   [-1.52, -0.35, 0.1]        1          2
2  [-0.17, 0.57, -0.65]        1          3
3  [-0.82, -1.06, 0.47]        2          1
4   [0.79, 1.35, -0.09]        2          2
5   [1.17, 1.14, -1.79]        2          3
>>>
>>> s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
>>> s.name = 'sample'
>>>
>>> df.drop('samples', axis=1).join(s)
   subject  trial_num  sample
0        1          1   -0.07
0        1          1   -2.90
0        1          1   -2.44
1        1          2   -1.52
1        1          2   -0.35
1        1          2    0.10
2        1          3   -0.17
2        1          3    0.57
2        1          3   -0.65
3        2          1   -0.82
3        2          1   -1.06
3        2          1    0.47
4        2          2    0.79
4        2          2    1.35
4        2          2   -0.09
5        2          3    1.17
5        2          3    1.14
5        2          3   -1.79

If you want sequential index, you can apply reset_index(drop=True) to the result.

update:

>>> res = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack()
>>> res = res.reset_index()
>>> res.columns = ['subject','trial_num','sample_num','sample']
>>> res
    subject  trial_num  sample_num  sample
0         1          1           0    1.89
1         1          1           1   -2.92
2         1          1           2    0.34
3         1          2           0    0.85
4         1          2           1    0.24
5         1          2           2    0.72
6         1          3           0   -0.96
7         1          3           1   -2.72
8         1          3           2   -0.11
9         2          1           0   -1.33
10        2          1           1    3.13
11        2          1           2   -0.65
12        2          2           0    0.10
13        2          2           1    0.65
14        2          2           2    0.15
15        2          3           0    0.64
16        2          3           1   -0.10
17        2          3           2   -0.76
Answered By: Roman Pekar

Trying to work through Roman Pekar’s solution step-by-step to understand it better, I came up with my own solution, which uses melt to avoid some of the confusing stacking and index resetting. I can’t say that it’s obviously a clearer solution though:

items_as_cols = df.apply(lambda x: pd.Series(x['samples']), axis=1)
# Keep original df index as a column so it's retained after melt
items_as_cols['orig_index'] = items_as_cols.index

melted_items = pd.melt(items_as_cols, id_vars='orig_index', 
                       var_name='sample_num', value_name='sample')
melted_items.set_index('orig_index', inplace=True)

df.merge(melted_items, left_index=True, right_index=True)

Output (obviously we can drop the original samples column now):

                 samples  subject  trial_num sample_num  sample
0    [1.84, 1.05, -0.66]        1          1          0    1.84
0    [1.84, 1.05, -0.66]        1          1          1    1.05
0    [1.84, 1.05, -0.66]        1          1          2   -0.66
1    [-0.24, -0.9, 0.65]        1          2          0   -0.24
1    [-0.24, -0.9, 0.65]        1          2          1   -0.90
1    [-0.24, -0.9, 0.65]        1          2          2    0.65
2    [1.15, -0.87, -1.1]        1          3          0    1.15
2    [1.15, -0.87, -1.1]        1          3          1   -0.87
2    [1.15, -0.87, -1.1]        1          3          2   -1.10
3   [-0.8, -0.62, -0.68]        2          1          0   -0.80
3   [-0.8, -0.62, -0.68]        2          1          1   -0.62
3   [-0.8, -0.62, -0.68]        2          1          2   -0.68
4    [0.91, -0.47, 1.43]        2          2          0    0.91
4    [0.91, -0.47, 1.43]        2          2          1   -0.47
4    [0.91, -0.47, 1.43]        2          2          2    1.43
5  [-1.14, -0.24, -0.91]        2          3          0   -1.14
5  [-1.14, -0.24, -0.91]        2          3          1   -0.24
5  [-1.14, -0.24, -0.91]        2          3          2   -0.91
Answered By: Marius

you can also use pd.concat and pd.melt for this:

>>> objs = [df, pd.DataFrame(df['samples'].tolist())]
>>> pd.concat(objs, axis=1).drop('samples', axis=1)
   subject  trial_num     0     1     2
0        1          1 -0.49 -1.00  0.44
1        1          2 -0.28  1.48  2.01
2        1          3 -0.52 -1.84  0.02
3        2          1  1.23 -1.36 -1.06
4        2          2  0.54  0.18  0.51
5        2          3 -2.18 -0.13 -1.35
>>> pd.melt(_, var_name='sample_num', value_name='sample', 
...         value_vars=[0, 1, 2], id_vars=['subject', 'trial_num'])
    subject  trial_num sample_num  sample
0         1          1          0   -0.49
1         1          2          0   -0.28
2         1          3          0   -0.52
3         2          1          0    1.23
4         2          2          0    0.54
5         2          3          0   -2.18
6         1          1          1   -1.00
7         1          2          1    1.48
8         1          3          1   -1.84
9         2          1          1   -1.36
10        2          2          1    0.18
11        2          3          1   -0.13
12        1          1          2    0.44
13        1          2          2    2.01
14        1          3          2    0.02
15        2          1          2   -1.06
16        2          2          2    0.51
17        2          3          2   -1.35

last, if you need you can sort base on the first the first three columns.

Answered By: behzad.nouri

For those looking for a version of Roman Pekar’s answer that avoids manual column naming:

column_to_explode = 'samples'
res = (df
       .set_index([x for x in df.columns if x != column_to_explode])[column_to_explode]
       .apply(pd.Series)
       .stack()
       .reset_index())
res = res.rename(columns={
          res.columns[-2]:'exploded_{}_index'.format(column_to_explode),
          res.columns[-1]: '{}_exploded'.format(column_to_explode)})
Answered By: Charles Davis

UPDATE: the solution below was helpful for older Pandas versions, because the DataFrame.explode() wasn’t available. Starting from Pandas 0.25.0 you can simply use DataFrame.explode().


lst_col = 'samples'

r = pd.DataFrame({
      col:np.repeat(df[col].values, df[lst_col].str.len())
      for col in df.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(df[lst_col].values)})[df.columns]

Result:

In [103]: r
Out[103]:
    samples  subject  trial_num
0      0.10        1          1
1     -0.20        1          1
2      0.05        1          1
3      0.25        1          2
4      1.32        1          2
5     -0.17        1          2
6      0.64        1          3
7     -0.22        1          3
8     -0.71        1          3
9     -0.03        2          1
10    -0.65        2          1
11     0.76        2          1
12     1.77        2          2
13     0.89        2          2
14     0.65        2          2
15    -0.98        2          3
16     0.65        2          3
17    -0.30        2          3

PS here you may find a bit more generic solution


UPDATE: some explanations: IMO the easiest way to understand this code is to try to execute it step-by-step:

in the following line we are repeating values in one column N times where N – is the length of the corresponding list:

In [10]: np.repeat(df['trial_num'].values, df[lst_col].str.len())
Out[10]: array([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=int64)

this can be generalized for all columns, containing scalar values:

In [11]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         )
Out[11]:
    trial_num  subject
0           1        1
1           1        1
2           1        1
3           2        1
4           2        1
5           2        1
6           3        1
..        ...      ...
11          1        2
12          2        2
13          2        2
14          2        2
15          3        2
16          3        2
17          3        2

[18 rows x 2 columns]

using np.concatenate() we can flatten all values in the list column (samples) and get a 1D vector:

In [12]: np.concatenate(df[lst_col].values)
Out[12]: array([-1.04, -0.58, -1.32,  0.82, -0.59, -0.34,  0.25,  2.09,  0.12,  0.83, -0.88,  0.68,  0.55, -0.56,  0.65, -0.04,  0.36, -0.31])

putting all this together:

In [13]: pd.DataFrame({
    ...:           col:np.repeat(df[col].values, df[lst_col].str.len())
    ...:           for col in df.columns.drop(lst_col)}
    ...:         ).assign(**{lst_col:np.concatenate(df[lst_col].values)})
Out[13]:
    trial_num  subject  samples
0           1        1    -1.04
1           1        1    -0.58
2           1        1    -1.32
3           2        1     0.82
4           2        1    -0.59
5           2        1    -0.34
6           3        1     0.25
..        ...      ...      ...
11          1        2     0.68
12          2        2     0.55
13          2        2    -0.56
14          2        2     0.65
15          3        2    -0.04
16          3        2     0.36
17          3        2    -0.31

[18 rows x 3 columns]

using pd.DataFrame()[df.columns] will guarantee that we are selecting columns in the original order…

I found the easiest way was to:

  1. Convert the samples column into a DataFrame
  2. Joining with the original df
  3. Melting

Shown here:

    df.samples.apply(lambda x: pd.Series(x)).join(df).
melt(['subject','trial_num'],[0,1,2],var_name='sample')

        subject  trial_num sample  value
    0         1          1      0  -0.24
    1         1          2      0   0.14
    2         1          3      0  -0.67
    3         2          1      0  -1.52
    4         2          2      0  -0.00
    5         2          3      0  -1.73
    6         1          1      1  -0.70
    7         1          2      1  -0.70
    8         1          3      1  -0.29
    9         2          1      1  -0.70
    10        2          2      1  -0.72
    11        2          3      1   1.30
    12        1          1      2  -0.55
    13        1          2      2   0.10
    14        1          3      2  -0.44
    15        2          1      2   0.13
    16        2          2      2  -1.44
    17        2          3      2   0.73

It’s worth noting that this may have only worked because each trial has the same number of samples (3). Something more clever may be necessary for trials of different sample sizes.

Answered By: Michael Silverstein

Very late answer but I want to add this:

A fast solution using vanilla Python that also takes care of the sample_num column in OP’s example. On my own large dataset with over 10 million rows and a result with 28 million rows this only takes about 38 seconds. The accepted solution completely breaks down with that amount of data and leads to a memory error on my system that has 128GB of RAM.

df = df.reset_index(drop=True)
lstcol = df.lstcol.values
lstcollist = []
indexlist = []
countlist = []
for ii in range(len(lstcol)):
    lstcollist.extend(lstcol[ii])
    indexlist.extend([ii]*len(lstcol[ii]))
    countlist.extend([jj for jj in range(len(lstcol[ii]))])
df = pd.merge(df.drop("lstcol",axis=1),pd.DataFrame({"lstcol":lstcollist,"lstcol_num":countlist},
index=indexlist),left_index=True,right_index=True).reset_index(drop=True)
Answered By: Khris

Pandas >= 0.25

Series and DataFrame methods define a .explode() method that explodes lists into separate rows. See the docs section on Exploding a list-like column.

df = pd.DataFrame({
    'var1': [['a', 'b', 'c'], ['d', 'e',], [], np.nan], 
    'var2': [1, 2, 3, 4]
})
df
        var1  var2
0  [a, b, c]     1
1     [d, e]     2
2         []     3
3        NaN     4

df.explode('var1')

  var1  var2
0    a     1
0    b     1
0    c     1
1    d     2
1    e     2
2  NaN     3  # empty list converted to NaN
3  NaN     4  # NaN entry preserved as-is

# to reset the index to be monotonically increasing...
df.explode('var1').reset_index(drop=True)

  var1  var2
0    a     1
1    b     1
2    c     1
3    d     2
4    e     2
5  NaN     3
6  NaN     4

Note that this also handles mixed columns of lists and scalars, as well as empty lists and NaNs appropriately (this is a drawback of repeat-based solutions).

However, you should note that explode only works on a single column (for now).

P.S.: if you are looking to explode a column of strings, you need to split on a separator first, then use explode. See this (very much) related answer by me.

Answered By: cs95
import pandas as pd
df = pd.DataFrame([{'Product': 'Coke', 'Prices': [100,123,101,105,99,94,98]},{'Product': 'Pepsi', 'Prices': [101,104,104,101,99,99,99]}])
print(df)
df = df.assign(Prices=df.Prices.str.split(',')).explode('Prices')
print(df)

Try this in pandas >=0.25 version

Answered By: Tapas

Also very late, but here is an answer from Karvy1 that worked well for me if you don’t have pandas >=0.25 version: https://stackoverflow.com/a/52511166/10740287

For the example above you may write:

data = [(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples]
data = pd.DataFrame(data, columns=['subject', 'trial_num', 'samples'])

Speed test:

%timeit data = pd.DataFrame([(row.subject, row.trial_num, sample) for row in df.itertuples() for sample in row.samples], columns=['subject', 'trial_num', 'samples'])

1.33 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit data = df.set_index(['subject', 'trial_num'])['samples'].apply(pd.Series).stack().reset_index()

4.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit data = pd.DataFrame({col:np.repeat(df[col].values, df['samples'].str.len())for col in df.columns.drop('samples')}).assign(**{'samples':np.concatenate(df['samples'].values)})

1.38 ms ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Answered By: Rémy Pétremand
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.