Add column of empty lists to DataFrame

Question:

Similar to this question How to add an empty column to a dataframe?, I am interested in knowing the best way to add a column of empty lists to a DataFrame.

What I am trying to do is initialize a column and then, as I iterate over the rows and process some of them, replace the initialized value in this new column with a filled list.

For example, if below is my initial DataFrame:

df = pd.DataFrame(data={'a': [1, 2, 3], 'b': [5, 6, 7]})  # Sample DataFrame

>>> df
   a  b
0  1  5
1  2  6
2  3  7

Then I want to ultimately end up with something like this, where each row has been processed separately (sample results shown):

>>> df
   a  b          c
0  1  5     [5, 6]
1  2  6     [9, 0]
2  3  7  [1, 2, 3]

Of course, if I try to initialize with df['e'] = [], as I would with any other constant, pandas treats it as a sequence of items with length 0, and the assignment fails.

If I try initializing a new column as None or NaN, I run into the following issues when trying to assign a list to a location.

df['d'] = None

>>> df
   a  b     d
0  1  5  None
1  2  6  None
2  3  7  None

Issue 1 (it would be perfect if I can get this approach to work! Maybe something trivial I am missing):

>>> df.loc[0,'d'] = [1,3]

...
ValueError: Must have equal len keys and value when setting with an iterable

Issue 2 (this one works, but not without a warning because it is not guaranteed to work as intended):

>>> df['d'][0] = [1,3]

C:\Python27\Scripts\ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

Hence I resort to initializing with empty lists and extending them as needed. There are a couple of methods I can think of to initialize this way, but is there a more straightforward way?

Method 1:

df['empty_lists1'] = [list() for x in range(len(df.index))]

>>> df
   a  b   empty_lists1
0  1  5             []
1  2  6             []
2  3  7             []

Method 2:

df['empty_lists2'] = df.apply(lambda x: [], axis=1)

>>> df
   a  b   empty_lists1   empty_lists2
0  1  5             []             []
1  2  6             []             []
2  3  7             []             []

Summary of questions:

Is there any minor syntax change that can be addressed in Issue 1 that can allow a list to be assigned to a None/NaN initialized field?

If not, then what is the best way to initialize a new column with empty lists?

Asked By: vk1011

||

Answers:

One more way is to use np.empty:

df['empty_list'] = np.empty((len(df), 0)).tolist()

You could also drop .index in your “Method 1”, since len(df) already gives the number of rows.

df['empty_list'] = [[] for _ in range(len(df))]
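As a quick check that the np.empty approach gives every row its own list, here is a minimal sketch on the question's sample frame. It assumes nothing beyond the column name used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})

# A (len(df), 0) array converts to one separate empty list per row.
df['empty_list'] = np.empty((len(df), 0)).tolist()

# Appending to row 0 should leave the other rows untouched.
df.at[0, 'empty_list'].append(1)
result = df['empty_list'].tolist()
print(result)
```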

Turns out, np.empty is faster…

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.rand(1000000, 5))

In [3]: timeit df['empty1'] = np.empty((len(df), 0)).tolist()
10 loops, best of 3: 127 ms per loop

In [4]: timeit df['empty2'] = [[] for _ in range(len(df))]
10 loops, best of 3: 193 ms per loop

In [5]: timeit df['empty3'] = df.apply(lambda x: [], axis=1)
1 loops, best of 3: 5.89 s per loop
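
The comparison could be re-run outside IPython with the standard-library timeit module, roughly as follows. This is only a sketch: the frame size, repeat count, and helper names are arbitrary choices made for illustration, and timings vary by machine and version, so no numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10_000, 5))  # smaller than above to keep it quick


def with_np_empty():
    # Hypothetical helper wrapping the np.empty approach.
    return np.empty((len(df), 0)).tolist()


def with_comprehension():
    # Hypothetical helper wrapping the list-comprehension approach.
    return [[] for _ in range(len(df))]


t_np = timeit.timeit(with_np_empty, number=20)
t_comp = timeit.timeit(with_comprehension, number=20)
print(f"np.empty: {t_np:.4f} s, comprehension: {t_comp:.4f} s")
```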

Answered By: ComputerFellow

EDIT: the commenters caught the bug in my answer

s = pd.Series([[]] * 3)
s.iloc[0].append(1) #adding an item only to the first element
>>> s  # unintended consequences:
0    [1]
1    [1]
2    [1]

So, the correct solution is

s = pd.Series([[] for i in range(3)])
s.iloc[0].append(1)
>>> s
0    [1]
1     []
2     []

OLD:

I timed all three methods in the accepted answer; the fastest one took 216 ms on my machine. However, this took only 28 ms:

df['empty4'] = [[]] * len(df)

Note: Similarly, df['e5'] = [set()] * len(df) also took 28ms.

Answered By: tozCSS

Canonical solutions: List comprehension, map and apply

Obligatory disclaimer: avoid using lists in pandas columns where possible. List columns are slow to work with because they are stored as Python objects, which are inherently hard to vectorize.

With that out of the way, here are the canonical methods of introducing a column of empty lists:

# List comprehension
df['c'] = [[] for _ in range(df.shape[0])]
df

   a  b   c
0  1  5  []
1  2  6  []
2  3  7  []
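
For the workflow described in the question (initialize, then fill some rows while iterating), one possible pattern is to assign through df.at, which sets a single cell and can take a list when the column has object dtype. The sketch below is an assumption-laden illustration: the per-row rule (keep a and b when a is odd) is hypothetical and stands in for whatever processing is actually needed.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [5, 6, 7]})
df['c'] = [[] for _ in range(len(df))]  # object-dtype column of separate lists

for idx in df.index:
    a = int(df.at[idx, 'a'])
    b = int(df.at[idx, 'b'])
    # Hypothetical processing rule: only rows with odd a get a filled list.
    if a % 2 == 1:
        df.at[idx, 'c'] = [a, b]

filled = df['c'].tolist()
print(filled)
```

Rows that are skipped keep their own empty list, so they can still be extended later without affecting other rows.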

There are also these shorthands involving apply and map:

from collections import defaultdict
# map any column with defaultdict
df['c'] = df.iloc[:,0].map(defaultdict(list))
# same as,
df['c'] = df.iloc[:,0].map(lambda _: [])

# apply with defaultdict
df['c'] = df.apply(defaultdict(list), axis=1) 
# same as,
df['c'] = df.apply(lambda _: [], axis=1)

df

   a  b   c
0  1  5  []
1  2  6  []
2  3  7  []
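
One caveat that may be worth checking before relying on the defaultdict shorthand with map: a defaultdict stores the list it creates for each key, so rows whose mapped values repeat would be expected to receive the same list object, whereas the lambda builds a new list per call. The sketch below uses a hypothetical Series with a repeated value to illustrate this; the sample frame above has unique values in its first column, so it does not show the effect.

```python
from collections import defaultdict

import pandas as pd

s = pd.Series([1, 1, 2])  # hypothetical column with a repeated value

via_defaultdict = s.map(defaultdict(list))
via_lambda = s.map(lambda _: [])

# The defaultdict returns the list already stored for key 1, so rows 0 and 1 alias.
shared = via_defaultdict.iloc[0] is via_defaultdict.iloc[1]
# The lambda creates a new list on every call, so rows 0 and 1 are distinct.
separate = via_lambda.iloc[0] is not via_lambda.iloc[1]
print(shared, separate)
```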

Things you should NOT do

Some folks believe multiplying an empty list is the way to go. Unfortunately, this is wrong and will usually lead to hard-to-debug issues. Here’s a minimal example:

# WRONG
df['c'] = [[]] * len(df) 
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df

   a  b           c
0  1  5  [abc, def]
1  2  6  [abc, def]
2  3  7  [abc, def]

# RIGHT
df['c'] = [[] for _ in range(df.shape[0])]
df.at[0, 'c'].append('abc')
df.at[1, 'c'].append('def')
df

   a  b      c
0  1  5  [abc]
1  2  6  [def]
2  3  7     []

In the first case, a single empty list is created and its reference is replicated across all the rows, so you see updates to one reflected to all of them. In the latter case each row is assigned its own empty list, so this is not a concern.
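
The difference can be seen without pandas at all by comparing object identity with the is operator. This is a minimal plain-Python sketch; the variable names are illustrative only.

```python
shared = [[]] * 3                      # one list object, referenced three times
independent = [[] for _ in range(3)]   # three separate list objects

shared_is_same = shared[0] is shared[1]
independent_is_same = independent[0] is independent[1]

shared[0].append('abc')        # visible through every reference
independent[0].append('abc')   # affects only the first list

print(shared_is_same, independent_is_same)
print(shared, independent)
```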

Answered By: cs95