Pandas convert a column of list to dummies

Question

I have a dataframe where one column is a list of groups each of my users belongs to. Something like:

index groups  
0     ['a','b','c']
1     ['c']
2     ['b','c','e']
3     ['a','c']
4     ['b','e']

What I would like to do is create a set of dummy columns that identify which groups each user belongs to, so that I can run some analyses:

index  a   b   c   d   e
0      1   1   1   0   0
1      0   0   1   0   0
2      0   1   1   0   1
3      1   0   1   0   0
4      0   1   0   0   1


pd.get_dummies(df['groups'])

won’t work because that just returns a column for each different list in my column.

The solution needs to be efficient as the dataframe will contain 500,000+ rows.

Asked By: user2900369

Answer 1

Using s for your df['groups']:

In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })

In [22]: s
Out[22]:
0    [a, b, c]
1          [c]
2    [b, c, e]
3       [a, c]
4       [b, e]
dtype: object

This is a possible solution:

In [23]: pd.get_dummies(s.explode()).groupby(level=0).sum()
Out[23]:
   a  b  c  e
0  1  1  1  0
1  0  0  1  0
2  0  1  1  1
3  1  0  1  0
4  0  1  0  1

The logic of this is:

.explode() flattens the series of lists into a series of single values (the index keeps track of the original row number)
pd.get_dummies() creates one dummy column per distinct value
.groupby(level=0).sum() combines the rows that belong to the same original row (by summing within each index value, level=0, i.e. the original row number)
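The three steps above can be sketched one at a time as follows. The final reindex is an addition that is not part of the original answer: it assumes the full set of labels is known up front (the hypothetical all_labels list), so that a label nobody uses, such as the "d" column in the question's desired output, still gets an all-zero column.

```python
import pandas as pd

s = pd.Series({0: ["a", "b", "c"], 1: ["c"], 2: ["b", "c", "e"],
               3: ["a", "c"], 4: ["b", "e"]})

# Step 1: one row per (original row, label); the index repeats the row number.
exploded = s.explode()

# Step 2: one indicator column per distinct label.
indicators = pd.get_dummies(exploded)

# Step 3: collapse the repeated index values back to one row per original row.
dummies = indicators.groupby(level=0).sum()

# Assumption: the complete label set is known in advance, so labels that
# never occur (here "d") can be added as all-zero columns.
all_labels = ["a", "b", "c", "d", "e"]
dummies = dummies.reindex(columns=all_labels, fill_value=0)
print(dummies)
```

Note that summing counts repeated labels within a list; if lists may contain duplicates, replacing .sum() with .max() in step 3 should keep the result to 0/1 indicators.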

I don’t know whether this will be efficient enough, but in any case, if performance is important, storing lists in a dataframe is not a very good idea.

Updates since original answer

Since version 0.25, s.explode() can be used to flatten the Series of lists, instead of the original s.apply(pd.Series).stack()
Since version 1.3.0, the level keyword in aggregations is deprecated (it was removed in pandas 2.0), so it is recommended to use df.groupby(level=0).sum() instead of df.sum(level=0)

Answered By: joris

Answer 2

Even though this question was already answered, I have a faster solution:

df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')

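And, in case you have empty groups or NaN, you can filter those rows out first (note that the groups column has to be selected before calling apply):

df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')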
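A minimal sketch of the empty-list/NaN case, under some assumptions of mine: the sample data is made up, .fillna(0).astype(int) is used in place of the downcast argument, and a final reindex (not in the original answer) puts the filtered-out rows back as all-zero rows so the result lines up with df.

```python
import pandas as pd

# Hypothetical data: row 1 is an empty list, row 2 is missing.
df = pd.DataFrame({"groups": [["a", "b"], [], None, ["b", "c"]]})

# .str.len() gives NaN for missing values, and NaN > 0 is False,
# so the mask keeps only rows that hold a non-empty list.
mask = df["groups"].str.len() > 0

dummies = (
    df.loc[mask, "groups"]
    .apply(lambda x: pd.Series(1, index=x))  # one Series per row, labels as index
    .fillna(0)
    .astype(int)
    .reindex(df.index, fill_value=0)  # restore filtered rows as all zeros
)
print(dummies)
```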

How it works

Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:

In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]: 
a    1
b    1
c    1
dtype: int64

When all the pd.Series objects are combined, they become a pd.DataFrame and their indexes become columns; a label missing from a given Series shows up as NaN in that row, as you can see next:

In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]: 
     a    b    c    d
0  1.0  1.0  1.0  NaN
1  1.0  1.0  NaN  1.0

Now fillna fills those NaN with 0:

In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]: 
     a    b    c    d
0  1.0  1.0  1.0  0.0
1  1.0  1.0  0.0  1.0

And downcast='infer' is to downcast from float to int:

In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]: 
   a  b  c  d
0  1  1  1  0
1  1  1  0  1

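PS: Using .fillna(0, downcast='infer') is not required. Note also that the downcast argument is deprecated since pandas 2.2; .fillna(0).astype(int) gives the same result here.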

Answered By: Paulo Alves

Answer 3

Very fast solution in case you have a large dataframe

Using sklearn.preprocessing.MultiLabelBinarizer

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {'groups':
        [['a','b','c'],
        ['c'],
        ['b','c','e'],
        ['a','c'],
        ['b','e']]
    }, columns=['groups'])

s = df['groups']

mlb = MultiLabelBinarizer()

pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)

Result:

    a   b   c   e
0   1   1   1   0
1   0   0   1   0
2   0   1   1   1
3   1   0   1   0
4   0   1   0   1

Worked for me and also was suggested here and here
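As a variation on the snippet above, here is a hedged sketch that passes the classes parameter of MultiLabelBinarizer to fix the set and order of output columns. It assumes the full label list is known in advance (all_labels is a hypothetical name), which is one way to obtain the all-zero "d" column from the question's desired output.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame(
    {"groups": [["a", "b", "c"], ["c"], ["b", "c", "e"], ["a", "c"], ["b", "e"]]}
)

# Assumption: every possible label is known up front.
all_labels = ["a", "b", "c", "d", "e"]

# classes= fixes the output columns, including labels that never occur.
mlb = MultiLabelBinarizer(classes=all_labels)
dummies = pd.DataFrame(
    mlb.fit_transform(df["groups"]), columns=mlb.classes_, index=df.index
)
print(dummies)
```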

Answered By: Teoretic

Answer 4

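This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)

It uses .explode() instead of .apply(pd.Series).stack(). On pandas 2.0 and later, where the level argument of sum was removed, write .groupby(level=0).sum() instead of .sum(level=0).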

Comparing with the other solutions:

import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
#           m1        m2        m3
# ms  5.586517  3.821662  2.547167
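The timings above were taken on a five-row Series, where fixed call overhead probably dominates. A rough sketch of how the comparison might be repeated on larger synthetic data follows; the row count, the label set and the choice of candidates are my assumptions, and no timing results are claimed here.

```python
import random
import timeit

import pandas as pd

random.seed(0)
labels = list("abcde")
n_rows = 5_000  # hypothetical size; raise it to approximate the real data

# Each synthetic user belongs to between 1 and 3 distinct groups.
s = pd.Series([random.sample(labels, random.randint(1, 3)) for _ in range(n_rows)])

candidates = {
    "explode + get_dummies": lambda: pd.get_dummies(s.explode()).groupby(level=0).sum(),
    "str.join + str.get_dummies": lambda: s.str.join("|").str.get_dummies(),
}

# Best of 3 single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}
print(pd.Series(timings, name="seconds"))
```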

Answered By: RBA

Answer 5

You can use str.join to join the elements of each list in the series into a single string, and then use str.get_dummies:

out = df.join(df['groups'].str.join('|').str.get_dummies())

print(out)

      groups  a  b  c  e
0  [a, b, c]  1  1  1  0
1        [c]  0  0  1  0
2  [b, c, e]  0  1  1  1
3     [a, c]  1  0  1  0
4     [b, e]  0  1  0  1
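One caveat with this approach is that a label containing the separator would be split into several columns. A small sketch, assuming a separator character that never occurs in any label (the "\x1f" unit separator is my choice, not something from the answer):

```python
import pandas as pd

# Hypothetical labels, one of which contains the default "|" separator.
df = pd.DataFrame({"groups": [["a|b", "c"], ["c"]]})

# Assumption: this control character never appears inside a label.
sep = "\x1f"

# Join and split with the same separator so "a|b" stays a single label.
dummies = df["groups"].str.join(sep).str.get_dummies(sep=sep)
print(dummies)
```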

Answered By: Ynjxsjmh

Answer 6

You can use explode and crosstab:

s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])

s = s.explode()
pd.crosstab(s.index, s)

Output:

col_0  a  b  c  e
row_0            
0      1  1  1  0
1      0  0  1  0
2      0  1  1  1
3      1  0  1  0
4      0  1  0  1
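The same idea can be sketched with named columns, which avoids the generic row_0/col_0 labels in the output. The names "user" and "group" are hypothetical, and the explicit long-format frame is my variation on the answer rather than its exact code.

```python
import pandas as pd

s = pd.Series([["a", "b", "c"], ["c"], ["b", "c", "e"], ["a", "c"], ["b", "e"]])

# Long format: one row per (user, group) pair.
long = s.explode().rename("group").rename_axis("user").reset_index()

# crosstab counts how often each (user, group) pair occurs.
table = pd.crosstab(long["user"], long["group"])
print(table)
```

Because crosstab counts occurrences, a label repeated within one list would produce a value greater than 1.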

Answered By: Mykola Zotko
