How to split comma-separated text into columns in a pandas DataFrame?

Question:

I have a dataframe where one of the columns has its items separated with commas. It looks like:

Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e

My goal is to create a matrix whose header contains all the unique values from column Data, i.e. [a,b,c,d,e]. Each row should then hold a flag indicating whether that value is present in that particular row.
The matrix should look like this:

Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1

To separate column Data what I did is:

df['Data'].str.split(',', expand=True)

Then I don’t know how to proceed to allocate the flags to each of the columns.

Asked By: alelew


Answers:

If you split the strings into lists and then explode them, a pivot becomes possible.

(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))

Output

data_list  a  b  c  d  e
Data                    
a,b,c      1  1  1  0  0
a,b,c,d,e  1  1  1  1  1
a,c,d      1  0  1  1  0
a,e        1  0  0  0  1
d,e        0  0  0  1  1
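
Note that the pivot above is indexed by the Data string, so the rows come back sorted and identical strings would share one row. As a possible variant (not part of the original answer), here is a minimal sketch that tabulates by original row position with pd.crosstab instead. The names long, flags and result are only illustrative, and the sample frame is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# One row per (original row label, single item) pair.
# reset_index() yields two columns: "index" (row label) and "Data" (item).
long = df["Data"].str.split(",").explode().reset_index()

# Count each item per original row, then cap at 1 so repeats still give a flag.
flags = pd.crosstab(long["index"], long["Data"]).clip(upper=1)

# Align on the index, so the original row order is preserved.
result = df.join(flags)
print(result)
```

Because the table is keyed by the row label rather than by the string, duplicate Data values keep separate rows.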
Answered By: Chris

You could apply a custom count function for each key:

for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
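
One caveat: str.count matches substrings and returns a count rather than a flag, which is fine for the single-letter items here but could miscount if one item were contained in another (say "a" and "ab"). A rough sketch of the same loop with whole-item matching, assuming the keys should be derived from the data rather than hard-coded (tokens and keys are illustrative names):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# Split once so that each row holds a list of whole items.
tokens = df["Data"].str.split(",")

# Collect the unique items across all rows instead of hard-coding them.
keys = sorted(set(tokens.explode()))

for k in keys:
    # Membership test on whole items, so "a" would not match inside "ab".
    df[k] = tokens.apply(lambda items, k=k: int(k in items))

print(df)
```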
Answered By: rammelmueller

Maybe you can try this without pivot.

Create the dataframe.

import pandas as pd
import io

s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''

df = pd.read_csv(io.StringIO(s), sep=r"\s+")

We can use pandas.Series.str.split with the expand argument set to True, then apply value_counts to each row with axis = 1.

Finally, fillna with zero and convert the data to integers with astype(int).

df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)

#
    a   b   c   d   e
0   1   1   1   0   0
1   1   0   1   1   0
2   0   0   0   1   1
3   1   0   0   0   1
4   1   1   1   1   1

Then concatenate it with the original column.

new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)

#
    Data        a   b   c   d   e
0   a,b,c       1   1   1   0   0
1   a,c,d       1   0   1   1   0
2   d,e         0   0   0   1   1
3   a,e         1   0   0   0   1
4   a,b,c,d,e   1   1   1   1   1
Answered By: Denny Chen

Use the Series.str.get_dummies() method to return the required matrix of ‘a’, ‘b’, … ‘e’ columns.

df["Data"].str.get_dummies(sep=',')
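
For completeness, a short sketch of how this might be combined with the original column, assuming the sample frame from the question (flags and result are illustrative names):

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({"Data": ["a,b,c", "a,c,d", "d,e", "a,e", "a,b,c,d,e"]})

# Each distinct item becomes its own indicator column; the index matches df.
flags = df["Data"].str.get_dummies(sep=",")

# Place the flag columns next to the original column.
result = pd.concat([df, flags], axis=1)
print(result)
```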
Answered By: ag79