Splitting a column by a delimiter and placing each value in the right column
Question:
I have a data frame with a column whose entries can contain any combination of three options (a, b, and/or c), separated by commas.
import pandas as pd
df = pd.DataFrame({'col1':['a,b,c', 'b', 'a,c', 'b,c', 'a,b']})
I want to split this column on ',':
df['col1'].str.split(',', expand=True)
The problem is that the new columns are filled positionally, starting from the first column, whereas I want each value placed in a column of its own.
For example, all a's in the first column, all b's in the second column, and all c's in the third column.
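To make the problem concrete, here is a small sketch of what the positional split produces. The variable name `positional` is only illustrative, and the exact missing-value marker (None or NaN) may depend on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a,b,c', 'b', 'a,c', 'b,c', 'a,b']})

# expand=True fills the new columns left to right, regardless of the value
positional = df['col1'].str.split(',', expand=True)
print(positional)

# Row 1 ('b') puts its only value in column 0, not in a dedicated "b" column
print(positional.iloc[1, 0])
```

So the row containing only 'b' and the row 'a,c' both have their first value in column 0, which is exactly what I want to avoid.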
Answers:
Instead of using expand, explode the split lists into a long format, then pivot:
df['col1'].str.split(',').explode().reset_index().pivot(index = 'index', columns = 'col1', values = 'col1')
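The one-liner above can be easier to follow when broken into steps. This is just a sketch of the same chain with intermediate variables; the names `long` and `wide` are my own.

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a,b,c', 'b', 'a,c', 'b,c', 'a,b']})

# Long format: one row per (original row, value) pair.
# reset_index() turns the repeated original index into a column named 'index'.
long = df['col1'].str.split(',').explode().reset_index()

# Wide format: one column per distinct value; cells hold the value itself,
# and combinations that never occurred become missing values.
wide = long.pivot(index='index', columns='col1', values='col1')
print(wide)
```

Rows that lack a given value end up with a missing value in that column rather than an empty string.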
Here is another method, using pd.crosstab:
df = df.assign(col1=df["col1"].str.split(",")).explode("col1")
df = pd.crosstab(df.index, df["col1"]).rename_axis(index=None, columns=None)
df = df * df.columns  # optional: turns each 1 into its column label and each 0 into an empty string; omit this step to keep the 0/1 indicators
print(df)
Prints:
   a  b  c
0  a  b  c
1     b
2  a     c
3     b  c
4  a  b
To rename columns:
df = df.rename(columns={"a": "col1", "b": "col2", "c": "col3"})
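If only the 0/1 indicators are needed, the multiplication step can be left out. A minimal sketch of that variant, using separate variable names (`long`, `indicators`) of my own so the original df is not overwritten:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a,b,c', 'b', 'a,c', 'b,c', 'a,b']})

# One row per (original row, value) pair; the original index is repeated
long = df.assign(col1=df['col1'].str.split(',')).explode('col1')

# Count each value per original row: 1 if present, 0 if absent
indicators = pd.crosstab(long.index, long['col1']).rename_axis(index=None, columns=None)
print(indicators)
```

For the sample data, row 1 (which held only 'b') should come out as 0, 1, 0.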
Using str.get_dummies:
tmp = df['col1'].str.get_dummies(',')
out = tmp.mul(tmp.columns)
Output:
   a  b  c
0  a  b  c
1     b
2  a     c
3     b  c
4  a  b
With NaNs and custom headers:
tmp = df['col1'].str.get_dummies(',')
out = (tmp.mul(tmp.columns).where(tmp>0)
          .rename(columns={'a': 'X', 'b': 'Y', 'c': 'Z'})
      )
Output:
     X    Y    Z
0    a    b    c
1  NaN    b  NaN
2    a  NaN    c
3  NaN    b    c
4    a    b  NaN
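If this is needed in more than one place, the get_dummies approach could be wrapped in a small helper. The function below is a hypothetical sketch (the name `split_to_value_columns` and its `sep` parameter are my own, not part of pandas). It builds each column with where instead of multiplying integers by strings, which should give the same NaN-filled result.

```python
import pandas as pd

def split_to_value_columns(series, sep=','):
    """Hypothetical helper: one column per distinct value, NaN where absent."""
    # Boolean indicator table: True where the row contains the value
    indicators = series.str.get_dummies(sep).astype(bool)
    # For each value, repeat its label down the column and keep it only
    # where the indicator is True; everything else becomes NaN
    return pd.DataFrame({
        col: pd.Series(col, index=series.index).where(indicators[col])
        for col in indicators.columns
    })

df = pd.DataFrame({'col1': ['a,b,c', 'b', 'a,c', 'b,c', 'a,b']})
out = split_to_value_columns(df['col1'])
print(out)
```

The columns can then be renamed with rename(columns=...) exactly as in the answer above.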