Pandas split column into multiple columns by comma
Question:
I am trying to split a column into multiple columns based on comma/space separation.
My dataframe currently looks like
KEYS 1
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420, 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490, 4000.0499, 4000.0500, 4000.0504,
I would like
KEYS 1 2 3 4
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490 4000.0499 4000.0500 4000.0504
My code currently removes The KEYS column and I’m not sure why. Could anyone improve or help fix the issue?
v = dfcleancsv[1]
#splits the columns by spaces into new columns but removes KEYS?
dfcleancsv = dfcleancsv[1].str.split(' ').apply(Series, 1)
Answers:
Using Edchums answer of
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
I was able to solve it by substituting my variables.
dfcleancsv = pd.concat([dfcleancsv['KEYS'], dfcleancsv[1].str.split(', ', expand=True)], axis=1)
maybe this should work:
df = pd.concat([df['KEYS'],df[1].apply(pd.Series)],axis=1)
In case someone else wants to split a single column (deliminated by a value) into multiple columns – try this:
series.str.split(',', expand=True)
This answered the question I came here looking for.
Credit to EdChum’s code that includes adding the split columns back to the dataframe.
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
Note: The first argument df[[0]]
is DataFrame
.
The second argument df[1].str.split
is the series that you want to split.
The OP had a variable number of output columns.
In the particular case of a fixed number of output columns another elegant solution to name the resulting columns is to use a multiple assignation.
Load a sample dataset and reshape it to long format to obtain a variable
called organ_dimension
.
import seaborn
iris = seaborn.load_dataset('iris')
df = iris.melt(id_vars='species', var_name='organ_dimension', value_name='value')
Split the organ_dimension
variable in 2 variables organ
and dimension
based on the _
separator.
df[['organ', 'dimension']] = df['organ_dimension'].str.split('_', expand=True)
df.head()
Out[10]:
species organ_dimension value organ dimension
0 setosa sepal_length 5.1 sepal length
1 setosa sepal_length 4.9 sepal length
2 setosa sepal_length 4.7 sepal length
3 setosa sepal_length 4.6 sepal length
4 setosa sepal_length 5.0 sepal length
Based on this answer "How to split a column into two columns?"
The simplest way to use is, vectorization
df = df.apply(lambda x:pd.Series(x))
Check this out
Responder_id LanguagesWorkedWith
0 1 HTML/CSS;Java;JavaScript;Python
1 2 C++;HTML/CSS;Python
2 3 HTML/CSS
3 4 C;C++;C#;Python;SQL
4 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
... ... ...
87564 88182 HTML/CSS;Java;JavaScript
87565 88212 HTML/CSS;JavaScript;Python
87566 88282 Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...
87567 88377 HTML/CSS;JavaScript;Other(s):
87568 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...`
###Split the LanguagesWorkedWith column into multiple columns by using` data= data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)`.###
` data1 = pd.read_csv('data.csv', sep=',')
data1.set_index('Responder_id',inplace=True)
data1
data1.loc[1,:]
data= data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)
data.head()`
You may also want to try datar
, a package ports dplyr
, tidyr
and related R packages to python:
>>> df
i j A
<object> <int64> <object>
0 AR 5 Paris,Green
1 For 3 Moscow,Yellow
2 For 4 NewYork,Black
>>> from datar import f
>>> from datar.tidyr import separate
>>> separate(df, f.A, ['City', 'Color'])
i j City Color
<object> <int64> <object> <object>
0 AR 5 Paris Green
1 For 3 Moscow Yellow
2 For 4 NewYork Black
None of the other answers seemed to work for me. This worked after reading the docs at https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html. This also expanded into new columns
df_new = df[0].str.split(',', expand=True)
I am trying to split a column into multiple columns based on comma/space separation.
My dataframe currently looks like
KEYS 1
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420, 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490, 4000.0499, 4000.0500, 4000.0504,
I would like
KEYS 1 2 3 4
0 FIT-4270 4000.0439
1 FIT-4269 4000.0420 4000.0471
2 FIT-4268 4000.0419
3 FIT-4266 4000.0499
4 FIT-4265 4000.0490 4000.0499 4000.0500 4000.0504
My code currently removes The KEYS column and I’m not sure why. Could anyone improve or help fix the issue?
v = dfcleancsv[1]
#splits the columns by spaces into new columns but removes KEYS?
dfcleancsv = dfcleancsv[1].str.split(' ').apply(Series, 1)
Using Edchums answer of
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
I was able to solve it by substituting my variables.
dfcleancsv = pd.concat([dfcleancsv['KEYS'], dfcleancsv[1].str.split(', ', expand=True)], axis=1)
maybe this should work:
df = pd.concat([df['KEYS'],df[1].apply(pd.Series)],axis=1)
In case someone else wants to split a single column (deliminated by a value) into multiple columns – try this:
series.str.split(',', expand=True)
This answered the question I came here looking for.
Credit to EdChum’s code that includes adding the split columns back to the dataframe.
pd.concat([df[[0]], df[1].str.split(', ', expand=True)], axis=1)
Note: The first argument df[[0]]
is DataFrame
.
The second argument df[1].str.split
is the series that you want to split.
The OP had a variable number of output columns.
In the particular case of a fixed number of output columns another elegant solution to name the resulting columns is to use a multiple assignation.
Load a sample dataset and reshape it to long format to obtain a variable
called organ_dimension
.
import seaborn
iris = seaborn.load_dataset('iris')
df = iris.melt(id_vars='species', var_name='organ_dimension', value_name='value')
Split the organ_dimension
variable in 2 variables organ
and dimension
based on the _
separator.
df[['organ', 'dimension']] = df['organ_dimension'].str.split('_', expand=True)
df.head()
Out[10]:
species organ_dimension value organ dimension
0 setosa sepal_length 5.1 sepal length
1 setosa sepal_length 4.9 sepal length
2 setosa sepal_length 4.7 sepal length
3 setosa sepal_length 4.6 sepal length
4 setosa sepal_length 5.0 sepal length
Based on this answer "How to split a column into two columns?"
The simplest way to use is, vectorization
df = df.apply(lambda x:pd.Series(x))
Check this out
Responder_id LanguagesWorkedWith
0 1 HTML/CSS;Java;JavaScript;Python
1 2 C++;HTML/CSS;Python
2 3 HTML/CSS
3 4 C;C++;C#;Python;SQL
4 5 C++;HTML/CSS;Java;JavaScript;Python;SQL;VBA
... ... ...
87564 88182 HTML/CSS;Java;JavaScript
87565 88212 HTML/CSS;JavaScript;Python
87566 88282 Bash/Shell/PowerShell;Go;HTML/CSS;JavaScript;W...
87567 88377 HTML/CSS;JavaScript;Other(s):
87568 88863 Bash/Shell/PowerShell;HTML/CSS;Java;JavaScript...`
###Split the LanguagesWorkedWith column into multiple columns by using` data= data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)`.###
` data1 = pd.read_csv('data.csv', sep=',')
data1.set_index('Responder_id',inplace=True)
data1
data1.loc[1,:]
data= data1['LanguagesWorkedWith'].str.split(';').apply(pd.Series)
data.head()`
You may also want to try datar
, a package ports dplyr
, tidyr
and related R packages to python:
>>> df
i j A
<object> <int64> <object>
0 AR 5 Paris,Green
1 For 3 Moscow,Yellow
2 For 4 NewYork,Black
>>> from datar import f
>>> from datar.tidyr import separate
>>> separate(df, f.A, ['City', 'Color'])
i j City Color
<object> <int64> <object> <object>
0 AR 5 Paris Green
1 For 3 Moscow Yellow
2 For 4 NewYork Black
None of the other answers seemed to work for me. This worked after reading the docs at https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html. This also expanded into new columns
df_new = df[0].str.split(',', expand=True)