How to extract comma separated values to individual rows
Question:
This is my dataframe (where the values in the authors column are comma separated strings):
authors         book
Jim, Charles    The Greatest Book in the World
Jim             An OK book
Charlotte       A book about books
Charlotte, Jim  The last book
How do I transform it to a long format, like this:
authors    book
Jim        The Greatest Book in the World
Jim        An OK book
Jim        The last book
Charles    The Greatest Book in the World
Charlotte  A book about books
Charlotte  The last book
I’ve tried extracting the individual authors to a list with authors = list(df['authors'].str.split(',')), flattening that list, matching every author to every book, and constructing a new list of dicts with every match. But that doesn’t seem very pythonic to me, and I’m guessing pandas has a cleaner way to do this.
Answers:
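You can set the index to book and split the authors column into one column per author (expand=True), then stack those columns back into rows, which gets you almost all the way there. Rename and sort the columns to finish. Split on ', ' rather than ',' so that names after a comma do not keep a leading space.
df.set_index('book').authors.str.split(', ', expand=True).stack().reset_index('book')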
                             book          0
0  The Greatest Book in the World        Jim
1  The Greatest Book in the World    Charles
0                      An OK book        Jim
0              A book about books  Charlotte
0                   The last book  Charlotte
1                   The last book        Jim
And to get you all the way home (the chain is wrapped in parentheses so it can span several lines):
(df.set_index('book')
 .authors.str.split(', ', expand=True)
 .stack()
 .reset_index('book')
 .rename(columns={0: 'authors'})
 .sort_values('authors')[['authors', 'book']]
 .reset_index(drop=True))
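To make it easier to see what each step of the chain above contributes, here is a minimal sketch that breaks it into named intermediates. The names wide, stacked and long_df are illustrative only, and the explicit dropna() and the stable sort are assumptions added so the result does not depend on how a particular pandas version handles the padding values or tie order.

```python
import pandas as pd

df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

# Step 1: one column per author position; rows with fewer authors are padded
wide = df.set_index('book')['authors'].str.split(', ', expand=True)

# Step 2: stack the author columns into rows; dropna() discards the padding
# (written explicitly rather than relying on stack()'s default NA handling)
stacked = wide.stack().dropna()

# Step 3: move 'book' back to a column, name the values column, sort, reindex
long_df = (
    stacked.reset_index('book')
    .rename(columns={0: 'authors'})
    .sort_values('authors', kind='stable')[['authors', 'book']]
    .reset_index(drop=True)
)

print(long_df)
```

Inspecting wide before stacking shows why the book column has to be the index first: stack() only carries the index along, so any column that should be repeated per author needs to live there.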
- The best option is to use pandas.Series.str.split, and then pandas.DataFrame.explode on the resulting lists.
- Split on ', ', otherwise values following the comma will be preceded by a whitespace (e.g. ' Charles').
- Tested in python 3.10, pandas 1.4.3
import pandas as pd
data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}
df = pd.DataFrame(data)
# display(df)
          authors                            book
0    Jim, Charles  The Greatest Book in the World
1             Jim                      An OK book
2       Charlotte              A book about books
3  Charlotte, Jim                   The last book
# split authors
df.authors = df.authors.str.split(', ')
# explode the column (with a fresh 0, 1... index)
df = df.explode('authors', ignore_index=True)
# display(df)
     authors                            book
0        Jim  The Greatest Book in the World
1    Charles  The Greatest Book in the World
2        Jim                      An OK book
3  Charlotte              A book about books
4  Charlotte                   The last book
5        Jim                   The last book
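If the spacing around the commas is not consistent, splitting on ', ' will miss some separators. The following is a sketch of one way to handle that, assuming it is acceptable to trim whitespace from every name: split on the bare comma, explode, then strip. The messy sample strings are made up for illustration, and assign() is used here so the original df is left unchanged.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing around the commas
df = pd.DataFrame({
    'authors': ['Jim,Charles', 'Jim', 'Charlotte', 'Charlotte ,  Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (
    df.assign(authors=df['authors'].str.split(','))       # lists of raw names
    .explode('authors', ignore_index=True)                # one row per author
    .assign(authors=lambda d: d['authors'].str.strip())   # trim stray spaces
)

print(long_df)
```

Stripping after the explode keeps the split pattern simple and also cleans up leading or trailing spaces on single-author rows.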