How to extract comma separated values to individual rows

Question:

This is my dataframe (where the values in the authors column are comma separated strings):

authors            book
Jim, Charles       The Greatest Book in the World
Jim                An OK book
Charlotte          A book about books
Charlotte, Jim     The last book

How do I transform it to a long format, like this:

authors            book
Jim                The Greatest Book in the World
Jim                An OK book
Jim                The last book
Charles            The Greatest Book in the World
Charlotte          A book about books
Charlotte          The last book

I’ve tried extracting the individual authors to a list with authors = list(df['authors'].str.split(',')), flattening that list, matching every author to every book, and constructing a new list of dicts from the matches. But that doesn’t seem very Pythonic to me, and I’m guessing pandas has a cleaner way to do this.

Asked By: durrrutti

||

Answers:

You can set the index to book, split the authors column into separate columns, and stack the result, which gets you almost all the way there. Rename and sort the columns to finish. Split on ', ' (comma plus space) so the second author does not keep a leading space.

df.set_index('book').authors.str.split(', ', expand=True).stack().reset_index('book')

                             book          0
0  The Greatest Book in the World        Jim
1  The Greatest Book in the World    Charles
0                      An OK book        Jim
0              A book about books  Charlotte
0                   The last book  Charlotte
1                   The last book        Jim

And to get you all the way home (the parentheses let the chain span several lines):

(df.set_index('book')
   .authors.str.split(', ', expand=True)
   .stack()
   .reset_index('book')
   .rename(columns={0: 'authors'})
   .sort_values('authors')[['authors', 'book']]
   .reset_index(drop=True))
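
For anyone who wants to run the chain end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the question, the name long_df is illustrative, and the extra .dropna() step is an assumption added for safety rather than part of the chain above.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (
    df.set_index('book')
      .authors.str.split(', ', expand=True)   # one column per author position
      .stack()                                # wide -> long
      .dropna()                               # drop empty cells from rows with fewer authors
                                              # (a no-op where stack() already removed them)
      .reset_index('book')
      .rename(columns={0: 'authors'})
      .sort_values('authors')[['authors', 'book']]
      .reset_index(drop=True)
)

print(long_df)
```

The result has one row per (author, book) pair, sorted by author. The order of books within the same author is not guaranteed by the default sort.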
Answered By: Ted Petrou
  • The best option is to split the strings with pandas.Series.str.split, then call pandas.DataFrame.explode on the resulting column of lists.
    • Split on ', ' rather than ','; otherwise each value after a comma keeps a leading space (e.g. ' Charles').
  • Tested in Python 3.10, pandas 1.4.3
import pandas as pd

data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}

df = pd.DataFrame(data)

# display(df)
          authors                            book
0    Jim, Charles  The Greatest Book in the World
1             Jim                      An OK book
2       Charlotte              A book about books
3  Charlotte, Jim                   The last book

# split authors
df.authors = df.authors.str.split(', ')

# explode the column (with a fresh 0, 1... index)
df = df.explode('authors', ignore_index=True)

# display(df)
     authors                            book
0        Jim  The Greatest Book in the World
1    Charles  The Greatest Book in the World
2        Jim                      An OK book
3  Charlotte              A book about books
4  Charlotte                   The last book
5        Jim                   The last book
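
If the separators in real data are less regular than in the example (no space, or several spaces, around a comma), splitting on ', ' alone will not clean every value. Below is a minimal sketch of one way to handle that: split on the bare comma, explode, and strip whitespace afterwards. The messy sample values and the name long_df are made up for illustration.

```python
import pandas as pd

# Hypothetical messy input: inconsistent spacing around the commas.
df = pd.DataFrame({
    'authors': ['Jim,Charles', 'Jim', ' Charlotte ', 'Charlotte ,  Jim'],
    'book': ['The Greatest Book in the World', 'An OK book',
             'A book about books', 'The last book'],
})

long_df = (
    df.assign(authors=df['authors'].str.split(','))  # strings -> lists, original df untouched
      .explode('authors')                            # one row per list element
      .reset_index(drop=True)                        # fresh 0, 1, ... index
)

# Remove any whitespace left around each name.
long_df['authors'] = long_df['authors'].str.strip()

print(long_df)
```

Stripping after the explode keeps the split pattern simple and also cleans values that contain no comma at all, such as ' Charlotte ' here.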
Answered By: Trenton McKinney