Creating new rows from single cell strings in pandas dataframe

Question:

I have a pandas dataframe with output scraped directly from a USDA text file. Below is an example of the dataframe:

    Date       Region                  CommodityGroup                   InboundCity  Low    High
    1/2/2019   Mexico Crossings        Beans,Cucumbers,Eggplant,Melons  Atlanta      4500   4700
    1/2/2019   Eastern North Carolina  Apples and Pears                 Baltimore    7000   8000
    1/2/2019   Michigan                Apples                           Boston       3800   4000

I am looking for a programmatic way to split the cells in the "CommodityGroup" column that contain multiple commodities (separated by commas or "and" in the table above), create a new row for each commodity, and duplicate the data in the remaining columns for each new row. Desired example output:

    Date       Region                  CommodityGroup  InboundCity  Low    High
    1/2/2019   Mexico Crossings        Beans           Atlanta      4500   4700
    1/2/2019   Mexico Crossings        Cucumbers       Atlanta      4500   4700
    1/2/2019   Mexico Crossings        Eggplant        Atlanta      4500   4700
    1/2/2019   Mexico Crossings        Melons          Atlanta      4500   4700
    1/2/2019   Eastern North Carolina  Apples          Baltimore    7000   8000
    1/2/2019   Eastern North Carolina  Pears           Baltimore    7000   8000
    1/2/2019   Michigan                Apples          Boston       3800   4000

Any guidance you can provide in this pursuit will be greatly appreciated!

Asked By: Danny Coveney

Answers:

  • Use .str.split to split the column on the regex pattern ',| and ', which matches either ',' or ' and ' ('|' means OR in a regex).
  • Use .explode to turn each list element into its own row.
    • Optionally, pass ignore_index=True to .explode so the resulting index is labeled 0, 1, …, n – 1; without it, each exploded row keeps the index of the row it came from.
import pandas as pd

# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
        'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
        'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
        'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
        'Low': [4500, 7000, 3800],
        'High': [4700, 8000, 4000]}

# create the dataframe
df = pd.DataFrame(data)

# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')

# explode the CommodityGroup lists
df = df.explode('CommodityGroup')

# final
       Date                  Region CommodityGroup InboundCity   Low  High
0  1/2/2019        Mexico Crossings          Beans     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings      Cucumbers     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings       Eggplant     Atlanta  4500  4700
0  1/2/2019        Mexico Crossings         Melons     Atlanta  4500  4700
1  1/2/2019  Eastern North Carolina         Apples   Baltimore  7000  8000
1  1/2/2019  Eastern North Carolina          Pears   Baltimore  7000  8000
2  1/2/2019                Michigan         Apples      Boston  3800  4000
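
As a rough sketch of the optional ignore_index=True variant mentioned in the bullets, the two steps can also be chained so the original df is left untouched. The whitespace-tolerant pattern and the name result are assumptions for illustration, not part of the answer above, and ignore_index needs pandas 1.1 or later:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
    'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
    'Low': [4500, 7000, 3800],
    'High': [4700, 8000, 4000],
})

# Assumed pattern (not from the answer above): also swallow optional spaces
# around commas, and require whitespace on both sides of "and".
pattern = r'\s*,\s*|\s+and\s+'

result = (
    # assign returns a new frame, so df itself is not modified
    df.assign(CommodityGroup=df['CommodityGroup'].str.split(pattern))
      # ignore_index=True relabels the exploded rows 0, 1, ..., n - 1
      .explode('CommodityGroup', ignore_index=True)
)
print(result)
```

If the scraped strings never contain stray spaces, the simpler ',| and ' pattern from the answer behaves the same on this sample.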
Answered By: Trenton McKinney

You can try this:

# wrap the chain in parentheses so it can span several lines
df = (
    df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
      .apply(lambda x: x.str.split(',| and ').explode())
      .reset_index()
)
print(df)

       Date                  Region InboundCity   Low  High CommodityGroup
0  1/2/2019        Mexico Crossings     Atlanta  4500  4700          Beans
1  1/2/2019        Mexico Crossings     Atlanta  4500  4700      Cucumbers
2  1/2/2019        Mexico Crossings     Atlanta  4500  4700       Eggplant
3  1/2/2019        Mexico Crossings     Atlanta  4500  4700         Melons
4  1/2/2019  Eastern North Carolina   Baltimore  7000  8000         Apples
5  1/2/2019  Eastern North Carolina   Baltimore  7000  8000          Pears
6  1/2/2019                Michigan      Boston  3800  4000         Apples
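
Note that this approach moves CommodityGroup to the last column, because the other columns pass through the index. If the original column order matters, one possible fix is to remember it and reselect afterwards. The names original_columns and result below are hypothetical, introduced only for this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
    'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
    'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
    'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
    'Low': [4500, 7000, 3800],
    'High': [4700, 8000, 4000],
})

# remember the column order before the columns are moved into the index
original_columns = list(df.columns)

result = (
    df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
      .apply(lambda x: x.str.split(',| and ').explode())
      .reset_index()
)

# reselect the columns in their original order
result = result[original_columns]
print(result)
```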
Answered By: NYC Coder