Creating new rows from single cell strings in pandas dataframe
Question:
I have a pandas dataframe with output scraped directly from a USDA text file. Below is an example of of the dataframe:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans,Cucumbers,Eggplant,Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples and Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
I am looking for a programmatic solution to break up the multiple commodity (each commodity is separated by commas or "and" in the above table) cells in the "CommodityGroups" column, create new rows for the separated commodities, and duplicate the rest of column data for each new row. Desired example output:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
Any guidance you can provide in this pursuit will be greatly appreciated!
Answers:
- Use
.str.split
to split the column with a pattern ',| and '
, which is ','
or ' and '
. '|'
is OR
.
- Use
.explode
to separate list elements into separate rows
- Optionally, set
ignore_index=True
where the resulting index will be labeled 0, 1, …, n – 1, depending on your needs.
import pandas as pd
# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
'Low': [4500, 7000, 3800],
'High': [4700, 8000, 4000]}
# create the dataframe
df = pd.DataFrame(data)
# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')
# explode the CommodityGroup lists
df = df.explode('CommodityGroup')
# final
Date Region CommodityGroup InboundCity Low High
0 1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1 1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1 1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
2 1/2/2019 Michigan Apples Boston 3800 4000
You can try this:
df = df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
.apply(lambda x: x.str.split(',| and ').explode())
.reset_index()
print(df)
Date Region InboundCity Low High CommodityGroup
0 1/2/2019 Mexico Crossings Atlanta 4500 4700 Beans
1 1/2/2019 Mexico Crossings Atlanta 4500 4700 Cucumbers
2 1/2/2019 Mexico Crossings Atlanta 4500 4700 Eggplant
3 1/2/2019 Mexico Crossings Atlanta 4500 4700 Melons
4 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Apples
5 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Pears
6 1/2/2019 Michigan Boston 3800 4000 Apples
I have a pandas dataframe with output scraped directly from a USDA text file. Below is an example of of the dataframe:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans,Cucumbers,Eggplant,Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples and Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
I am looking for a programmatic solution to break up the multiple commodity (each commodity is separated by commas or "and" in the above table) cells in the "CommodityGroups" column, create new rows for the separated commodities, and duplicate the rest of column data for each new row. Desired example output:
Date Region CommodityGroup InboundCity Low High
1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
1/2/2019 Michigan Apples Boston 3800 4000
Any guidance you can provide in this pursuit will be greatly appreciated!
- Use
.str.split
to split the column with a pattern',| and '
, which is','
or' and '
.'|'
isOR
. - Use
.explode
to separate list elements into separate rows- Optionally, set
ignore_index=True
where the resulting index will be labeled 0, 1, …, n – 1, depending on your needs.
- Optionally, set
import pandas as pd
# data
data = {'Date': ['1/2/2019', '1/2/2019', '1/2/2019'],
'Region': ['Mexico Crossings', 'Eastern North Carolina', 'Michigan'],
'CommodityGroup': ['Beans,Cucumbers,Eggplant,Melons', 'Apples and Pears', 'Apples'],
'InboundCity': ['Atlanta', 'Baltimore', 'Boston'],
'Low': [4500, 7000, 3800],
'High': [4700, 8000, 4000]}
# create the dataframe
df = pd.DataFrame(data)
# split the CommodityGroup strings
df.CommodityGroup = df.CommodityGroup.str.split(',| and ')
# explode the CommodityGroup lists
df = df.explode('CommodityGroup')
# final
Date Region CommodityGroup InboundCity Low High
0 1/2/2019 Mexico Crossings Beans Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Cucumbers Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Eggplant Atlanta 4500 4700
0 1/2/2019 Mexico Crossings Melons Atlanta 4500 4700
1 1/2/2019 Eastern North Carolina Apples Baltimore 7000 8000
1 1/2/2019 Eastern North Carolina Pears Baltimore 7000 8000
2 1/2/2019 Michigan Apples Boston 3800 4000
You can try this:
df = df.set_index(['Date', 'Region', 'InboundCity', 'Low', 'High'])
.apply(lambda x: x.str.split(',| and ').explode())
.reset_index()
print(df)
Date Region InboundCity Low High CommodityGroup
0 1/2/2019 Mexico Crossings Atlanta 4500 4700 Beans
1 1/2/2019 Mexico Crossings Atlanta 4500 4700 Cucumbers
2 1/2/2019 Mexico Crossings Atlanta 4500 4700 Eggplant
3 1/2/2019 Mexico Crossings Atlanta 4500 4700 Melons
4 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Apples
5 1/2/2019 Eastern North Carolina Baltimore 7000 8000 Pears
6 1/2/2019 Michigan Boston 3800 4000 Apples