Explode multiple columns in CSV with varying/unmatching element counts using Pandas
Question:
I’m trying to use the explode function in pandas on 2 columns in a CSV that have varying element counts. I understand that one of the limitations of a multi-explode currently is that you can’t have nonmatching element counts in the target columns, so I’m wondering what you can do to get around this or if there’s something completely different besides explode?
Input:
Fruit | Color | Origin |
---|---|---|
Apple | Red, Green | USA; Canada |
Plum | Purple | USA |
Mango | Red, Yellow | Mexico; USA |
Pepper | Red, Green | Mexico |
Desired Output:
Fruit | Color | Origin |
---|---|---|
Apple | Red | USA |
Apple | Green | Canada |
Plum | Purple | USA |
Mango | Red | Mexico |
Mango | Yellow | USA |
Pepper | Red | Mexico |
Pepper | Green | Mexico |
There is never more than 1 Origin value for rows with only 1 Color value.
Color values are always separated by ", " and Origin values are always separated by "; "
My code so far:
import pandas as pd
df = pd.read_csv('fruits.csv')
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
df = df.explode(['Color','Origin'])
df.to_csv('explode_fruit.csv', encoding='utf-8')
I get this error when running: "ValueError: columns must have matching element counts"
Answers:
The error is most likely caused by the last row (Pepper), which has two Color values but only one Origin value. Since you mention that there is never more than 1 Origin value for rows with only 1 Color value, you can repeat the single Origin value to match the number of Color values before exploding:
import pandas as pd
df = pd.DataFrame( {'Fruit':['Apple', 'Plum','Mango','Pepper'],
'Color': ['Red, Green', 'Purple', 'Red, Yellow','Red, Green'],
'Origin':['USA; Canada', 'USA', 'Mexico; USA', 'Mexico']
})
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
# Where a row has fewer Origin values than Color values, repeat the Origin list
# so the counts match (this assumes a single Origin value in that case)
df['Origin'] = df.apply(lambda x: x['Origin'] * len(x['Color']) if len(x['Color']) > len(x['Origin']) else x['Origin'], axis=1)
df = df.explode(['Color','Origin']).reset_index(drop=True)
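If you need the same idea for more than two columns, or want a clear error when a row cannot be aligned, something along these lines might work. This is only a sketch: the helper name `explode_with_broadcast` is made up for illustration, it assumes every cell is already a list with no missing values, and it needs pandas 1.3 or later for multi-column `explode`.

```python
import pandas as pd

def explode_with_broadcast(df, columns):
    """Hypothetical helper: explode list-valued columns together, repeating
    single-element lists so each row has matching element counts."""
    out = df.copy()
    # Longest list length per row across the target columns
    target_len = out[columns].apply(lambda col: col.map(len)).max(axis=1)
    for col in columns:
        out[col] = [
            values * int(n) if len(values) == 1 else values
            for values, n in zip(out[col], target_len)
        ]
    # Rows that still disagree cannot be fixed by repetition alone
    lengths = out[columns].apply(lambda col: col.map(len))
    mismatched = lengths.nunique(axis=1) > 1
    if mismatched.any():
        raise ValueError(f"cannot align rows: {list(out.index[mismatched])}")
    return out.explode(columns).reset_index(drop=True)

df = pd.DataFrame({
    'Fruit': ['Apple', 'Plum', 'Mango', 'Pepper'],
    'Color': ['Red, Green', 'Purple', 'Red, Yellow', 'Red, Green'],
    'Origin': ['USA; Canada', 'USA', 'Mexico; USA', 'Mexico'],
})
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
result = explode_with_broadcast(df, ['Color', 'Origin'])
```

Unlike the one-liner above, this only repeats a list when it has exactly one element, so a row with, say, 3 colors and 2 origins raises an error instead of silently producing mismatched lists.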