Explode multiple columns in CSV with varying/unmatching element counts using Pandas

Question

I'm trying to use the explode function in pandas on two columns of a CSV whose cells hold varying numbers of elements. I understand that a multi-column explode currently requires matching element counts in the target columns, so I'm wondering how to work around this, or whether there is a completely different approach besides explode.

Input:

Fruit	Color	Origin
Apple	Red, Green	USA; Canada
Plum	Purple	USA
Mango	Red, Yellow	Mexico; USA
Pepper	Red, Green	Mexico

Desired Output:

Fruit	Color	Origin
Apple	Red	USA
Apple	Green	Canada
Plum	Purple	USA
Mango	Red	Mexico
Mango	Yellow	USA
Pepper	Red	Mexico
Pepper	Green	Mexico

There is never more than 1 Origin value for rows with only 1 Color value.
Color values are always separated by ", " and Origin values are always separated by "; "

My code so far:

import pandas as pd
df = pd.read_csv('fruits.csv')
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
df = df.explode(['Color','Origin'])
df.to_csv('explode_fruit.csv', encoding='utf-8')

I get this error when running: "ValueError: columns must have matching element counts"

Asked By: KS1993

Answer 1

The error comes from the last row (Pepper), where Color has two values but Origin has only one. Since you mention that there is never more than 1 Origin value for rows with only 1 Color value, you can repeat the Origin value to match the number of colors before exploding:

import pandas as pd
df = pd.DataFrame({
    'Fruit': ['Apple', 'Plum', 'Mango', 'Pepper'],
    'Color': ['Red, Green', 'Purple', 'Red, Yellow', 'Red, Green'],
    'Origin': ['USA; Canada', 'USA', 'Mexico; USA', 'Mexico'],
})
df['Color'] = df['Color'].str.split(', ')
df['Origin'] = df['Origin'].str.split('; ')
# Ensure each row has as many Origin values as Color values:
# repeat the Origin list when it is shorter than the Color list.
df['Origin'] = df.apply(
    lambda x: x['Origin'] * len(x['Color']) if len(x['Color']) > len(x['Origin']) else x['Origin'],
    axis=1,
)
df = df.explode(['Color','Origin']).reset_index(drop=True)
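
If a mismatch could occur in either column, or the shorter list could hold more than one value, a slightly more general variant is to pad the shorter list in each row before exploding. The following is only a sketch under assumptions: `pad_to_length` is a hypothetical helper (not a pandas function), and repeating the last value is assumed to be the desired fill. Multi-column `explode` needs pandas 1.3 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Plum", "Mango", "Pepper"],
    "Color": ["Red, Green", "Purple", "Red, Yellow", "Red, Green"],
    "Origin": ["USA; Canada", "USA", "Mexico; USA", "Mexico"],
})


def pad_to_length(values, target_len):
    """Hypothetical helper: repeat the last element until the list has target_len items."""
    return values + [values[-1]] * (target_len - len(values))


# Split both columns into lists using their respective separators.
colors = df["Color"].str.split(", ")
origins = df["Origin"].str.split("; ")

# For every row, pad both lists to the longer of the two lengths.
lengths = [max(len(c), len(o)) for c, o in zip(colors, origins)]
df["Color"] = [pad_to_length(c, n) for c, n in zip(colors, lengths)]
df["Origin"] = [pad_to_length(o, n) for o, n in zip(origins, lengths)]

# Element counts now match row by row, so the multi-column explode succeeds.
result = df.explode(["Color", "Origin"]).reset_index(drop=True)
print(result)
```

When writing the result back out, passing `index=False` to `to_csv` keeps the row index out of the file, which matches the desired output in the question.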

Answered By: LazyClown
