Groupby column with multiple values

Question:

I have a dataframe that looks like this one (one column has multiple values, the other are just numbers with decimals):

food number
apple,tomato,melon 897.0
apple,meat,banana 984.9
banana,tomato 340.8

I want to get the average number of every food. In the example that’ll be:

  • apple = (897.0 + 984.9)/2 = 940.95
  • banana = (984.9+340.8)/2 = 662.85

And so on to the point of ending up with a new dataframe with just the foods and the average number.

food average
apple 915.95
banana 662.85

I tried my luck with groupby, but the result is all messed up:

#reshape data
df = pd.DataFrame({
    'food' : list(chain.from_iterable(df.food.tolist())), 
    'number' : df.number.repeat(df.food.str.len())
})
# groupby
df.groupby('food').number.apply(lambda x: x.unique().tolist())

I must say that the original dataframe has over 100k rows. Thanks.

Asked By: Andrew

||

Answers:

Use DataFrame.explode(<column-name>) to expand the individual items in the lists into separate cells. They keep the original index, so the corresponding number gets filled in. From there, it’s an easy group by, followed by a simple mean.

import pandas as pd

df = pd.DataFrame({'food': [['apple', 'tomato', 'melon'], 
                            ['apple','meat', 'banana'],
                            ['banana', 'tomato']], 
                   'number': [897, 984.9, 340.8]})

df.explode('food').groupby('food').mean()

results in

        number
food          
apple   940.95
banana  662.85
meat    984.90
melon   897.00
tomato  618.90
Answered By: 9769953

First you will have to convert the string column to a list in each cell. I’ve also included the ability to remove white spaces if any. I modify from the df created by @9769953

import pandas as pd
df = pd.DataFrame({'food': ["apple,tomato, melon", 
                            "apple,meat,banana,melon",
                            "banana, tomato, melon"], 
                   'number': [897, 984.9, 340.8]})

df['food'] = df['food'].str.split(',').apply(lambda x: [e.strip() for e in x]).tolist()
df.explode('food').groupby('food').agg('mean')

Output

enter image description here

If you would like more aggregations, you could use

df.explode('food').groupby('food').agg(['min', 'mean', 'max'])

enter image description here

Answered By: perpetualstudent