pandas dataframe groupby without losing the column which was grouped

Question:

On the last post i posted that i have a problem to remove/sum row based on certains conditions and someone helped to right a code like this.

Here is my code:

import pandas as pd
df=pd.DataFrame({
                'cars':['Kia rio','Bmw','Mercedes','Ford','Kia','Mercedes Benz'],
                'rent':[1,1,2,1,4,2],
                'sale':[2,4,1,1,5,1],
                'id':[2000,1000,3000,4000,2000,3000]
                })
print(df)
df1 = df.drop_duplicates().groupby(['id'], sort=False, as_index=False).sum()
print(df1)

But when I run the groupby method it drop car columns. Can anyone help me on this?

I got this output:

     id  rent  sale
0  2000     5     7
1  1000     1     4
2  3000     4     2
3  4000     1     1

Expected output:

    cars      rent  sale  id
    Kia       5      7    2000
    Bmw       1      4    1000
    Mercedes  2      1    3000
    Ford      1      1    4000
Asked By: user16253345

||

Answers:

You need aggregate columns for keep it in ouput (or pass to by parameter like id), so for avoid lost cars column is used aggregate function last for last values of cars per groups, also is specified aggregation sum for another 2 columns:

df1 = df.drop_duplicates().groupby('id',sort=False,as_index=False).agg(cars=('cars','last'),
                                                                       rent=('rent', 'sum'),
                                                                       sale=('sale', 'sum'))
print(df1)
     id           cars  rent  sale
0  2000            Kia     5     7
1  1000            Bmw     1     4
2  3000  Mercedes Benz     4     2
3  4000           Ford     1     1

If possible split carnames by first space and aggregate by id and cars:

df['cars'] = df['cars'].str.split().str[0]
df1 = df.drop_duplicates().groupby(['id','cars'], sort=False, as_index=False).sum()
print(df1)
     id      cars  rent  sale
0  2000       Kia     5     7
1  1000       Bmw     1     4
2  3000  Mercedes     2     1
3  4000      Ford     1     1
Answered By: jezrael

I believe it is happening because cars is non-numeric and this is skipped by default on DataFrame sum. Sum has an optional parameter numeric_onlybool which says:

numeric_onlybool, default None

    Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.

Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html

This question also has some good info about grouping the string parts:
Pandas groupby: How to get a union of strings

Answered By: Stephan Pieterse
import pandas as pd
df = pd.DataFrame({
                'cars':['Kia rio', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes Benz'],
                'rent':[1, 1, 2, 1, 4, 2],
                'sale':[2, 4, 1, 1, 5, 1],
                'id': [2000, 1000, 3000, 4000, 2000, 3000]
                })
print(df)
df1 = df.drop_duplicates().groupby(['id', 'cars'], sort=False, as_index=False).sum()
print(df1)


Answered By: tobias ansaldi
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.