pandas dataframe groupby without losing the column which was grouped
Question:
On the last post i posted that i have a problem to remove/sum row based on certains conditions and someone helped to right a code like this.
Here is my code:
import pandas as pd
df=pd.DataFrame({
'cars':['Kia rio','Bmw','Mercedes','Ford','Kia','Mercedes Benz'],
'rent':[1,1,2,1,4,2],
'sale':[2,4,1,1,5,1],
'id':[2000,1000,3000,4000,2000,3000]
})
print(df)
df1 = df.drop_duplicates().groupby(['id'], sort=False, as_index=False).sum()
print(df1)
But when I run the groupby method it drop car columns. Can anyone help me on this?
I got this output:
id rent sale
0 2000 5 7
1 1000 1 4
2 3000 4 2
3 4000 1 1
Expected output:
cars rent sale id
Kia 5 7 2000
Bmw 1 4 1000
Mercedes 2 1 3000
Ford 1 1 4000
Answers:
You need aggregate columns for keep it in ouput (or pass to by
parameter like id
), so for avoid lost cars
column is used aggregate function last
for last values of cars
per groups, also is specified aggregation sum
for another 2 columns:
df1 = df.drop_duplicates().groupby('id',sort=False,as_index=False).agg(cars=('cars','last'),
rent=('rent', 'sum'),
sale=('sale', 'sum'))
print(df1)
id cars rent sale
0 2000 Kia 5 7
1 1000 Bmw 1 4
2 3000 Mercedes Benz 4 2
3 4000 Ford 1 1
If possible split
carnames by first space and aggregate by id
and cars
:
df['cars'] = df['cars'].str.split().str[0]
df1 = df.drop_duplicates().groupby(['id','cars'], sort=False, as_index=False).sum()
print(df1)
id cars rent sale
0 2000 Kia 5 7
1 1000 Bmw 1 4
2 3000 Mercedes 2 1
3 4000 Ford 1 1
I believe it is happening because cars
is non-numeric and this is skipped by default on DataFrame sum. Sum has an optional parameter numeric_onlybool
which says:
numeric_onlybool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
This question also has some good info about grouping the string parts:
Pandas groupby: How to get a union of strings
import pandas as pd
df = pd.DataFrame({
'cars':['Kia rio', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes Benz'],
'rent':[1, 1, 2, 1, 4, 2],
'sale':[2, 4, 1, 1, 5, 1],
'id': [2000, 1000, 3000, 4000, 2000, 3000]
})
print(df)
df1 = df.drop_duplicates().groupby(['id', 'cars'], sort=False, as_index=False).sum()
print(df1)
On the last post i posted that i have a problem to remove/sum row based on certains conditions and someone helped to right a code like this.
Here is my code:
import pandas as pd
df=pd.DataFrame({
'cars':['Kia rio','Bmw','Mercedes','Ford','Kia','Mercedes Benz'],
'rent':[1,1,2,1,4,2],
'sale':[2,4,1,1,5,1],
'id':[2000,1000,3000,4000,2000,3000]
})
print(df)
df1 = df.drop_duplicates().groupby(['id'], sort=False, as_index=False).sum()
print(df1)
But when I run the groupby method it drop car columns. Can anyone help me on this?
I got this output:
id rent sale
0 2000 5 7
1 1000 1 4
2 3000 4 2
3 4000 1 1
Expected output:
cars rent sale id
Kia 5 7 2000
Bmw 1 4 1000
Mercedes 2 1 3000
Ford 1 1 4000
You need aggregate columns for keep it in ouput (or pass to by
parameter like id
), so for avoid lost cars
column is used aggregate function last
for last values of cars
per groups, also is specified aggregation sum
for another 2 columns:
df1 = df.drop_duplicates().groupby('id',sort=False,as_index=False).agg(cars=('cars','last'),
rent=('rent', 'sum'),
sale=('sale', 'sum'))
print(df1)
id cars rent sale
0 2000 Kia 5 7
1 1000 Bmw 1 4
2 3000 Mercedes Benz 4 2
3 4000 Ford 1 1
If possible split
carnames by first space and aggregate by id
and cars
:
df['cars'] = df['cars'].str.split().str[0]
df1 = df.drop_duplicates().groupby(['id','cars'], sort=False, as_index=False).sum()
print(df1)
id cars rent sale
0 2000 Kia 5 7
1 1000 Bmw 1 4
2 3000 Mercedes 2 1
3 4000 Ford 1 1
I believe it is happening because cars
is non-numeric and this is skipped by default on DataFrame sum. Sum has an optional parameter numeric_onlybool
which says:
numeric_onlybool, default None
Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
Source:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
This question also has some good info about grouping the string parts:
Pandas groupby: How to get a union of strings
import pandas as pd
df = pd.DataFrame({
'cars':['Kia rio', 'Bmw', 'Mercedes', 'Ford', 'Kia', 'Mercedes Benz'],
'rent':[1, 1, 2, 1, 4, 2],
'sale':[2, 4, 1, 1, 5, 1],
'id': [2000, 1000, 3000, 4000, 2000, 3000]
})
print(df)
df1 = df.drop_duplicates().groupby(['id', 'cars'], sort=False, as_index=False).sum()
print(df1)