Keep other columns when using sum() with groupby
Question:
I have a pandas dataframe below:
df
   name  value1  value2  otherstuff1  otherstuff2
0  Jack       1       1         1.19         2.39
1  Jack       1       2         1.19         2.39
2  Luke       0       1         1.08         1.08
3  Mark       0       1         3.45         3.45
4  Luke       1       0         1.08         1.08
Rows with the same name always have the same values for otherstuff1 and otherstuff2.
I’m trying to group by the name column and sum the value1 and value2 columns. (Not add value1 to value2, but sum each column individually.)
I expect to get the result below:
newdf
   name  value1  value2  otherstuff1  otherstuff2
0  Jack       2       3         1.19         2.39
1  Luke       1       1         1.08         1.08
2  Mark       0       1         3.45         3.45
I’ve tried
newdf = df.groupby(['name'], as_index=False).sum()
which groups by name and sums the value1 and value2 columns correctly, but ends up dropping the otherstuff1 and otherstuff2 columns.
Answers:
Something like this? (Assuming otherstuff1 and otherstuff2 are the same for every row with the same name.)
df.groupby(['name', 'otherstuff1', 'otherstuff2'], as_index=False).sum()
Out[121]:
   name  otherstuff1  otherstuff2  value1  value2
0  Jack         1.19         2.39       2       3
1  Luke         1.08         1.08       1       1
2  Mark         3.45         3.45       0       1
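To try the multi-key approach end to end, here is a minimal sketch on the data from the question. The uniqueness check and the final column reordering are my own additions, not part of the answer above; the check only guards the stated assumption that each name has a single otherstuff1/otherstuff2 combination.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Luke", "Mark", "Luke"],
    "value1": [1, 1, 0, 0, 1],
    "value2": [1, 2, 1, 1, 0],
    "otherstuff1": [1.19, 1.19, 1.08, 3.45, 1.08],
    "otherstuff2": [2.39, 2.39, 1.08, 3.45, 1.08],
})

keys = ["name", "otherstuff1", "otherstuff2"]

# Guard for the assumption: every name maps to exactly one
# (otherstuff1, otherstuff2) combination. If this fails, grouping by
# all three keys would split one name across several rows.
combos_per_name = df[keys].drop_duplicates().groupby("name").size()
assert (combos_per_name == 1).all()

# Group by all the "constant" columns and sum what is left.
newdf = df.groupby(keys, as_index=False).sum()

# Put the columns back in the order used in the question.
newdf = newdf[list(df.columns)]
print(newdf)
```

One thing to watch with this approach: by default groupby drops rows whose key is NaN, so a missing otherstuff value would silently remove that row unless dropna=False is passed (available in pandas 1.1 and later).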
You should specify what pandas must do with the other columns. In your case, I think you want to keep one row per group, regardless of its position within the group.
This can be done with agg on the grouped object. agg accepts a dict that specifies which operation should be performed on each column.
df.groupby(['name'], as_index=False).agg({'value1': 'sum', 'value2': 'sum', 'otherstuff1': 'first', 'otherstuff2': 'first'})
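For anyone who wants to run the agg call above, here is a small self-contained sketch using the question's data. The variable names agg_spec and newdf are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Luke", "Mark", "Luke"],
    "value1": [1, 1, 0, 0, 1],
    "value2": [1, 2, 1, 1, 0],
    "otherstuff1": [1.19, 1.19, 1.08, 3.45, 1.08],
    "otherstuff2": [2.39, 2.39, 1.08, 3.45, 1.08],
})

# One aggregation per column: sum the values, keep the first
# occurrence of the columns that are constant within a group.
agg_spec = {
    "value1": "sum",
    "value2": "sum",
    "otherstuff1": "first",
    "otherstuff2": "first",
}

newdf = df.groupby("name", as_index=False).agg(agg_spec)
print(newdf)
```

Note that 'first' on a groupby returns the first non-null value in each group, so if otherstuff1 is missing in the first row of a group but present later, the later value is the one kept.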
The key in the answer above is actually as_index=False; otherwise all the columns in the groupby list end up in the index instead of staying as regular columns.
p_summ = p.groupby(attributes_list, as_index=False).agg({'AMT': 'sum'})
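To make the effect of as_index concrete, here is a rough comparison on the question's data. The names by_index and as_columns are made up for this sketch.

```python
import pandas as pd

# Trimmed-down version of the question's data.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Luke", "Mark", "Luke"],
    "value1": [1, 1, 0, 0, 1],
    "otherstuff1": [1.19, 1.19, 1.08, 3.45, 1.08],
})

keys = ["name", "otherstuff1"]

# Default (as_index=True): the grouping keys become a MultiIndex,
# so only value1 is left as a column.
by_index = df.groupby(keys).agg({"value1": "sum"})

# as_index=False: the grouping keys stay as ordinary columns.
as_columns = df.groupby(keys, as_index=False).agg({"value1": "sum"})

print(by_index.index.names)      # the grouping keys
print(list(as_columns.columns))  # keys plus value1

# If you already have the indexed form, reset_index() moves the
# keys back into columns.
flattened = by_index.reset_index()
```

So if you forget as_index=False, the data is not lost; it lives in the index and reset_index() brings it back.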
These solutions are great, but when you have too many columns you do not want to type all of the column names. So here is what I came up with:
column_map = {col: "first" for col in df.columns}
column_map["col_name1"] = "sum"
column_map["col_name2"] = lambda x: set(x) # it can also be a function or lambda
Now you can simply do
df.groupby(["col_to_group"], as_index=False).aggregate(column_map)
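Applied to the data from the question, the same idea might look like the sketch below. The names group_col and sum_cols are placeholders of mine, and leaving the grouping column out of the map is my own choice to keep the spec limited to the columns actually being aggregated.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "name": ["Jack", "Jack", "Luke", "Mark", "Luke"],
    "value1": [1, 1, 0, 0, 1],
    "value2": [1, 2, 1, 1, 0],
    "otherstuff1": [1.19, 1.19, 1.08, 3.45, 1.08],
    "otherstuff2": [2.39, 2.39, 1.08, 3.45, 1.08],
})

group_col = "name"               # column to group by
sum_cols = ["value1", "value2"]  # columns that should be summed

# Default every non-grouping column to "first" ...
column_map = {col: "first" for col in df.columns if col != group_col}
# ... then override the ones that need a real aggregation.
column_map.update({col: "sum" for col in sum_cols})

newdf = df.groupby(group_col, as_index=False).agg(column_map)
print(newdf)
```

Because the map is built from df.columns, any column added to the dataframe later is carried through with 'first' automatically, which is convenient but also means a column that is not constant within a group would be silently reduced to one value.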