Phyton: How to get the average of the n largest values for each column grouped by id
Question:
I’m trying to get the mean for each column while grouped by id. But I don’t get it to work as I want to.
The data:
ID Property3 Property2 Property3
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I got so far:
I got those two tries. But they are both just for one column and I don’t know how to do it for more then just one.:
data.groupby('id')['property1'].apply(lambda grp: grp.nlargest(100).mean())
1 37.897989
2 33.059432
3 34.926530
4 33.036137
data.groupby('id').agg({'property1': {lambda grp: grp.nlargest(100).mean()}})
id property1 <lambda>
1 37.897989
2 33.059432
3 34.926530
4 33.036137
What I want:
Idealy, I would like to have a dataframe as follows:
ID Property3 Property2 Property3
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
So that each row contains the mean values for the 100 biggest values for EACH column grouped by id.
Answers:
Use GroupBy.agg
with omit columns for processing all columns in DataFrame without ID
:
df = data.groupby('ID').agg(lambda grp: grp.nlargest(100).mean()).reset_index()
print (df)
ID Property1 Property2 Property3
0 1.0 22.1 ... ...
1 1203.0 39.6 ... ...
Or specify columns after groupby
:
df = (data.groupby('ID')['Property1','Property2','Property3']
.agg(lambda grp: grp.nlargest(100).mean())
.reset_index())
Use GroupBy.agg
with as_index=False
so that id
is a column:
data.groupby('id',as_index=False).agg(lambda grp: grp.nlargest(100).mean())
df['average_max'] = df.groupby('ID').agg(lambda grp: grp.nlargest(10).mean())
I’m trying to get the mean for each column while grouped by id. But I don’t get it to work as I want to.
The data:
ID Property3 Property2 Property3
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I got so far:
I got those two tries. But they are both just for one column and I don’t know how to do it for more then just one.:
data.groupby('id')['property1'].apply(lambda grp: grp.nlargest(100).mean())
1 37.897989
2 33.059432
3 34.926530
4 33.036137
data.groupby('id').agg({'property1': {lambda grp: grp.nlargest(100).mean()}})
id property1 <lambda>
1 37.897989
2 33.059432
3 34.926530
4 33.036137
What I want:
Idealy, I would like to have a dataframe as follows:
ID Property3 Property2 Property3
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
So that each row contains the mean values for the 100 biggest values for EACH column grouped by id.
Use GroupBy.agg
with omit columns for processing all columns in DataFrame without ID
:
df = data.groupby('ID').agg(lambda grp: grp.nlargest(100).mean()).reset_index()
print (df)
ID Property1 Property2 Property3
0 1.0 22.1 ... ...
1 1203.0 39.6 ... ...
Or specify columns after groupby
:
df = (data.groupby('ID')['Property1','Property2','Property3']
.agg(lambda grp: grp.nlargest(100).mean())
.reset_index())
Use GroupBy.agg
with as_index=False
so that id
is a column:
data.groupby('id',as_index=False).agg(lambda grp: grp.nlargest(100).mean())
df['average_max'] = df.groupby('ID').agg(lambda grp: grp.nlargest(10).mean())