Select top n groups in pandas dataframe
Question:
I have the following dataframe:
Country Crop Harvest Year Area (ha)
Afghanistan Maize 2019 94910
Afghanistan Maize 2020 140498
Afghanistan Maize 2021 92144
Afghanistan Winter Wheat 2019 2334000
Afghanistan Winter Wheat 2020 2668000
Afghanistan Winter Wheat 2021 1833357
Argentina Maize 2019 7232761
Argentina Maize 2020 7730506
Argentina Maize 2021 8146596
Argentina Winter Wheat 2019 6050953
Argentina Winter Wheat 2020 6729838
Argentina Winter Wheat 2021 6394102
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
Ethiopia Maize 2019 2274306
Ethiopia Maize 2020 2363507
Ethiopia Maize 2021 2530000
Ethiopia Winter Wheat 2019 1789372
Ethiopia Winter Wheat 2020 1829051
Ethiopia Winter Wheat 2021 1950000
France Maize 2019 1506100
France Maize 2020 1691130
France Maize 2021 1549520
France Winter Wheat 2019 5244250
France Winter Wheat 2020 4512420
France Winter Wheat 2021 5276730
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
Namibia Maize 2019 21123
Namibia Maize 2020 35000
Namibia Maize 2021 46070
Namibia Winter Wheat 2019 1079
Namibia Winter Wheat 2020 2000
Namibia Winter Wheat 2021 3026
I want to select the top 2 countries by the average value of Area (ha)
column across the `Harvest Year’s. I tried this but it does not work:
df = df.groupby("Crop", dropna=False).apply( lambda x: x.nlargest(2, "Area (ha)") )
Output should be, here China and india are the countries with the largest average Area (ha)
for both maize and Winter Wheat, but in the full datasets different countries would have largest values for different crops:
Country Crop Harvest Year Area (ha)
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
Answers:
You mixed up Crop
with Harvest Year
in your grouping. First, limit your data to all items of a specific year, then keep the n-largest:
import pandas as pd
df = pd.DataFrame([
["Califlower", 2019, 1],
["Sunflower", 2020, 2],
["Maize", 2020, 3],
["Carrot", 2020, 4],
["Beet", 2019, 5],
], columns=["Country", "Harvest Year", "Area"])
df.groupby("Harvest Year").apply( lambda x: x.nlargest(2, "Area") )
result is as expected
IIUC, you can do double .groupby
:
x = (
df.groupby("Crop")
.apply(lambda x: x.groupby("Country")["Area (ha)"].mean())
.stack()
.groupby(level=0, group_keys=False)
.nlargest(2)
)
print(x)
Prints top 2 Crop/Countries by average area:
Crop Country
Maize China 4.198587e+07
India 9.485397e+06
Winter Wheat India 3.076193e+07
China 2.356232e+07
dtype: float64
Then you can use this index to filter the original dataframe:
out = df.set_index(["Crop", "Country"]).loc[x.index].reset_index()
print(out)
Prints:
Crop Country Harvest Year Area (ha)
0 Maize China 2019 41309740
1 Maize China 2020 41292000
2 Maize China 2021 43355859
3 Maize India 2019 9027130
4 Maize India 2020 9569060
5 Maize India 2021 9860000
6 Winter Wheat India 2019 29318780
7 Winter Wheat India 2020 31357020
8 Winter Wheat India 2021 31610000
9 Winter Wheat China 2019 23732560
10 Winter Wheat China 2020 23383000
11 Winter Wheat China 2021 23571400
I have the following dataframe:
Country Crop Harvest Year Area (ha)
Afghanistan Maize 2019 94910
Afghanistan Maize 2020 140498
Afghanistan Maize 2021 92144
Afghanistan Winter Wheat 2019 2334000
Afghanistan Winter Wheat 2020 2668000
Afghanistan Winter Wheat 2021 1833357
Argentina Maize 2019 7232761
Argentina Maize 2020 7730506
Argentina Maize 2021 8146596
Argentina Winter Wheat 2019 6050953
Argentina Winter Wheat 2020 6729838
Argentina Winter Wheat 2021 6394102
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
Ethiopia Maize 2019 2274306
Ethiopia Maize 2020 2363507
Ethiopia Maize 2021 2530000
Ethiopia Winter Wheat 2019 1789372
Ethiopia Winter Wheat 2020 1829051
Ethiopia Winter Wheat 2021 1950000
France Maize 2019 1506100
France Maize 2020 1691130
France Maize 2021 1549520
France Winter Wheat 2019 5244250
France Winter Wheat 2020 4512420
France Winter Wheat 2021 5276730
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
Namibia Maize 2019 21123
Namibia Maize 2020 35000
Namibia Maize 2021 46070
Namibia Winter Wheat 2019 1079
Namibia Winter Wheat 2020 2000
Namibia Winter Wheat 2021 3026
I want to select the top 2 countries by the average value of Area (ha)
column across the `Harvest Year’s. I tried this but it does not work:
df = df.groupby("Crop", dropna=False).apply( lambda x: x.nlargest(2, "Area (ha)") )
Output should be, here China and india are the countries with the largest average Area (ha)
for both maize and Winter Wheat, but in the full datasets different countries would have largest values for different crops:
Country Crop Harvest Year Area (ha)
China Maize 2019 41309740
China Maize 2020 41292000
China Maize 2021 43355859
China Winter Wheat 2019 23732560
China Winter Wheat 2020 23383000
China Winter Wheat 2021 23571400
India Maize 2019 9027130
India Maize 2020 9569060
India Maize 2021 9860000
India Winter Wheat 2019 29318780
India Winter Wheat 2020 31357020
India Winter Wheat 2021 31610000
You mixed up Crop
with Harvest Year
in your grouping. First, limit your data to all items of a specific year, then keep the n-largest:
import pandas as pd
df = pd.DataFrame([
["Califlower", 2019, 1],
["Sunflower", 2020, 2],
["Maize", 2020, 3],
["Carrot", 2020, 4],
["Beet", 2019, 5],
], columns=["Country", "Harvest Year", "Area"])
df.groupby("Harvest Year").apply( lambda x: x.nlargest(2, "Area") )
result is as expected
IIUC, you can do double .groupby
:
x = (
df.groupby("Crop")
.apply(lambda x: x.groupby("Country")["Area (ha)"].mean())
.stack()
.groupby(level=0, group_keys=False)
.nlargest(2)
)
print(x)
Prints top 2 Crop/Countries by average area:
Crop Country
Maize China 4.198587e+07
India 9.485397e+06
Winter Wheat India 3.076193e+07
China 2.356232e+07
dtype: float64
Then you can use this index to filter the original dataframe:
out = df.set_index(["Crop", "Country"]).loc[x.index].reset_index()
print(out)
Prints:
Crop Country Harvest Year Area (ha)
0 Maize China 2019 41309740
1 Maize China 2020 41292000
2 Maize China 2021 43355859
3 Maize India 2019 9027130
4 Maize India 2020 9569060
5 Maize India 2021 9860000
6 Winter Wheat India 2019 29318780
7 Winter Wheat India 2020 31357020
8 Winter Wheat India 2021 31610000
9 Winter Wheat China 2019 23732560
10 Winter Wheat China 2020 23383000
11 Winter Wheat China 2021 23571400