How can I find the year that has the highest value for each subset of a large dataframe using Pandas?

Question

I have a dataframe with 10 years of humidity and temperature data for each of 100 dataloggers. Here is the head:

The loggers are labeled as L1 through L100.

My end goal is to have a dataframe with three columns: year, the count of dataloggers where that year had the highest average humidity, and the count of dataloggers where that year had the highest average temperature.

I thought that this code would work:

import pandas as pd

df_years = pd.DataFrame({'year': list(range(1985, 1996))})

traits = ['avg_humidity', 'avg_temp']

for trait in traits:
    df_logger = df.groupby('logger')['year', trait].max().reset_index()
    df_input = df_logger['year'].value_counts().reset_index()
    df_input.columns = ['year', f'count_{trait}']
    df_years = df_years.merge(df_input, how='left', on='year').fillna(0)

But this output is leaving me with the same values for both traits, making me think I’ve done things wrong. Even when I just look at one trait:

df_logger = df.groupby('logger')['year','avg_humidity'].max().reset_index()
df_input = df_logger['year'].value_counts().reset_index()
df_input.columns = ['year', 'count']
df_merged = df_years.merge(df_input, how='left', on='year').fillna(0)

The data seem incorrect. I think my whole process is wrong here. Any help would be massively appreciated. Thanks so much in advance.

Asked By: Trev

Answer 1

I believe the mistake is in this line:

df_logger = df.groupby('logger')['year', trait].max().reset_index()

If I understand correctly, for each logger you want the year in which the trait reaches its maximum value. In other words, you want the index of the trait's maximum value, and then the year at that index. Your code instead takes the maximum of each column independently, so 'year' is always the latest year recorded for each logger, whichever trait you pass in. That is why both traits give you the same counts.
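To make the difference concrete, here is a minimal sketch on made-up data. The values and the two-logger frame are assumptions for illustration only, not taken from the question's dataset.

```python
import pandas as pd

# Hypothetical toy data: two loggers, three years each.
df = pd.DataFrame({
    'logger': ['L1', 'L1', 'L1', 'L2', 'L2', 'L2'],
    'year': [1985, 1986, 1987, 1985, 1986, 1987],
    'avg_humidity': [70.0, 55.0, 60.0, 40.0, 65.0, 50.0],
})

# Column-wise max: 'year' is the latest year per logger, unrelated to the trait.
colwise_max = df.groupby('logger')[['year', 'avg_humidity']].max()

# idxmax returns the row label of each logger's maximum trait value...
best_rows = df.groupby('logger')['avg_humidity'].idxmax()
# ...and .loc looks up the year stored on that row.
best_years = df.loc[best_rows, 'year'].tolist()

print(colwise_max['year'].tolist())  # latest year for every logger
print(best_years)                    # year of the humidity peak per logger
```

With this toy data, the column-wise max reports 1987 for both loggers, while the idxmax lookup gives 1985 for L1 and 1986 for L2.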

I believe this will do what you want:

import pandas as pd

df_years = pd.DataFrame({'year': list(range(1985, 1996))})

traits = ['avg_humidity', 'avg_temp']

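for trait in traits:
    # Row label of each logger's maximum for this trait, then the year on that row
    df_logger = df.loc[df.groupby('logger')[trait].idxmax(), 'year'].reset_index()
    # Count how many loggers peaked in each year
    df_input = df_logger['year'].value_counts().reset_index()
    df_input.columns = ['year', f'count_{trait}']
    # Years in which no logger peaked get a count of 0
    df_years = df_years.merge(df_input, how='left', on='year').fillna(0)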
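If you want to check the approach end to end, here is a self-contained sketch on a small made-up frame (three loggers, three years; all values are assumptions). It is a variant of the loop above rather than the same code: it fills in missing years with reindex instead of merge plus fillna, which also keeps the counts as integers, since the left merge introduces NaN and turns the column into floats.

```python
import pandas as pd

# Hypothetical toy data: three loggers, three years each.
df = pd.DataFrame({
    'logger': ['L1'] * 3 + ['L2'] * 3 + ['L3'] * 3,
    'year': [1985, 1986, 1987] * 3,
    'avg_humidity': [70, 55, 60, 40, 65, 50, 80, 75, 60],
    'avg_temp': [10, 12, 11, 9, 8, 13, 14, 15, 16],
})

traits = ['avg_humidity', 'avg_temp']
years = sorted(df['year'].unique())

counts = {}
for trait in traits:
    # Row label of each logger's maximum for this trait
    best_rows = df.groupby('logger')[trait].idxmax()
    # Count peak years; years with no peak are filled with 0
    counts[f'count_{trait}'] = (
        df.loc[best_rows, 'year'].value_counts().reindex(years, fill_value=0)
    )

# One row per year, one count column per trait
result = pd.DataFrame(counts).rename_axis('year').reset_index()
print(result)
```

One thing to keep in mind: idxmax returns only the first row when a logger has tied maxima, so ties are counted once, for the earliest matching row.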

Answered By: R_D
