pandas how to count boolean column value and the distinct count of other columns at the same time

Question:

I have a Dataframe df, you can have it by running the following code:

import pandas as pd
from io import StringIO
    
df = """  
    month       status_review             supply_review  case_id
2023-01-01           False                    False      12             
2023-01-01           True                     True       33           
2022-12-01           False                    True       45           
2022-12-01           True                     True       45         
2022-12-01           False                    False      44
    """
df= pd.read_csv(StringIO(df.strip()), sep='ss+', engine='python')

How can I count how many status_reviews and supply_review are True in each month and also the number of case in each month?

The output should looks like the following:

    month       # of true status_review      # of true supply_review  # of case
2023-01-01           1                         1                        2
2022-12-01           1                         2                        2

I have tried both:

df.groupby("month").sum()
df.groupby('month').agg('sum')

But the output is:

           status_review    supply_review   case_id
month           
2022-12-01             1    2               134
2023-01-01             1    1               45

The case_id is not what I want. I want the distinct count of case_id. How can I achieve the desired output?

Asked By: William

||

Answers:

You can use .groupby() and .agg():

df.groupby("month").agg({
    "status_review": "sum",
    "supply_review": "sum",
    "case_id": pd.Series.nunique
})

This outputs:

            status_review  supply_review  case_id
month
2022-12-01              1              2        2
2023-01-01              1              1        2
Answered By: BrokenBenchmark

You can use pd.NamedAgg to rename your columns like this:

df.groupby('month', as_index=False).agg(num_status_review=('status_review', 'sum'),
                                        num_supply_review=('supply_review', 'sum'),
                                        num_case_id=('case_id', 'nunique'))

Output:

        month  num_status_review  num_supply_review  num_case_id
0  2022-12-01                  1                  2            2
1  2023-01-01                  1                  1            2
Answered By: Scott Boston