pandas how to count column boolean value that based on group
Question:
I have a Dataframe df ,you can have it by running the following code:
import pandas as pd
from io import StringIO
df = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(df.strip()), sep='ss+', engine='python')
How can I count how many status_reviews and supply_review are True in each month?
The output should look like the following:
month # of true status_review # of true supply_review
2023-01-01 1 1
2022-12-01 1 2
Answers:
You can use .groupby()
and .sum()
, which is more concise than using .agg()
:
df.groupby("month").sum()
This outputs:
status_review supply_review
month
2022-12-01 1 2
2023-01-01 1 1
Assuming you have one date per month, like in this case, you can simply use groupby
and sum all columns.
import pandas as pd
from io import StringIO
# setup sample data
df = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(df.strip()), sep='ss+', engine='python')
# sum all rows by month column
df.groupby('month').agg('sum')
If you have multiple dates per month, then you can convert the column to datetime
type and then sum by month as follows:
df['month'] = pd.to_datetime(df['month'])
df.groupby([pd.Grouper(key='month', freq='M')]).agg('sum')
use groupby and value_counts() with a reset_index to create a dataframe column for the counts. find the p-value by using the wilcoxon
data = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(data.strip()), sep='ss+', engine='python')
print(df)
results_status=df.groupby('month')['status_review'].value_counts().reset_index(name='n').sort_values('month')
results_supply=df.groupby('month')['supply_review'].value_counts().reset_index(name='n').sort_values('month')
results=pd.concat([results_status,results_supply],axis=1)
print(results)
output
month status_review n month supply_review n
0 2022-12-01 False 2 2022-12-01 True 2
1 2022-12-01 True 1 2022-12-01 False 1
2 2023-01-01 False 1 2023-01-01 False 1
3 2023-01-01 True 1 2023-01-01 True 1
I have a Dataframe df ,you can have it by running the following code:
import pandas as pd
from io import StringIO
df = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(df.strip()), sep='ss+', engine='python')
How can I count how many status_reviews and supply_review are True in each month?
The output should look like the following:
month # of true status_review # of true supply_review
2023-01-01 1 1
2022-12-01 1 2
You can use .groupby()
and .sum()
, which is more concise than using .agg()
:
df.groupby("month").sum()
This outputs:
status_review supply_review
month
2022-12-01 1 2
2023-01-01 1 1
Assuming you have one date per month, like in this case, you can simply use groupby
and sum all columns.
import pandas as pd
from io import StringIO
# setup sample data
df = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(df.strip()), sep='ss+', engine='python')
# sum all rows by month column
df.groupby('month').agg('sum')
If you have multiple dates per month, then you can convert the column to datetime
type and then sum by month as follows:
df['month'] = pd.to_datetime(df['month'])
df.groupby([pd.Grouper(key='month', freq='M')]).agg('sum')
use groupby and value_counts() with a reset_index to create a dataframe column for the counts. find the p-value by using the wilcoxon
data = """
month status_review supply_review
2023-01-01 False False
2023-01-01 True True
2022-12-01 False True
2022-12-01 True True
2022-12-01 False False
"""
df= pd.read_csv(StringIO(data.strip()), sep='ss+', engine='python')
print(df)
results_status=df.groupby('month')['status_review'].value_counts().reset_index(name='n').sort_values('month')
results_supply=df.groupby('month')['supply_review'].value_counts().reset_index(name='n').sort_values('month')
results=pd.concat([results_status,results_supply],axis=1)
print(results)
output
month status_review n month supply_review n
0 2022-12-01 False 2 2022-12-01 True 2
1 2022-12-01 True 1 2022-12-01 False 1
2 2023-01-01 False 1 2023-01-01 False 1
3 2023-01-01 True 1 2023-01-01 True 1