use groupby() and for loop to count column values with conditions

Question

What I am trying to do is best explained with code:

import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(365)

#some data
start_date = pd.date_range(start = "2015-01-09", end = "2022-09-11", freq = "6D")
end_date = [d + timedelta(days = np.random.exponential(scale = 100)) for d in start_date]
df = pd.DataFrame(
    {"start_date":start_date,
    "end_date":end_date}
)
#randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac = 0.7).reset_index(drop = True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")

I first create a pd.Series with the 1st day of every month in the entire history of the data:

dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time

What I then want to do is, for each month start in that series, count the rows of df where df["start_date"] is earlier than the 1st day of that month and df["end_date"] is null (recorded as NaT).

I assume I need a for loop for this, somehow grouping by the dates series, so that the resulting output looks something like this:

month_start	count
2015-01-01	5
2015-02-01	10
2015-03-01	35

The count column is the number of df rows where df["start_date"] is earlier than that row's month_start and df["end_date"] is null. The count is computed separately for every value in the series.

Here is the logic of what I am trying to do:

df.groupby(by = dates)[["start_date", "end_date"]].apply(
    lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)

Asked By: JoMcGee

Answer 1

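Is this what you want? Filter to the rows with no end date, then for each month start count the start dates that fall before it: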

df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
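
To sanity-check the idea, here is a minimal sketch on a tiny hand-built frame (the toy data and the names open_rows, loop_counts and fast_counts are illustrative assumptions, not part of the question). It runs the same per-month count and, as a swapped-in alternative, a loop-free version using np.searchsorted on the sorted start dates:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the question's random frame).
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-01-09", "2015-01-21", "2015-02-02", "2015-03-10"]
    ),
    "end_date": pd.to_datetime(["2015-02-15", None, None, None]),
})

# First day of every month that contains a start date.
dates = pd.Series(
    df["start_date"].dt.to_period("M").sort_values().unique()
).dt.start_time

# Keep only rows whose end date is missing (NaT).
open_rows = df[df["end_date"].isna()]

# Per-month count: open rows that started before each month start.
loop_counts = dates.apply(lambda d: int((open_rows["start_date"] < d).sum()))

# Loop-free variant: with side="left", the insertion position of each month
# start in the sorted start dates equals the number of strictly earlier starts.
sorted_starts = np.sort(open_rows["start_date"].to_numpy())
fast_counts = np.searchsorted(sorted_starts, dates.to_numpy(), side="left")

result = pd.DataFrame({"month_start": dates, "count": fast_counts})
print(result)
```

Both versions give counts of 0, 1 and 2 for 2015-01-01, 2015-02-01 and 2015-03-01 on this toy frame. The searchsorted variant avoids rescanning the frame for every month, which may matter on larger data.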

Answered By: Galo do Leste

Answer 2

IIUC, group by the start month shifted forward by one month, count the NaT values in each group, then cumsum to accumulate the counts:

(df['end_date'].isna()
 .groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
 .sum()
 .cumsum()
 )

Output:

start_date
2015-02-01      0
2015-03-01      0
2015-04-01      0
2015-05-01      0
2015-06-01      0
             ... 
2022-06-01    122
2022-07-01    127
2022-08-01    133
2022-09-01    138
2022-10-01    140
Name: end_date, Length: 93, dtype: int64
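
Note that the labels above start one month after the first start date, so they do not line up exactly with the dates series from the question. If that alignment matters, a possible sketch (toy data and the names next_month_start, cum and aligned are illustrative assumptions) is to reindex the cumulative counts onto dates, forward-filling across months with no start dates and using 0 before the first label:

```python
import pandas as pd

# Hypothetical toy data with a gap month (no start dates in February).
df = pd.DataFrame({
    "start_date": pd.to_datetime(
        ["2015-01-09", "2015-01-21", "2015-03-10", "2015-03-15"]
    ),
    "end_date": pd.to_datetime(["2015-02-15", None, None, None]),
})

# First day of every month that contains a start date.
dates = pd.Series(
    df["start_date"].dt.to_period("M").sort_values().unique()
).dt.start_time

# Cumulative NaT count, labelled by the first day of the month
# following each start month.
next_month_start = (df["start_date"].dt.to_period("M") + 1).dt.start_time
cum = df["end_date"].isna().groupby(next_month_start).sum().cumsum()

# Align to the month starts in `dates`: carry the last known total forward
# and use 0 for months before the first label.
aligned = (
    cum.reindex(pd.DatetimeIndex(dates), method="ffill")
    .fillna(0)
    .astype(int)
)
aligned.index.name = "month_start"
print(aligned)
```

On this toy frame the aligned counts are 0 for 2015-01-01 and 1 for 2015-03-01, which matches counting the open rows with a start date before each month start directly.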

Answered By: mozway
