How to use date_range with day frequency?

Question:

I tried to use date_range with a day frequency on this dataframe:

df = pd.DataFrame({'Start':['2022-06-07 06:24:48','2022-06-07 14:37:16','2022-06-07 08:00:59'],
                   'End':['2022-06-07 14:07:00','2022-06-08 02:51:21','2022-06-09 13:18:34'],
                   'Process':['PROD','VORG','STO'], 
                   'Duration_Min':[462.20,734.08,3197.58]})

df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
df['difference']=df['End'].dt.date-df['Start'].dt.date

def find_interval(sr):
    dti = pd.date_range(sr['Start'], sr['End'], freq='1D').normalize() + pd.Timedelta(days=1)
    return list(zip([sr['Start']] + dti.tolist(), dti.tolist() + [sr['End']]))

df1 = df.apply(find_interval, axis=1).explode().apply(pd.Series)
df1 = df.drop(columns=['Start', 'End']).join(df1).rename(columns={0: 'Start', 1: 'End'})
df1['Duration_Min']=(df1["End"]-df1["Start"]).dt.total_seconds().div(60)

What I get is:

Process  Duration_Min difference               Start                 End
0    PROD   1055.200000     0 days 2022-06-07 06:24:48 2022-06-08 00:00:00
0    PROD   -593.000000     0 days 2022-06-08 00:00:00 2022-06-07 14:07:00
1    VORG    562.733333     1 days 2022-06-07 14:37:16 2022-06-08 00:00:00
1    VORG    171.350000     1 days 2022-06-08 00:00:00 2022-06-08 02:51:21
2     STO    959.016667     2 days 2022-06-07 08:00:59 2022-06-08 00:00:00
2     STO   1440.000000     2 days 2022-06-08 00:00:00 2022-06-09 00:00:00
2     STO   1440.000000     2 days 2022-06-09 00:00:00 2022-06-10 00:00:00
2     STO   -641.433333     2 days 2022-06-10 00:00:00 2022-06-09 13:18:34

I would like to cut the events so that new timestamps with new intervals are created when the day changes between Start and End. If the difference between the dates is 0 days I don’t need to create new timestamps and with the Timedelta(days=1) the End timestamp is mismatched. The column Days should be corosponding with weekday()

What I want:

Start                 End                   Process  Duration_Min  Days
0 2022-06-07 06:24:48 2022-06-07 14:07:00    PROD    462.200000     1
1 2022-06-07 14:37:16 2022-06-07 23:59:59    VORG    562.716667     1
2 2022-06-08 00:00:00 2022-06-08 02:51:21    VORG    171.350000     2
3 2022-06-07 08:00:59 2022-06-07 23:59:59     STO    959.000000     1
4 2022-06-08 00:00:00 2022-06-08 23:59:59     STO   1439.983333     2
5 2022-06-09 00:00:00 2022-06-09 13:18:34     STO    798.566667     3

How could I achieve this?

Asked By: phappy

||

Answers:

Alright so the first thing you want to do is have a date column:

df["date"]=df[["Start","End"]].min(axis=1).dt.date

Once you have that, you will now want to groupby according to your relevant columns

    df = df.groupby(["date",
"Process"]).agg({"Start":"min","End":"min","Duration_Min":"sum", "Days":"any"}).reset_index()

And you should end up with the relevant dataframe

Answered By: xzelda

You could try:

def find_interval(row):
    start, end = row.at["Start"], row.at["End"]
    days = pd.date_range(start, end, freq="D", normalize=True).to_list()
    if len(days) == 1 or days[-1] != end:
        days.append(end)
    days[0] = start
    return list(zip(days, days[1:]))

result = (
    df
    .assign(Days=df.apply(find_interval, axis=1))
    .explode("Days")
    .assign(
        Start=lambda df: df["Days"].str[0],
        End=lambda df: df["Days"].str[1],
        Duration_Min=lambda df:
            (df["End"] - df["Start"]).dt.total_seconds().div(60),
        Days=lambda df: df.groupby("Process").transform("cumcount") + 1
    )
)

Result for your df:

                 Start                  End Process  Duration_Min  Days
0  2022-06-07 06:24:48  2022-06-07 14:07:00    PROD    462.200000     1
1  2022-06-07 14:37:16  2022-06-08 00:00:00    VORG    562.733333     1
1  2022-06-08 00:00:00  2022-06-08 02:51:21    VORG    171.350000     2
2  2022-06-07 08:00:59  2022-06-08 00:00:00     STO    959.016667     1
2  2022-06-08 00:00:00  2022-06-09 00:00:00     STO   1440.000000     2
2  2022-06-09 00:00:00  2022-06-09 13:18:34     STO    798.566667     3

If, as indicated in the comments, the substantial part of df doesn’t need the day-separation, then the following might be better:

m = df["End"].dt.date != df["Start"].dt.date
result = (
    df[m]
    .assign(Days=df.apply(find_interval, axis=1))
    ... <see above> ...
)
result = pd.concat([df[~m].assign(Days=1), result]).sort_index()

The .sort_index()-part is to make sure that the Process-order is the same as in df. Remove it, if that’s not important.

Answered By: Timus
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.