Using later of two dates and then subtracting another date within groupby

Question:

I have a dataframe that looks something like this:

df = pd.DataFrame([1,'A','X','1/3/22 12:00:00AM','1/1/22 12:00:00 AM','1/2/22 12:00:00 AM'],
[1,'A','X','1/4/22 12:00:00AM','1/3/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[1,'A','Y','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/1/22 12:00:00 AM'],
[1,'B','X','1/3/22 12:00:00AM','1/2/22 12:00:00 AM','1/3/22 12:00:00 AM'],
[2,'A','X','1/5/22 12:00:00AM','1/3/22 12:00:00 AM','1/4/22 12:00:00 AM'],
[2,'A','X','1/6/22 12:00:00AM','1/4/22 12:00:00 AM','1/5/22 12:00:00 AM']],
columns = ['ID','Category','Site','Task Completed','Access Completed', 'Upload Completed'])
ID Category Site Task Completed Access Completed Upload Completed
1 A X 1/3/22 12:00:00AM 1/1/22 12:00:00 AM 1/2/22 12:00:00 AM
1 A X 1/4/22 12:00:00AM 1/3/22 12:00:00 AM 1/3/22 12:00:00 AM
1 A Y 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/1/22 12:00:00 AM
1 B X 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/3/22 12:00:00 AM
2 A X 1/5/22 12:00:00AM 1/3/22 12:00:00 AM 1/4/22 12:00:00 AM
2 A X 1/6/22 12:00:00AM 1/4/22 12:00:00 AM 1/5/22 12:00:00 AM

What I want to find is the time difference (in hours) between the later of max access completed date or min upload completed date and the first Task Completed date for every ID/Category/Site combination within the dataset.

In plain text, the logic is (the later date of (max access completed date vs min upload completed date)) – (min/first task completed date).

Based on this dataset, my intended result would look like:

ID Category Site Time Difference First Task Completed Max Access Date Min Upload Date
1 A X 0 1/3/22 12:00:00AM 1/3/22 12:00:00 AM 1/2/22 12:00:00 AM
1 A Y 24 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/1/22 12:00:00 AM
1 B X 0 1/3/22 12:00:00AM 1/2/22 12:00:00 AM 1/3/22 12:00:00 AM
2 A X 24 1/5/22 12:00:00AM 1/4/22 12:00:00 AM 1/4/22 12:00:00 AM

I know how to get the max access date minus the first task completed date:

out = (df
   .groupby(['ID', 'Category', 'Site'], as_index=False)
   .agg({'Task Completed': 'first', 'Access Completed': 'max'})
   .assign(**{'Time Difference': lambda d: d['Task Completed']
              .sub(d['Access Completed'])
              .dt.total_seconds().floordiv(3600)})
)

Which has output:

   ID Category Site      Task Completed Access Completed  Time Difference
0   1        A    X 2022-01-03 00:00:00       2022-01-02             24.0
1   1        A    Y 2022-01-01 01:00:00       2022-01-01              1.0
2   1        B    X 2022-01-01 01:00:00       2022-01-01              1.0
3   2        A    X 2022-01-03 00:00:00       2022-01-02             24.0

But I don’t know how to incorporate min upload completed date.

Asked By: cowboycoder

||

Answers:

Try:

df["Task Completed"] = pd.to_datetime(
    df["Task Completed"], format="%m/%d/%y %H:%M:%S%p"
)
df["Access Completed"] = pd.to_datetime(
    df["Access Completed"], format="%m/%d/%y %H:%M:%S %p"
)
df["Upload Completed"] = pd.to_datetime(
    df["Upload Completed"], format="%m/%d/%y %H:%M:%S %p"
)

out = df.groupby(["ID", "Category", "Site"], as_index=False).agg(
    {
        "Task Completed": "first",
        "Access Completed": "max",
        "Upload Completed": "min",
    }
)

out["Time Difference"] = np.where(
    (out["Access Completed"] - out["Upload Completed"]) > pd.Timedelta(0),
    (out["Access Completed"] - out["Task Completed"]).abs().dt.total_seconds()
    / 3600,
    (out["Upload Completed"] - out["Task Completed"]).abs().dt.total_seconds()
    / 3600,
)

print(out)

Prints:

ID Category Site Task Completed Access Completed Upload Completed Time Difference
0 1 A X 2022-01-03 12:00:00 2022-01-03 12:00:00 2022-01-02 12:00:00 0
1 1 A Y 2022-01-03 12:00:00 2022-01-02 12:00:00 2022-01-01 12:00:00 24
2 1 B X 2022-01-03 12:00:00 2022-01-02 12:00:00 2022-01-03 12:00:00 0
3 2 A X 2022-01-05 12:00:00 2022-01-04 12:00:00 2022-01-04 12:00:00 24
Answered By: Andrej Kesely

@Andrej Kesely’s answer is already covered most part. Yet, if you would like to do this without numpy, you can modify the numpy operation part as

max_time = out[["Access Completed", "Upload Completed"]].max(axis=1)
out["Time Difference"] = (out["Task Completed"] - max_time).dt.total_seconds() / 3600
Answered By: Emma
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.