How do I filter something in pandas so that it outputs the first and last datetime for every status stage a product is in?

Question:

I have this dataframe:

| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Banana  | Good   | 2000-01-01 | 2000-12-31 |
| Apple   | Good   | 2000-01-01 | 2000-12-31 |
| Apple   | Good   | 2001-01-01 | 2001-12-31 |
| Apple   | Good   | 2002-01-01 | 2023-09-28 |
| Banana  | Bad    | 2001-01-01 | 2001-12-31 |
| Banana  | Good   | 2002-01-01 | NaT        |

I want to implement a groupby function that outputs the product and the earliest start date and last end date, even if it’s a NaT.

I tried this code and it kind of works but it misses the NaT value for Banana.

result_df = df.groupby(["Product", "Status"]).agg({"Start_Date": "min", "End_Date": "last"}).reset_index()

Ultimately, what I’m looking for is an output like this:

For Apple:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Apple   | Good   | 2000-01-01 | 2023-09-28 |


For Banana:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Banana  | Good   | 2000-01-01 | 2000-12-31 |
| Banana  | Bad    | 2001-01-01 | 2001-12-31 |
| Banana  | Good   | 2002-01-01 | NaT        |

If Apple had the NaT, I’d also want to be able to output that on a single line/row:

| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Banana  | Good   | 2000-01-01 | 2000-12-31 |
| Apple   | Good   | 2000-01-01 | 2000-12-31 |
| Apple   | Good   | 2001-01-01 | 2001-12-31 |
| Apple   | Good   | 2002-01-01 | NaT        |
| Banana  | Bad    | 2001-01-01 | 2001-12-31 |
| Banana  | Good   | 2002-01-01 | NaT        |

Intended output:

For Apple:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Apple   | Good   | 2000-01-01 | NaT        |

For Banana:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Banana  | Good   | 2000-01-01 | 2000-12-31 |
| Banana  | Bad    | 2001-01-01 | 2001-12-31 |
| Banana  | Good   | 2002-01-01 | NaT        |

But what I get is:

For Apple:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Apple   | Good   | 2000-01-01 | 2001-12-31 |
| Apple   | Good   | 2002-01-01 | NaT        |

For Banana:
| Product | Status | Start_Date | End_Date   |
|---------|--------|------------|------------|
| Banana  | Good   | 2000-01-01 | 2000-12-31 |
| Banana  | Bad    | 2001-01-01 | 2001-12-31 |
| Banana  | Good   | 2002-01-01 | NaT        |

The Banana output is correct, but the Apple output is not. Is it possible to get the desired output?

Thank you in advance!

Asked By: PRuss


Answers:

Try adding one more grouping key to .groupby, namely whether End_Date is missing, so that rows with a NaT end date form their own group:

df = (
    df.groupby(["Product", "Status", df["End_Date"].isna().values])
    .agg({"Start_Date": "min", "End_Date": "last"})
    .droplevel(2)
    .reset_index()
)

print(df)

Prints:

  Product Status Start_Date   End_Date
0   Apple   Good 2000-01-01 2023-09-28
1  Banana    Bad 2001-01-01 2001-12-31
2  Banana   Good 2000-01-01 2000-12-31
3  Banana   Good 2002-01-01        NaT
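
For context on why the attempt in the question loses the NaT: the groupby "last" aggregation returns the last non-null value in each group, not the last row. The sketch below is only an illustration on Banana's two "Good" rows from the question; the variable names are made up for this example.

```python
import pandas as pd

# Banana's two "Good" rows from the question; the second one is still open (NaT).
banana_good = pd.DataFrame({
    "Product": ["Banana", "Banana"],
    "Status": ["Good", "Good"],
    "Start_Date": pd.to_datetime(["2000-01-01", "2002-01-01"]),
    "End_Date": pd.to_datetime(["2000-12-31", None]),
})

grouped = banana_good.groupby(["Product", "Status"])["End_Date"]

# "last" skips missing values, so it falls back to the earlier, closed row.
last_non_null = grouped.last().iloc[0]

# Taking the last row by position keeps the NaT instead.
last_by_position = grouped.agg(lambda s: s.iloc[-1]).iloc[0]

print(last_non_null)
print(pd.isna(last_by_position))
```

Grouping on End_Date.isna() as well sidesteps this, because the open-ended rows then sit in a group of their own.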

EDIT: If you want to merge overlapping or back-to-back intervals (where one row starts no later than the day after the previous row ends):

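import pandas as pd


def group_func(g):
    # Temporarily replace NaT with a far-future date so the comparison below works
    g = g.assign(End_Date=g["End_Date"].fillna(pd.Timestamp("2199-01-01")))

    # Start a new block whenever a row begins more than one day after the
    # previous row ends (assumes rows are ordered by Start_Date within the group)
    g2 = (g["Start_Date"] > (g["End_Date"] + pd.Timedelta(days=1)).shift()).cumsum()
    out = g.groupby(g2).agg({"Start_Date": "min", "End_Date": "last"})

    # Turn the placeholder date back into NaT
    out["End_Date"] = out["End_Date"].replace(pd.Timestamp("2199-01-01"), pd.NaT)
    return out


df = df.groupby(["Product", "Status"]).apply(group_func).droplevel(2).reset_index()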

Prints:

  Product Status Start_Date   End_Date
0   Apple   Good 2000-01-01        NaT
1  Banana    Bad 2001-01-01 2001-12-31
2  Banana   Good 2000-01-01 2000-12-31
3  Banana   Good 2002-01-01        NaT
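
If you would rather avoid the placeholder date and groupby.apply, the same block-merging idea can probably be written with a per-group shift. This is a minimal sketch, not a drop-in replacement: it assumes the question's second dataset (Apple's latest row open-ended), and the names keys, prev_end, new_block and block_id are just illustrative.

```python
import pandas as pd

# The question's second dataset, where Apple's latest row is still open (NaT).
df = pd.DataFrame({
    "Product": ["Banana", "Apple", "Apple", "Apple", "Banana", "Banana"],
    "Status": ["Good", "Good", "Good", "Good", "Bad", "Good"],
    "Start_Date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2001-01-01", "2002-01-01", "2001-01-01", "2002-01-01"]
    ),
    "End_Date": pd.to_datetime(
        ["2000-12-31", "2000-12-31", "2001-12-31", None, "2001-12-31", None]
    ),
})

keys = ["Product", "Status"]
df = df.sort_values(keys + ["Start_Date"])

# End date of the previous row within the same (Product, Status) group.
prev_end = df.groupby(keys)["End_Date"].shift()

# A new block starts on the first row of each group, or when a row begins
# more than one day after the previous row ended. Comparisons against NaT
# are False, so a row following an open-ended row stays in the same block.
first_in_group = df.groupby(keys).cumcount() == 0
new_block = first_in_group | (df["Start_Date"] > prev_end + pd.Timedelta(days=1))
block_id = new_block.cumsum()

out = (
    df.groupby(block_id)
    .agg(
        Product=("Product", "first"),
        Status=("Status", "first"),
        Start_Date=("Start_Date", "min"),
        # Positional last, so an open-ended NaT survives the aggregation.
        End_Date=("End_Date", lambda s: s.iloc[-1]),
    )
    .reset_index(drop=True)
)

print(out)
```

On this data it gives one Apple row ending in NaT and three Banana rows, the same four rows as the output above.
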
Answered By: Andrej Kesely

You can achieve this using the groupby and aggregation functions in pandas.

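import pandas as pd

# Rows from the question; None becomes NaT after pd.to_datetime
data = {
    "Product": ["Banana", "Apple", "Apple", "Apple", "Banana", "Banana"],
    "Status": ["Good", "Good", "Good", "Good", "Bad", "Good"],
    "Start_Date": ["2000-01-01", "2000-01-01", "2001-01-01", "2002-01-01", "2001-01-01", "2002-01-01"],
    "End_Date": ["2000-12-31", "2000-12-31", "2001-12-31", "2023-09-28", "2001-12-31", None],
}
df = pd.DataFrame(data)

df['Start_Date'] = pd.to_datetime(df['Start_Date'])
df['End_Date'] = pd.to_datetime(df['End_Date'])

# Earliest start and latest end per (Product, Status).
# Note: "max" skips NaT, so an open-ended End_Date is not preserved here.
result = (
    df.groupby(['Product', 'Status'])
    .agg(Start_Date=('Start_Date', 'min'), End_Date=('End_Date', 'max'))
    .reset_index()
)

print(result)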
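
One caveat with a plain min/max aggregation is that max ignores NaT, so an open-ended status would show the latest closed end date instead. If one row per (Product, Status) is acceptable, a custom aggregator can keep the NaT. This is a sketch under that assumption; latest_end_or_open is a hypothetical helper name, and note that it collapses Banana's two "Good" periods into a single row, which is not the three-row Banana output asked for in the question.

```python
import pandas as pd

# Rows from the question's first dataset.
df = pd.DataFrame({
    "Product": ["Banana", "Apple", "Apple", "Apple", "Banana", "Banana"],
    "Status": ["Good", "Good", "Good", "Good", "Bad", "Good"],
    "Start_Date": pd.to_datetime(
        ["2000-01-01", "2000-01-01", "2001-01-01", "2002-01-01", "2001-01-01", "2002-01-01"]
    ),
    "End_Date": pd.to_datetime(
        ["2000-12-31", "2000-12-31", "2001-12-31", "2023-09-28", "2001-12-31", None]
    ),
})


def latest_end_or_open(ends):
    # Hypothetical helper: NaT if any row in the group is still open,
    # otherwise the latest end date.
    return pd.NaT if ends.isna().any() else ends.max()


result = (
    df.groupby(["Product", "Status"])
    .agg(
        Start_Date=("Start_Date", "min"),
        End_Date=("End_Date", latest_end_or_open),
    )
    .reset_index()
)

print(result)
```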

Answered By: krishna veer