Pandas – Irregular time series data – Compare row to next 8 hours of rows

Question:

Right now I am using pandas to analyze call center data. The data is structured as followed:


call_time = pd.to_datetime([
         '2020-01-01 01:00:00',
         '2020-01-01 09:00:00',
         '2020-01-01 01:00:00',
         '2020-01-01 03:00:00',
         '2020-01-01 04:00:00',
         '2020-01-01 06:00:00',
         '2020-01-01 01:00:00',
         '2020-01-01 10:00:00',
])
df = pd.DataFrame({'phone_number': ['a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
                  'was_answered': [False, True, False, False, True, False, False, True],},
                  index=call_time)

>>> df
                    phone_number  was_answered
2020-01-01 01:00:00            a         False
2020-01-01 09:00:00            a          True
2020-01-01 01:00:00            b         False
2020-01-01 03:00:00            b         False
2020-01-01 04:00:00            b          True
2020-01-01 06:00:00            b         False
2020-01-01 01:00:00            c         False
2020-01-01 10:00:00            c          True

Each row represents a call at the call center.

The index (call_time) is the time the call started. phone_number shows the hashed phone numbers. was_answered denotes if the call was answered.

The data was sorted by phone_number first and then the call_time. Notice that the call_time is therefore not chronological.

What I would like to do is identify the return callers. Return callers are callers that were not answered the first time, but called back within 8 hours.

On top of this I need the time from the first call until the next call, regardless of wether the call was answered.

So it would look something like this:

                    phone_number  was_answered is_return_call time_until_return
2020-01-01 01:00:00            a         False          False              Null
2020-01-01 09:00:00            a          True           True          08:00:00
2020-01-01 01:00:00            b         False          False              Null
2020-01-01 03:00:00            b         False           True          02:00:00
2020-01-01 04:00:00            b          True           True          01:00:00
2020-01-01 06:00:00            b         False          False              Null
2020-01-01 01:00:00            c         False          False              Null
2020-01-01 10:00:00            c          True          False              Null

I have tried many thing. Does anyone know how to do this? If my goal is unclear, please let me know!

Asked By: Laevae

||

Answers:

You can solve this with rolling windows:

was_answered = df.groupby("phone_number", group_keys=True)["was_answered"]

# When the call has never been answered in the previous 8
# hours, it's a return call. Since we use closed="left", if
# it's the first call in 8 hours, the window is empty, its
# sum is NaN and hence not considered a return call.
is_return_call = was_answered.rolling("8H", closed="left").sum().eq(0)

# Time difference since previous call
time_since_last_call = was_answered.apply(lambda w: w.index.to_series().diff())

result = pd.concat(
    [
        was_answered.rolling(1).sum().astype("bool"),
        is_return_call.rename("is_return_call"),
        # If it's not a return call, time_until_return = NA
        time_since_last_call.where(is_return_call, None).rename("time_until_return"),
    ],
    axis=1,
)

# Alternatively, you can extract `was_answered` from the
# original frame but with modified index so that it can line
# up with the other 2 series
result = pd.concat(
    [   
        df.set_index("phone_number", append=True).swaplevel()["was_answered"],
        is_return_call.rename("is_return_call"),
        time_since_last_call.where(is_return_call, None).rename("time_until_return"),
    ],
    axis=1,
)

Result:

                                  was_answered  is_return_call time_until_return
phone_number                                                                    
a            2020-01-01 01:00:00         False           False               NaT
             2020-01-01 09:00:00          True            True   0 days 08:00:00
b            2020-01-01 01:00:00         False           False               NaT
             2020-01-01 03:00:00         False            True   0 days 02:00:00
             2020-01-01 04:00:00          True            True   0 days 01:00:00
             2020-01-01 06:00:00         False           False               NaT
c            2020-01-01 01:00:00         False           False               NaT
             2020-01-01 10:00:00          True           False               NaT
Answered By: Code Different

Try:

def my_func(x, t=pd.Timedelta(hours=8)):
    out, last_time, was_answered = [], None, False
    for cnt, (i, a) in enumerate(zip(x.index, x["was_answered"])):
        if cnt == 0:
            out.append(False)
        else:
            out.append(not was_answered if i - last_time <= t else False)
        last_time, was_answered = i, a
    return out


g = df.groupby("phone_number", group_keys=False)
df["is_return_call"] = g.apply(lambda x: pd.Series(my_func(x), index=x.index))
df["time_until_return"] = g.apply(lambda x: x.index.to_frame().diff()).mask(
    ~df["is_return_call"]
)
print(df)

Prints:

                    phone_number  was_answered  is_return_call time_until_return
2020-01-01 01:00:00            a         False           False               NaT
2020-01-01 09:00:00            a          True            True   0 days 08:00:00
2020-01-01 01:00:00            b         False           False               NaT
2020-01-01 03:00:00            b         False            True   0 days 02:00:00
2020-01-01 04:00:00            b          True            True   0 days 01:00:00
2020-01-01 06:00:00            b         False           False               NaT
2020-01-01 01:00:00            c         False           False               NaT
2020-01-01 10:00:00            c          True           False               NaT
Answered By: Andrej Kesely