Obtaining timeframe for ID groups based on state change

Question:

First off, my apologies, I’m a complete novice when it comes to Python. I use it extremely infrequently but require it for this problem.

I have a set of data which looks like the below:

id state dt
101 0 2022-15
101 1 2022-22
101 0 2022-26
102 0 2022-01
102 1 2022-41
103 1 2022-03
103 0 2022-12

I need to provide an output which displays the amount of time each ID was in state = "1", e.g. for ID 101: state1_start_dt = "2022-22", state1_end_dt = "2022-25".

The data is in .CSV format. I’ve attempted to read this in via Pandas, use groupby on the df, and then loop over the groups – however, this seems extremely slow.

I’ve come across Finite State Machines, which seem to match my requirements, but I’m in way over my head attempting to create a Finite State Machine in Python which accepts .CSV input, provides output per ID group, and incorporates logic for scenarios where the last entry for an ID is state = "1" – in which case we’d assume the time frame ran until the end of 2022.

If anyone can provide some sources or sample code which I can break down to get a better understanding – that would be great.

EDIT

Some examples to be clearer on what I’d like to achieve:

-For IDs that have no ending 0 in the state sequence, the state1_end_dt should be entered as ‘2022-52’ (the final week in 2022)

-For IDs which have alternating states, we can incorporate a second, third, fourth etc. set of columns (e.g. state1_start_dt_2, state1_end_dt_2). This will allow each window to be accounted for. For any entries that only have one window, these extra columns can be NULL.

-For IDs which have no "1" present in the state column, these can be skipped.

-For IDs which do not have any 0 states present, the minimum dt value should be taken as the state1_start_dt and ‘2022-52’ can be entered for state1_end_dt.

Asked By: Pheonix


Answers:

If the csv file, called one_zero.csv, looks like this:

id,state,dt
100,0,2022-15
100,1,2022-22
100,0,2022-26
101,0,2022-01
101,1,2022-41
102,1,2022-03
102,0,2022-12
102,1,2022-33

(I’ve added one additional item to the end.)

Then this code gives you what you want.

import pandas as pd

df = pd.read_csv("one_zero.csv")
result = {}
for id_, sub_df in df.groupby('id'):
    sub_df = sub_df.sort_values("dt")
    intervals = []
    start_dt = None
    for state, dt in zip(sub_df["state"], sub_df["dt"]):
        if state == 1 and start_dt is None:
            # open a new interval at the first 1 of a run
            start_dt = dt
        if state == 0 and start_dt is not None:
            # close the interval one week before this 0-state row
            week = int(dt.split("-", maxsplit=1)[1])
            intervals.append((start_dt, f"2022-{week-1:02d}"))
            start_dt = None
    if start_dt is not None:
        # the id ended while still in state 1: assume it ran to year end
        intervals.append((start_dt, "2022-52"))
    result[id_] = intervals

At the end the result dictionary will contain this:

{
 100: [('2022-22', '2022-25')],
 101: [('2022-41', '2022-52')],
 102: [('2022-03', '2022-11'), ('2022-33', '2022-52')]
}

Thanks to the groupby and sort_values, this works even if the lines in the csv file are shuffled. I’ve used a formatted string to fix up the week number: the 02d there means the week is always printed as two digits, zero-padded for the first 9 weeks.
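
For example, with a made-up week value:

>>> week = 7
>>> f"2022-{week-1:02d}"
'2022-06'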

Iterating over the rows like this probably uses less memory, but the zip version is more familiar to me.

    # inside the per-id loop, replacing the zip(...) line:
    for _, row in sub_df.iterrows():
        state = row["state"]
        dt = row["dt"]

IIUC, here are some functions to perform the aggregation you are looking for.

First, we convert the '%Y-%W' strings (e.g. '2022-15') into a DateTime (the Monday of that week), e.g. '2022-04-11', since actual dates are easier to work with than these strings. This also makes the solution generic: it can handle arbitrary dates, not just a single year.
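
For example, checking that conversion on a single value:

>>> import pandas as pd
>>> pd.to_datetime('2022-15' + '-Mon', format='%Y-%W-%a')
Timestamp('2022-04-11 00:00:00')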

Second, we augment the df with a "sentinel": a row for each id, placed on the first week of the next year (next year being the max year of all dates, plus 1), with state = 0. That way we don’t have to worry about whether a sequence ends with a 0 or not.

Then, we essentially group by id and apply the following logic: keep only the transitions, so, e.g., [1,1,1,0,0,1,0] becomes [1,.,.,0,.,1,0] (where '.' indicates a dropped value). That gives us the spans we are looking for (after subtracting one week from the 0-state dates).

Edit: speedup: instead of applying the masking logic to each group separately, we detect transitions globally (on the sentinel-augmented df, sorted by ['id', 'dt', 'state']). Since each id’s sequence in the augmented df ends with the sentinel (state 0), we are guaranteed to catch the first 1 of the next id.
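
To illustrate the transition mask on a plain Series (values made up for demonstration):

s = pd.Series([1, 1, 1, 0, 0, 1, 0])
mask = s != s.shift(fill_value=0)
# mask: [True, False, False, True, False, True, True]
# s[mask] -> [1, 0, 1, 0], i.e. only the transitions survive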

Putting it all together, including a postproc() to convert dates back into strings of year-week:

import numpy as np
import pandas as pd

def preproc(df):
    df = df.assign(dt=pd.to_datetime(df['dt'] + '-Mon', format='%Y-%W-%a'))
    max_year = df['dt'].max().year
    # first week next year:
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

# speed up
def proc(df):
    # keep only rows where the state changes (a global shift works thanks
    # to the sentinel: each id's kept sequence is 1, 0, 1, 0, ..., 0)
    mask = df['state'] != df['state'].shift(fill_value=0)
    df = df[mask]
    # number the transitions within each id and pivot them into columns:
    # even columns are starts (1-states), odd columns are ends (0-states)
    z = df.assign(c=df.groupby('id').cumcount()).set_index(['c', 'id'])['dt'].unstack('c')
    # an interval ends one week before the date of its 0-state
    z[z.columns[1::2]] -= pd.offsets.Week(1)
    cols = [
        f'{x}_{i}'
        for i in range(len(z.columns) // 2)
        for x in ['start', 'end']
    ]
    return z.set_axis(cols, axis=1)

def asweeks_str(t, nat='--'):
    # t == t is False for NaT, so missing cells fall back to the placeholder
    return f'{t:%Y-%W}' if t and t == t else nat

def postproc(df):
    # convert dates into strings '%Y-%W'
    return df.applymap(asweeks_str)

Examples

First, let’s use the example from the original question. Note that it doesn’t exercise some of the corner cases we are able to handle (more on that in a minute).

df = pd.DataFrame({
    'id': [101, 101, 101, 102, 102, 103, 103],
    'state': [0, 1, 0, 0, 1, 1, 0],
    'dt': ['2022-15', '2022-22', '2022-26', '2022-01', '2022-41', '2022-03', '2022-12'],
})

>>> postproc(proc(preproc(df)))
     start_0    end_0
id                   
101  2022-22  2022-25
102  2022-41  2022-52
103  2022-03  2022-11

But let’s generate some random data, to observe some corner cases:

def gen(n, nids=2):
    wk = np.random.randint(1, 53, n*nids)
    st = np.random.choice([0, 1], n*nids)
    ids = np.repeat(np.arange(nids) + 101, n)
    df = pd.DataFrame({
        'id': ids,
        'state': st,
        'dt': [f'2022-{w:02d}' for w in wk],
    })
    df = df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)
    return df

Now:

np.random.seed(0)  # reproducible example
df = gen(6, 3)

>>> df
     id  state       dt
0   101      0  2022-01
1   101      0  2022-04
2   101      1  2022-04
3   101      1  2022-40
4   101      1  2022-45
5   101      1  2022-48
6   102      1  2022-10
7   102      1  2022-20
8   102      0  2022-22
9   102      1  2022-24
10  102      0  2022-37
11  102      1  2022-51
12  103      1  2022-02
13  103      0  2022-07
14  103      0  2022-13
15  103      1  2022-25
16  103      1  2022-25
17  103      1  2022-39

There are several interesting things here. First, 101 starts with a 0 state, whereas 102 and 103 both start with 1. Then, there are repeated 1-states for all ids. There are also repeated weeks: '2022-04' for 101 and '2022-25' for 103.

Nevertheless, the aggregation works just fine and produces:

>>> postproc(proc(preproc(df)))
     start_0    end_0  start_1    end_1  start_2    end_2
id                                                       
101  2022-04  2022-52       --       --       --       --
102  2022-10  2022-21  2022-24  2022-36  2022-51  2022-52
103  2022-02  2022-06  2022-25  2022-52       --       --

Speed

np.random.seed(0)
n = 10
k = 100_000
df = gen(n, k)
%timeit preproc(df)
483 ms ± 4.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The processing itself takes less than 200ms for 1 million rows:

a = preproc(df)

%timeit proc(a)
185 ms ± 284 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As for the post-processing (converting dates back to year-week strings), it is the slowest thing of all:

b = proc(a)

%timeit postproc(b)
1.63 s ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For a speed-up of that post-processing, we can rely on the fact that there are only a small number of distinct dates that are week-starts (52 per year, plus NaT for the blank cells):

def postproc2(df, nat='--'):
    dct = {
        t: f'{t:%Y-%W}' if t and t == t else nat
        for t in df.stack().reset_index(drop=True).drop_duplicates()
    }
    return df.applymap(dct.get)

%timeit postproc2(b)
542 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

We could of course do something similar for preproc().
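
For instance, a hypothetical preproc2() (a sketch, not part of the original answer) could convert each distinct '%Y-%W' string only once and keep the sentinel logic unchanged:

def preproc2(df):
    # hypothetical variant: convert each distinct week string only once
    dct = {s: pd.to_datetime(s + '-Mon', format='%Y-%W-%a')
           for s in pd.unique(df['dt'])}
    df = df.assign(dt=df['dt'].map(dct))
    # the sentinel and sorting logic from preproc() is unchanged
    max_year = df['dt'].max().year
    tmax = pd.Timestamp(f'{max_year}-12-31') + pd.offsets.Week(1)
    sentinel = pd.DataFrame(
        pd.unique(df['id']),
        columns=['id']).assign(state=0, dt=tmax)
    df = pd.concat([df, sentinel])
    return df.sort_values(['id', 'dt', 'state']).reset_index(drop=True)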

Answered By: Pierre D

Another alternative:

res = (
    df.drop(columns="dt")
    .assign(week=df["dt"].str.split("-").str[1].astype("int"))
    .sort_values(["id", "week"])
    .assign(group=lambda df:
        df.groupby("id")["state"].diff().fillna(1).ne(0).cumsum()
    )
    .drop_duplicates(subset="group", keep="first")
    .loc[lambda df: df["state"].eq(1) | df["id"].eq(df["id"].shift())]
    .assign(no=lambda df: df.groupby("id")["state"].cumsum())
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: "end", 1: "start"}).fillna("52").astype("int")
)[["start", "end"]]

  • First add a new column week and sort by id and week. (The sorting might not be necessary if the data already comes sorted.)
  • Then look, id-group-wise, for blocks of consecutive 0s or 1s and, based on the result (stored in the new column group), drop all respective duplicates while keeping the first of each block (the others aren’t relevant according to the logic you’ve laid out); a small illustration of this step follows the list.
  • Afterwards also remove the 0-states at the start of an id-group.
  • On the result, identify id-group-wise the connected start-end pairs (stored in the new column no).
  • Then .pivot the frame: pull id and no into the index and state into the columns.
  • Afterwards fill the NaNs with 52 and do some casting, renaming, and sorting to get the result into better shape.
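
As a quick illustration of the block-detection step on a single id’s state sequence (values made up for demonstration):

s = pd.Series([0, 1, 1, 0, 1])
g = s.diff().fillna(1).ne(0).cumsum()
# s.diff() -> [NaN, 1, 0, -1, 1]
# g        -> [1, 2, 2, 3, 4]: each block of equal states gets its own label,
# so drop_duplicates(subset="group", keep="first") keeps one row per block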

If you really want to move the various start-end combinations into columns, then replace the code from the .pivot line onwards as follows:

res = (
    ...
    .pivot(index=["id", "no"], columns="state", values="week")
    .rename(columns={0: 1, 1: 0}).fillna("52").astype("int")
    .unstack().sort_index(level=1, axis=1)
)
res.columns = [f"{'start' if s == 0 else 'end'}_{n}" for s, n in res.columns]

Results with the dataframe from @Pierre’s answer:

state   start  end
id  no            
101 1       4   52
102 1      10   22
    2      24   37
    3      51   52
103 1       2    7
    2      25   52

or

     start_1  end_1  start_2  end_2  start_3  end_3
id                                                 
101      4.0   52.0      NaN    NaN      NaN    NaN
102     10.0   22.0     24.0   37.0     51.0   52.0
103      2.0    7.0     25.0   52.0      NaN    NaN
Answered By: Timus