How to assign unique grouping value for each sequence of consecutive True values in pandas boolean mask

Question:

I am trying to generate an appropriate grouping key for a pandas groupby.

Say I have a boolean mask like so: [False, False, True, False, True, True, False, True, True]

I would like the groupings to be like so: [0, 0, 1, 0, 2, 2, 0, 3, 3]

I can certainly create this array by looping over the mask, but I would like to use pandas or numpy builtins if possible, for ease of use and perhaps vectorization.

(If no builtin exists, I would appreciate a more Pythonic way of doing this than a straight loop with a state flag and rank counter.)
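
For reference, the loop I have in mind looks roughly like the sketch below. The function name label_true_runs is only an illustrative choice; it serves as a baseline to compare any vectorized answer against.

```python
def label_true_runs(mask):
    """Label each run of consecutive True values 1, 2, 3, ...; False gets 0."""
    labels = []
    run_id = 0       # rank counter: number of True runs seen so far
    in_run = False   # state flag: was the previous element True?
    for flag in mask:
        if flag and not in_run:
            run_id += 1  # a new run of True values starts here
        labels.append(run_id if flag else 0)
        in_run = bool(flag)
    return labels


mask = [False, False, True, False, True, True, False, True, True]
result = label_true_runs(mask)
print(result)  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```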

Asked By: naftalimich


Answers:

Original Answer:

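import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
# Increment a counter wherever the value differs from the previous one
d = s.astype(int).diff().ne(0).cumsum()
# Put every False position in group 0
d.loc[~(s)] = 0
# Renumber the groups in order of first appearance: [0, 0, 1, 0, 2, 2, 0, 3, 3]
pd.factorize(d)[0].tolist()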

Slightly Modified Original Answer (works if first item is True):

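import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
# Sort by value so the 0 group (the False positions) is factorized first
dsort = d.sort_values()
dindex = dsort.index
# Factorize in sorted order, then restore the original positions
pd.Series(pd.factorize(dsort)[0],index = dindex).sort_index().tolist()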
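
To see why the sorted variant matters, here is a minimal sketch that wraps both versions in functions and runs them on a mask that starts with True. The function names labels_original and labels_modified and the test mask are assumptions made for illustration only.

```python
import pandas as pd


def labels_original(mask):
    s = pd.Series(mask, dtype=bool)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    # factorize numbers values in order of first appearance
    return pd.factorize(d)[0].tolist()


def labels_modified(mask):
    s = pd.Series(mask, dtype=bool)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    # sorting first makes the 0 group (False) the first value factorize sees
    dsort = d.sort_values()
    codes = pd.factorize(dsort)[0]
    return pd.Series(codes, index=dsort.index).sort_index().tolist()


leading_true = [True, True, False, True]
print(labels_original(leading_true))  # [0, 0, 1, 2]  (False ends up in group 1)
print(labels_modified(leading_true))  # [1, 1, 0, 2]  (False stays in group 0)
```

One edge case to be aware of: if the mask contains no False at all, there is no 0 group to sort first, so the sorted variant still labels the single run of True values as 0.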

Alternative way:

Generate List and put into series.

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)

Find items that are sequential

d = s.astype(int).diff().ne(0).cumsum().reset_index()

Locate the first True in each group

d.loc[s].groupby(0)['index'].first().rename_axis(None)

Factorize new grouping and put into series

f = pd.factorize(d.loc[s].groupby(0)['index'].first().rename_axis(None))
s2 = pd.Series(f[0]+1,index = f[1])

Use reindex and forward fill all the missing spaces. Fill any NaNs with 0. Lastly, replace all places that were False with zeros. (Recent pandas versions deprecate fillna(method='ffill'), so ffill() is used here.)

s2 = s2.reindex(s.index).ffill().fillna(0)
s2.loc[~(s)] = 0
s2.tolist()
Answered By: rhug123

OP here, see the accepted answer.

I just spent some time with it and wanted to say a word about how it works, so that hopefully it will come a bit easier to the next reader who wants to work it out.

Here is the code for reference:

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()

The key pieces of code are the diff, followed by the ne(0), followed by the cumsum().

Here are the key insights.

diff:
After the diff, the array will take on the following values:

The first value will be NaN (because there is no preceding value to diff it with). Thereafter, every True will take on a 1 if it follows a False, or a 0 if it follows another True.

So a sequence of True values will look like this:

1, 0, 0, 0

Or, when the sequence of True values begins with the first element in the array:

NaN, 0, 0, 0

(Nearly the same logic applies to the sequences of False values, except that a False following a True gives -1. We normalize those at the end with the d.loc[~(s)] = 0 statement.)

ne(0) normalizes NaN, 1 and -1 to True because they are all != 0, and maps every 0 to False.

cumsum() assigns a value (one greater than the previous group's) to the first True in each sequence of True values. That value carries forward to all the other True values in the same sequence, since their value after the ne(0) call is False, which adds 0.

So now we have what we want, with all sequences of True values in the series mapped to a unique integer.
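
To make the three steps concrete, here is a small sketch that keeps each intermediate result as a column of a DataFrame. The column names (diff, changed, run_id) are just labels I picked for illustration.

```python
import pandas as pd

s = pd.Series([False, False, True, False, True, True, False, True, True])

steps = pd.DataFrame({"mask": s})
# diff:    NaN, 0, 1, -1, 1, 0, -1, 1, 0
steps["diff"] = s.astype(int).diff()
# changed: True, False, True, True, True, False, True, True, False
steps["changed"] = steps["diff"].ne(0)
# run_id:  1, 1, 2, 3, 4, 4, 5, 6, 6
steps["run_id"] = steps["changed"].cumsum()

print(steps)
```

Note that at this point the False runs also have their own ids (1, 3 and 5 here), which is why the next step is needed.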

Then we make the call to d.loc[~(s)] = 0 which assigns all the False values to group 0.

We could stop here, without making the pd.factorize call, which would output [0, 0, 2, 0, 4, 4, 0, 6, 6] for the [False, False, True, False, True, True, False, True, True] mask I posited in the question. To match the output I stipulated in the question, you need to call pd.factorize.
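
Since the goal was a groupby key, the un-factorized labels are already usable as one: the ids only need to be distinct, not consecutive. A minimal sketch, assuming a hypothetical DataFrame with a flag column holding the mask and a made-up value column to aggregate:

```python
import pandas as pd

# Hypothetical data: `flag` is the mask from the question, `value` is made up
df = pd.DataFrame({
    "flag": [False, False, True, False, True, True, False, True, True],
    "value": [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

s = df["flag"]
run_id = s.astype(int).diff().ne(0).cumsum()
run_id.loc[~s] = 0  # [0, 0, 2, 0, 4, 4, 0, 6, 6]

# Sum `value` within each run of True, leaving the False rows out
sums = df.loc[s, "value"].groupby(run_id[s]).sum()
print(sums)  # groups 2, 4, 6 with sums 30, 110, 170
```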

A word of caution for calling pd.factorize:

If the first value in the input array is True rather than False, the False values will not map to the zero group. In my use case, that was reason enough to do without that call.
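
If you want consecutive labels and a guaranteed zero group without pd.factorize, one option is to count only the starts of True runs. This is a different technique from the accepted answer, sketched here under the assumption that the input is a plain boolean sequence; the function name label_runs is hypothetical.

```python
import pandas as pd


def label_runs(mask):
    s = pd.Series(mask, dtype=bool)
    # A run starts where the current value is True and the previous one is not;
    # fill_value=False makes a True at position 0 count as a start
    starts = s & ~s.shift(fill_value=False)
    # Counting the starts numbers the runs 1, 2, 3, ...; False positions get 0
    return starts.astype(int).cumsum().where(s, 0).tolist()


print(label_runs([False, False, True, False, True, True, False, True, True]))
# [0, 0, 1, 0, 2, 2, 0, 3, 3]
print(label_runs([True, False, True]))
# [1, 0, 2]
```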

Answered By: naftalimich

If using the skimage library is not an issue, you can do this:

import numpy as np
from skimage import measure

l = [False, False, True, False, True, True, False, True, True]
# Label each connected run of True values; False is the background (0)
labels = measure.label(np.array(l))
# out: array([0, 0, 1, 0, 2, 2, 0, 3, 3], dtype=int32)
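
If adding skimage as a dependency is a concern, the same labels can be produced for a 1-D mask with numpy alone. This is only a sketch of an alternative, not what measure.label does internally, and it handles 1-D input only.

```python
import numpy as np

mask = np.array([False, False, True, False, True, True, False, True, True])

# Previous element of each position; prepending False lets a run at index 0 count
prev = np.concatenate(([False], mask[:-1]))
# True exactly at the first element of each run of True values
starts = mask & ~prev
# Running count of starts, zeroed out wherever the mask is False
labels = np.cumsum(starts) * mask

print(labels.tolist())  # [0, 0, 1, 0, 2, 2, 0, 3, 3]
```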
Answered By: yazan sayed