How to assign unique grouping value for each sequence of consecutive True values in pandas boolean mask
Question:
I am trying to generate an appropriate grouping key for a pandas groupby.
Say I have a boolean mask like so: [False, False, True, False, True, True, False, True, True]
I would like the groupings to be like so: [0, 0, 1, 0, 2, 2, 0, 3, 3]
I can certainly create this array with a loop over the mask, but I would prefer to use pandas or NumPy builtins if possible, for ease of use and perhaps vectorization.
(If no builtin exists, I would appreciate a more Pythonic way of doing this than a straight loop with a state flag and a rank counter.)
Answers:
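Original Answer:
import pandas as pd
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
# Number every run of equal values: a new id starts wherever the value changes
d = s.astype(int).diff().ne(0).cumsum()
# Put every False position in group 0
d.loc[~s] = 0
# Renumber the remaining ids consecutively, in order of first appearance
pd.factorize(d)[0].tolist()  # [0, 0, 1, 0, 2, 2, 0, 3, 3]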
Slightly Modified Original Answer (works if the first item is True):
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~s] = 0
# Sort so that group 0 (the False positions) is factorized first, then restore the original order
dsort = d.sort_values()
dindex = dsort.index
pd.Series(pd.factorize(dsort)[0], index=dindex).sort_index().tolist()
Alternative way:
Generate the list and put it into a Series.
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
Find the runs of consecutive equal values.
d = s.astype(int).diff().ne(0).cumsum().reset_index()
Locate the first True in each group.
d.loc[s].groupby(0)['index'].first().rename_axis(None)
Factorize the new grouping and put it into a Series.
f = pd.factorize(d.loc[s].groupby(0)['index'].first().rename_axis(None))
s2 = pd.Series(f[0] + 1, index=f[1])
Use reindex and forward fill all the missing positions. Fill any remaining NaNs with 0. Lastly, reset every position that was False to zero.
s2 = s2.reindex(s.index).ffill().fillna(0).astype(int)
s2.loc[~s] = 0
s2.tolist()
OP here; see the accepted answer.
I just spent some time with it and wanted to say a word about how it works, so that it will hopefully come a bit easier to the next reader who wants to work it out.
Here is the code for reference:
l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)
d = s.astype(int).diff().ne(0).cumsum()
d.loc[~(s)] = 0
pd.factorize(d)[0].tolist()
The key code pieces are the diff, followed by the ne(0), followed by the cumsum().
Here are the key insights.
diff:
After the diff, the series takes on the following values. The first value is NaN (because there is no preceding value to diff it with). Thereafter, every True becomes 1 if it follows a False, or 0 if it follows another True.
So all but one sequence of True values will look like this:
1, 0, 0, 0
Or, when a sequence of True values begins with the first element in the array:
NaN, 0, 0, 0
(Nearly the same logic applies to the sequences of False values, which start with -1 instead of 1, but we normalize those at the end with the d.loc[~(s)] = 0 statement.)
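To make the diff step concrete, here is a minimal sketch that prints the raw differences for the mask from the question. The variable name raw is only illustrative and does not appear in the answer.

```python
import pandas as pd

l = [False, False, True, False, True, True, False, True, True]
s = pd.Series(l)

# diff() subtracts each element's predecessor from it; the first element has
# no predecessor, so it comes out as NaN.
raw = s.astype(int).diff()

print(raw.tolist())
# [nan, 0.0, 1.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0]
```

Each 1.0 marks the start of a True sequence and each -1.0 marks the start of a False sequence.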
ne(0):
Normalizes NaN, 1 and -1 to True, because they all != 0; every 0 becomes False.
cumsum():
Assigns a value (one greater than that of the preceding sequence) to the first True in each sequence of True values, and that value carries forward to all the other True values in the same sequence group (since their value from the diff call is 0, they add nothing to the running sum).
So now we have what we want, with each sequence of True values in the series mapped to a unique integer.
Then we make the call to d.loc[~(s)] = 0, which assigns all the False values to group 0.
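The remaining steps can be traced the same way. This is a sketch with illustrative variable names (changed, run_id, labels); it uses where(s, 0) in place of the in-place d.loc[~(s)] = 0 assignment, which should have the same effect here without mutating the intermediate series.

```python
import pandas as pd

s = pd.Series([False, False, True, False, True, True, False, True, True])

# True wherever a new run starts (NaN, 1 and -1 all differ from 0).
changed = s.astype(int).diff().ne(0)

# Running count of run starts: one id per run, for True and False runs alike.
run_id = changed.cumsum()

# Keep the id where the mask is True, use 0 everywhere else.
labels = run_id.where(s, 0)

print(changed.tolist())
# [True, False, True, True, True, False, True, True, False]
print(run_id.tolist())
# [1, 1, 2, 3, 4, 4, 5, 6, 6]
print(labels.tolist())
# [0, 0, 2, 0, 4, 4, 0, 6, 6]
```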
We could stop here (without making the pd.factorize call), which would output [0, 0, 2, 0, 4, 4, 0, 6, 6] for the [False, False, True, False, True, True, False, True, True] mask I posited in the question. To get the output to match what I stipulated in the question, you need to call pd.factorize.
A word of caution about calling pd.factorize:
If the first value in the input array is True rather than False, the False values will not map to the zero group. In my use case, that was reason enough to skip the call.
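The caveat is easy to reproduce with a mask that starts with True. In the sketch below, run_ids is a hypothetical helper that just wraps the lines from the answer; the second half applies the sort-then-factorize variant from the modified answer, which keeps the False positions in group 0 for this input.

```python
import pandas as pd

def run_ids(mask):
    """Hypothetical helper: the answer's steps up to, but not including, factorize."""
    s = pd.Series(mask)
    d = s.astype(int).diff().ne(0).cumsum()
    d.loc[~s] = 0
    return d

d = run_ids([True, True, False, True])
print(d.tolist())
# [1, 1, 0, 3]

# factorize numbers values in order of first appearance, so the leading
# True run takes code 0 and the False position ends up with code 1.
naive = pd.factorize(d)[0].tolist()
print(naive)
# [0, 0, 1, 2]

# Sorting first guarantees that 0 (the False group) is seen first.
dsort = d.sort_values()
fixed = pd.Series(pd.factorize(dsort)[0], index=dsort.index).sort_index().tolist()
print(fixed)
# [1, 1, 0, 2]
```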
If using the skimage library is not an issue, you can do this:
import numpy as np
from skimage import measure
l = [False, False, True, False, True, True, False, True, True]
labels = measure.label(np.array(l))
out: array([0, 0, 1, 0, 2, 2, 0, 3, 3], dtype=int32)
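Since the question also asks about NumPy builtins, here is a NumPy-only sketch of the same idea. It is not taken from any of the answers above: it marks the position where each True sequence starts and takes a cumulative sum of those markers. The names mask, prev, starts and labels are illustrative.

```python
import numpy as np

mask = np.array([False, False, True, False, True, True, False, True, True])

# The mask shifted right by one position, padded with False at the front.
prev = np.concatenate(([False], mask[:-1]))

# A True sequence starts wherever the mask is True and its predecessor is not.
starts = mask & ~prev

# Count the starts seen so far, then zero out the False positions.
labels = np.where(mask, np.cumsum(starts), 0)

print(labels.tolist())
# [0, 0, 1, 0, 2, 2, 0, 3, 3]
```

Because only the starts of True sequences are counted, the labels come out consecutive without a factorize step, and a mask that begins with True should still leave the False positions at 0.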