pandas: add a grouping variable to clusters of rows that meet a criteria

Question:

I have this dataframe:

df = pd.DataFrame({'forms_a_cluster': [False, False, True, True, True, False, False, False,
True, True, False, True, True, True, False],
'cluster_number':[False, False, 1, 1, 1, False, False, False,
              2, 2, False, 3, 3, 3, False]})

The idea is that I have some criteria which, when certain rows have met it, selects those cases as True, and when consecutive rows meet the criteria, they then form a cluster. I want to be able to label each cluster as cluster_1, cluster_2, cluster_3 etc. I’ve given an example of the hoped for output with the column cluster_number. But I have no idea how to do this, given that in the real data, I have to do it many times on different datasets which have a different number of rows and the cluster sizes will be different every time. Do you have any idea how to go about this? Thanks in advance!

Asked By: steve zissou

||

Answers:

You can use a groupby.ngroup on the groups of successive values pre-filtered to the True:

# group by successive values
m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()

# filter groups of True, add group number
# fill values with False
df['cluster_number'] = (m[df['forms_a_cluster']]
                        .groupby(m).ngroup().add(1)
                        .reindex(df.index, fill_value=False)
                        )

Or with arithmetics:

m = df['forms_a_cluster'].ne(df['forms_a_cluster'].shift()).cumsum()
df['cluster_number'] = (df['forms_a_cluster']
                        .mask(df['forms_a_cluster'],
                              m//2 + df['forms_a_cluster'].iloc[0])
                       )

Output:

    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False

Other example:

    forms_a_cluster cluster_number
0              True              1
1              True              1
2             False          False
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
Answered By: mozway

Create consecutive groups and add factorize for starting groups by 1 with consecutive groups by Series.cumsum with inverted:

#mask
m = df['forms_a_cluster']
#create column filled by original data
df['cluster_number'] = df['forms_a_cluster']
df.loc[m, 'cluster_number'] = pd.factorize((~m).cumsum()[m])[0] + 1
    forms_a_cluster cluster_number
0             False          False
1             False          False
2              True              1
3              True              1
4              True              1
5             False          False
6             False          False
7             False          False
8              True              2
9              True              2
10            False          False
11             True              3
12             True              3
13             True              3
14            False          False

m = df['forms_a_cluster']
df['cluster_number'] = df['forms_a_cluster']
df.loc[m, 'cluster_number'] = pd.factorize((~m).cumsum()[m])[0] + 1
print (df)
    forms_a_cluster cluster_number
0              True              1
1             False          False
2              True              2
3              True              2
4              True              2
5             False          False
6             False          False
7             False          False
8              True              3
9              True              3
10            False          False
11             True              4
12             True              4
13             True              4
14            False          False
Answered By: jezrael
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.