How to select a group of rows from a dataframe if all rows follow a sequence

Question:

I’m currently working on a dataframe that contains processes (identified by ID) that may or may not reach the end of the process. The end of the process is defined as the activity with index=6. What I need to do is filter the processes (IDs) that are completed, meaning all 6 activities are done (so the process has activities with index equal to 1, 2, 3, 4, 5 and 6, in this specific order).

The dataframe is structured as follows:

ID          A  index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
2   activity7      1    
2   activity8      2    
2   activity9      3    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

And the resulting dataframe should be:

ID          A   index           
1   activity1      1 
1   activity2      2    
1   activity3      3    
1   activity4      4    
1   activity5      5    
1   activity6      6    
3   activity10     1    
3   activity11     2    
3   activity12     3  
3   activity13     4    
3   activity14     5    
3   activity15     6    

I’ve tried to do this with sum(): I create a new column ‘a’ and use gt() to check whether the sum of each group is greater than 20 (that is, at least 21, which is the sum of 1, 2, 3, 4, 5 and 6).

df['a'] = df['index'].groupby(df['index']).sum()
df2 = df[df['a'].gt(20)] 

This probably isn’t the best approach, so other approaches are more than welcome.
Any idea how to select rows based on this condition?

Asked By: mattiadt


Answers:

This may not be the fastest method, especially on a large dataframe, but it does the job:

df = df.loc[df.groupby(['ID'])['index'].transform(lambda x: list(x)==list(range(1,7)))]

Or this other variation:

df = df.loc[df.groupby('ID')['index'].filter(lambda x: list(x)==list(range(1,7))).index]

Output:


    ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
Answered By: Yolao_21
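
Regarding the speed caveat above: the lambda is evaluated group by group in Python, which is likely where the cost on a large dataframe would come from. Below is a minimal sketch of a variant that relies only on built-in groupby operations. It assumes the rows of each ID are already stored in process order, as in the question. The sample frame is rebuilt from the question's data, and the names position, in_sequence, all_in_sequence, has_six_rows and complete are illustrative only.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6,
    "A": [f"activity{i}" for i in range(1, 16)],
    "index": [1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3, 4, 5, 6],
})

# 1-based position of each row inside its ID group (assumes rows are in process order).
position = df.groupby("ID").cumcount() + 1

# A row is "in sequence" when its index value equals its position in the group.
in_sequence = df["index"].eq(position)

# Broadcast two per-group checks back to every row:
# every row of the group is in sequence, and the group has exactly 6 rows.
all_in_sequence = in_sequence.groupby(df["ID"]).transform("all")
has_six_rows = df.groupby("ID")["index"].transform("size").eq(6)

complete = df[all_in_sequence & has_six_rows]
print(complete)
```

On the sample data this keeps the same rows as the output shown above (IDs 1 and 3). Whether it is actually faster depends on the number of groups, so it is worth timing on the real data.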

Another possible solution:

out = (df.groupby('ID')
       .filter(lambda g: (len(g['index']) == 6) and 
       (g['index'].eq([*range(1,7)]).all())))

print(out)

   ID           A  index
0    1   activity1      1
1    1   activity2      2
2    1   activity3      3
3    1   activity4      4
4    1   activity5      5
5    1   activity6      6
9    3  activity10      1
10   3  activity11      2
11   3  activity12      3
12   3  activity13      4
13   3  activity14      5
14   3  activity15      6
Answered By: PaulS
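
A note on the sum-based attempt in the question: it groups by the index values rather than by ID, and the aggregated result has one row per group, so it does not line up with the rows of df. A sum check can be made to run with transform('sum') grouped by ID, but a sum of 21 does not guarantee the order 1 to 6. The sketch below illustrates this with a hypothetical extra process (ID 4, not in the original data) whose first two activities are swapped. The names by_sum and strict are illustrative.

```python
import pandas as pd

# Question's data plus a hypothetical ID 4 (an assumption for illustration)
# whose index values sum to 21 but are out of order.
df = pd.DataFrame({
    "ID": [1] * 6 + [2] * 3 + [3] * 6 + [4] * 6,
    "A": [f"activity{i}" for i in range(1, 22)],
    "index": [1, 2, 3, 4, 5, 6,
              1, 2, 3,
              1, 2, 3, 4, 5, 6,
              2, 1, 3, 4, 5, 6],
})

# Sum check: transform('sum') returns one value per row, aligned with df.
group_sum = df.groupby("ID")["index"].transform("sum")
by_sum = df[group_sum.eq(21)]

# Strict check: the group's index values must be exactly 1..6 in order.
strict = df.groupby("ID").filter(
    lambda g: g["index"].tolist() == list(range(1, 7))
)

print(by_sum["ID"].unique())  # IDs kept by the sum check
print(strict["ID"].unique())  # IDs kept by the strict order check
```

The sum check keeps the out-of-order ID 4, while the strict comparison drops it, which is why the answers above compare the whole sequence instead of its sum.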