Group by and find consecutive time and create a flag in Python

Question:

I have the following data:

id  name    unused      time
1   a         1     2/21/2017 18:01:31.168
1   a         2     2/21/2017 18:01:31.168
1   a         3     2/21/2017 18:11:44.054
1   a         4     2/21/2017 18:19:03.147
1   b         5     2/21/2017 18:19:03.147
1   b         6     2/21/2017 21:55:43.927
1   b         7     2/21/2017 22:10:29.699
1   b         8     2/21/2017 22:10:29.699
2   a         9     2/21/2017 23:36:30.239
2   a        10     2/21/2017 23:45:40.005
2   a        11     2/22/2017 00:05:43.466
2   a        12     2/22/2017 00:05:43.466
2   b        13     2/22/2017 00:16:00.646
2   b        14     2/22/2017 11:43:16.250
2   b        15     2/22/2017 11:43:16.250
2   b        16     2/22/2017 14:02:10.531

I want to group it by id and name, look for consecutive rows with the same time stamp, and create a flag for them. For example, the 1st and 2nd rows have the same id, name and time, so I want 1 for both of them. Rows that have no such consecutive match should get 0.

This is the output I am trying to achieve:

id  name    unused      time               flag
1   a         1     2/21/2017 18:01:31.168  1
1   a         2     2/21/2017 18:01:31.168  1
1   a         3     2/21/2017 18:11:44.054  0
1   a         4     2/21/2017 18:19:03.147  0
1   b         5     2/21/2017 18:19:03.147  0
1   b         6     2/21/2017 21:55:43.927  0
1   b         7     2/21/2017 22:10:29.699  1
1   b         8     2/21/2017 22:10:29.699  1
2   a         9     2/21/2017 23:36:30.239  0
2   a        10     2/21/2017 23:45:40.005  0
2   a        11     2/22/2017 00:05:43.466  1
2   a        12     2/22/2017 00:05:43.466  1
2   b        13     2/22/2017 00:16:00.646  0
2   b        14     2/22/2017 11:43:16.250  1
2   b        15     2/22/2017 11:43:16.250  1
2   b        16     2/22/2017 14:02:10.531  0

Here is what I have tried. First, I sort the data:

data.sort_values(['id', 'name', 'time'])

Then I group it:

data.sort_values(['id', 'name', 'time']).groupby(['id', 'name'])

But I am not able to create the flag after that. I could write a for loop that walks through all the rows and checks the condition, but I think there should be a more efficient solution, because I need to do this for millions of rows.

Can anybody help me solve this?

Thanks

Asked By: Observer

||

Answers:

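One approach is to use shift to compare each row with the row before it and the row after it on your columns of interest. A row gets flag 1 if it matches either neighbour. This relies on matching rows being adjacent, so sort by id, name and time first if the data is not already in that order.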

eval_cols = df[['id', 'name', 'time']]
# A row is flagged when all three columns equal those of the previous or the next row.
df['flag'] = ((eval_cols == eval_cols.shift()).all(axis=1) |
              (eval_cols == eval_cols.shift(-1)).all(axis=1)).astype(int)

Demo

>>> ((eval_cols == eval_cols.shift()).all(1) | 
     (eval_cols == eval_cols.shift(-1)).all(1)).astype(int)

0     1
1     1
2     0
3     0
4     0
5     0
6     1
7     1
8     0
9     0
10    1
11    1
12    0
13    1
14    1
15    0
dtype: int32
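
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same shift comparison on the sample rows from the question. The names rows, same_as_prev and same_as_next are only illustrative, and the time column is left as plain strings, which is enough for an equality check.

```python
import pandas as pd

# Sample rows copied from the question; time stays a string here.
rows = [
    (1, "a", 1, "2/21/2017 18:01:31.168"),
    (1, "a", 2, "2/21/2017 18:01:31.168"),
    (1, "a", 3, "2/21/2017 18:11:44.054"),
    (1, "a", 4, "2/21/2017 18:19:03.147"),
    (1, "b", 5, "2/21/2017 18:19:03.147"),
    (1, "b", 6, "2/21/2017 21:55:43.927"),
    (1, "b", 7, "2/21/2017 22:10:29.699"),
    (1, "b", 8, "2/21/2017 22:10:29.699"),
    (2, "a", 9, "2/21/2017 23:36:30.239"),
    (2, "a", 10, "2/21/2017 23:45:40.005"),
    (2, "a", 11, "2/22/2017 00:05:43.466"),
    (2, "a", 12, "2/22/2017 00:05:43.466"),
    (2, "b", 13, "2/22/2017 00:16:00.646"),
    (2, "b", 14, "2/22/2017 11:43:16.250"),
    (2, "b", 15, "2/22/2017 11:43:16.250"),
    (2, "b", 16, "2/22/2017 14:02:10.531"),
]
df = pd.DataFrame(rows, columns=["id", "name", "unused", "time"])

eval_cols = df[["id", "name", "time"]]
# True where id, name and time all equal the previous row's values.
same_as_prev = (eval_cols == eval_cols.shift()).all(axis=1)
# True where id, name and time all equal the next row's values.
same_as_next = (eval_cols == eval_cols.shift(-1)).all(axis=1)
df["flag"] = (same_as_prev | same_as_next).astype(int)

print(df["flag"].tolist())
# [1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]
```

Rows 4 and 5 share a time stamp but differ in name, so both stay at 0, which matches the expected output in the question.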
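Answered By: miradulo

Another option is to number each run of identical adjacent rows and flag the runs that contain two or more rows.

# Join id, name and time into one key per row.
col1 = df1[['id', 'name', 'time']].astype(str).apply('*'.join, axis=1)
# Number each run of identical adjacent keys.
col2 = col1.ne(col1.shift()).cumsum()
# Flag every row of a run that has at least two rows.
df1.assign(flag=col2).groupby(col2).apply(lambda dd: dd.assign(flag=int(len(dd) >= 2)))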

out:

id  name    unused      time               flag
1   a         1     2/21/2017 18:01:31.168  1
1   a         2     2/21/2017 18:01:31.168  1
1   a         3     2/21/2017 18:11:44.054  0
1   a         4     2/21/2017 18:19:03.147  0
1   b         5     2/21/2017 18:19:03.147  0
1   b         6     2/21/2017 21:55:43.927  0
1   b         7     2/21/2017 22:10:29.699  1
1   b         8     2/21/2017 22:10:29.699  1
2   a         9     2/21/2017 23:36:30.239  0
2   a        10     2/21/2017 23:45:40.005  0
2   a        11     2/22/2017 00:05:43.466  1
2   a        12     2/22/2017 00:05:43.466  1
2   b        13     2/22/2017 00:16:00.646  0
2   b        14     2/22/2017 11:43:16.250  1
2   b        15     2/22/2017 11:43:16.250  1
2   b        16     2/22/2017 14:02:10.531  0
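
If calling apply on every run turns out to be slow on millions of rows, a possible alternative is to count the rows per (id, name, time) combination with a groupby transform. This is a sketch under one assumption that the question does not state outright: that "consecutive" simply means "the same id, name and time occurs more than once", which is what adjacency amounts to once the data is sorted by those columns. The names keys, group_size and alt_flag are illustrative, and only the first eight sample rows are used.

```python
import pandas as pd

# First eight sample rows from the question; time stays a string here.
rows = [
    (1, "a", 1, "2/21/2017 18:01:31.168"),
    (1, "a", 2, "2/21/2017 18:01:31.168"),
    (1, "a", 3, "2/21/2017 18:11:44.054"),
    (1, "a", 4, "2/21/2017 18:19:03.147"),
    (1, "b", 5, "2/21/2017 18:19:03.147"),
    (1, "b", 6, "2/21/2017 21:55:43.927"),
    (1, "b", 7, "2/21/2017 22:10:29.699"),
    (1, "b", 8, "2/21/2017 22:10:29.699"),
]
df = pd.DataFrame(rows, columns=["id", "name", "unused", "time"])

keys = ["id", "name", "time"]
# Size of each (id, name, time) group, broadcast back onto every row.
group_size = df.groupby(keys)["unused"].transform("size")
df["flag"] = (group_size > 1).astype(int)

# Same idea without a groupby: mark every member of a duplicated key.
alt_flag = df.duplicated(subset=keys, keep=False).astype(int)

print(df["flag"].tolist())
# [1, 1, 0, 0, 0, 0, 1, 1]
```

Unlike the run-based code above, this flags repeated keys even when the matching rows are not next to each other, so the two approaches only agree on data where identical keys are adjacent.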
Answered By: G.G