Group by and find consecutive time and create a flag in Python
Question:
The following is the data I have:
id name unused time
1 a 1 2/21/2017 18:01:31.168
1 a 2 2/21/2017 18:01:31.168
1 a 3 2/21/2017 18:11:44.054
1 a 4 2/21/2017 18:19:03.147
1 b 5 2/21/2017 18:19:03.147
1 b 6 2/21/2017 21:55:43.927
1 b 7 2/21/2017 22:10:29.699
1 b 8 2/21/2017 22:10:29.699
2 a 9 2/21/2017 23:36:30.239
2 a 10 2/21/2017 23:45:40.005
2 a 11 2/22/2017 00:05:43.466
2 a 12 2/22/2017 00:05:43.466
2 b 13 2/22/2017 00:16:00.646
2 b 14 2/22/2017 11:43:16.250
2 b 15 2/22/2017 11:43:16.250
2 b 16 2/22/2017 14:02:10.531
I want to group it by id and name and look for consecutive rows with the same time stamp, and create a flag for them. For example, the 1st and the 2nd rows have the same id, name and time, so I want 1 for both rows; where the time is not repeated, the flag should be 0.
The following is the output I am trying to achieve:
id name unused time flag
1 a 1 2/21/2017 18:01:31.168 1
1 a 2 2/21/2017 18:01:31.168 1
1 a 3 2/21/2017 18:11:44.054 0
1 a 4 2/21/2017 18:19:03.147 0
1 b 5 2/21/2017 18:19:03.147 0
1 b 6 2/21/2017 21:55:43.927 0
1 b 7 2/21/2017 22:10:29.699 1
1 b 8 2/21/2017 22:10:29.699 1
2 a 9 2/21/2017 23:36:30.239 0
2 a 10 2/21/2017 23:45:40.005 0
2 a 11 2/22/2017 00:05:43.466 1
2 a 12 2/22/2017 00:05:43.466 1
2 b 13 2/22/2017 00:16:00.646 0
2 b 14 2/22/2017 11:43:16.250 1
2 b 15 2/22/2017 11:43:16.250 1
2 b 16 2/22/2017 14:02:10.531 0
The following is what I have tried so far.
First, I sort the data:
data.sort_values(['id', 'name', 'time'])
Then I group it:
data.sort_values(['id', 'name', 'time']).groupby(['id', 'name'])
But I am not able to create the flag after that. I could write a for loop over all the rows and check the condition, but I think there should be a more efficient solution because I need to do this for millions of rows.
Can anybody help me solve this?
Thanks
Answers:
One approach may be to just use shift to compare each row against the one before it and the one after it on your columns of interest.
# Columns that must match for a row to count as a consecutive duplicate
eval_cols = df[['id', 'name', 'time']]
# Flag a row when it equals the previous row or the next row on all three columns
df['flag'] = ((eval_cols == eval_cols.shift()).all(1) |
              (eval_cols == eval_cols.shift(-1)).all(1)).astype(int)
Demo
>>> ((eval_cols == eval_cols.shift()).all(1) |
(eval_cols == eval_cols.shift(-1)).all(1)).astype(int)
0 1
1 1
2 0
3 0
4 0
5 0
6 1
7 1
8 0
9 0
10 1
11 1
12 0
13 1
14 1
15 0
dtype: int32
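A minimal, self-contained sketch of the same idea, assuming the time column is parsed with pd.to_datetime and the data is sorted by id, name and time as in the question (only the first five example rows are reconstructed here):
import pandas as pd

# Reconstruct a small slice of the example data (layout assumed from the question).
df = pd.DataFrame({
    'id':     [1, 1, 1, 1, 1],
    'name':   ['a', 'a', 'a', 'a', 'b'],
    'unused': [1, 2, 3, 4, 5],
    'time':   ['2/21/2017 18:01:31.168', '2/21/2017 18:01:31.168',
               '2/21/2017 18:11:44.054', '2/21/2017 18:19:03.147',
               '2/21/2017 18:19:03.147'],
})

# Parse the timestamps and sort so that equal (id, name, time) rows sit next to each other.
df['time'] = pd.to_datetime(df['time'])
df = df.sort_values(['id', 'name', 'time']).reset_index(drop=True)

# Same shift comparison as above: flag rows that match a neighbour on all three columns.
eval_cols = df[['id', 'name', 'time']]
df['flag'] = ((eval_cols == eval_cols.shift()).all(1) |
              (eval_cols == eval_cols.shift(-1)).all(1)).astype(int)
print(df)
This prints flag values 1, 1, 0, 0, 0 for the five rows, matching the expected output for that slice.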
Another option is to build a key from the three columns, number each run of consecutive identical keys with cumsum, and then flag the runs that contain more than one row:
col1 = df1[['id', 'name', 'time']].astype(str).apply('*'.join, axis=1)  # one key per row
col2 = col1.ne(col1.shift()).cumsum()  # increments whenever the key changes
df1.assign(flag=col2).groupby(col2).apply(
    lambda dd: dd.assign(flag=1) if len(dd) >= 2 else dd.assign(flag=0))
Output:
id name unused time flag
1 a 1 2/21/2017 18:01:31.168 1
1 a 2 2/21/2017 18:01:31.168 1
1 a 3 2/21/2017 18:11:44.054 0
1 a 4 2/21/2017 18:19:03.147 0
1 b 5 2/21/2017 18:19:03.147 0
1 b 6 2/21/2017 21:55:43.927 0
1 b 7 2/21/2017 22:10:29.699 1
1 b 8 2/21/2017 22:10:29.699 1
2 a 9 2/21/2017 23:36:30.239 0
2 a 10 2/21/2017 23:45:40.005 0
2 a 11 2/22/2017 00:05:43.466 1
2 a 12 2/22/2017 00:05:43.466 1
2 b 13 2/22/2017 00:16:00.646 0
2 b 14 2/22/2017 11:43:16.250 1
2 b 15 2/22/2017 11:43:16.250 1
2 b 16 2/22/2017 14:02:10.531 0
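For millions of rows the per-group apply above can get slow. Here is a sketch of the same run-length idea expressed with a vectorized group size instead of apply, assuming df1 is the frame already sorted by id, name and time:
# Key each row by the three columns and number runs of consecutive identical keys.
key = df1[['id', 'name', 'time']].astype(str).apply('*'.join, axis=1)
run_id = key.ne(key.shift()).cumsum()

# Flag every row whose run contains at least two rows.
df1['flag'] = (run_id.groupby(run_id).transform('size') >= 2).astype(int)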