Subtract dates on different lags

Question:

I have a dataset with two columns, Date1 and Date2. My goal is to subtract date2 from date1 at lag 0, lag 1, lag 2, …, lag 6. If date2 is bigger, then flag it as 1 otherwise 0. I need to do this iteratively so

date2[0] – date1[0] date1[1] … date1[5]

date2[1] – date1[1] date1[2] … date1[6]

etc.

                Date1               Date2
0  2021-06-15 00:25:29 2021-06-15 10:05:50
1  2021-06-15 13:32:01 2021-06-15 14:17:30
2  2021-06-15 17:59:37 2021-06-15 18:12:30
3  2021-06-17 01:01:16 2021-06-17 13:30:23
4  2021-06-17 14:07:11 2021-06-17 14:34:45
5  2021-06-17 18:30:24 2021-06-17 19:22:02
6  2021-06-17 19:42:28 2021-06-18 10:11:04
7  2021-06-18 12:54:50 2021-06-18 13:25:16
8  2021-06-18 16:59:40 2021-06-18 17:22:23
9  2021-06-18 17:49:54 2021-06-18 18:25:53
10 2021-06-18 19:57:39 2021-06-18 20:43:11
11 2021-06-21 13:52:28 2021-06-21 14:03:31
12 2021-06-21 15:44:46 2021-06-21 18:31:21
13 2021-06-21 20:03:37 2021-06-21 20:59:54
14 2021-06-22 18:39:22 2021-06-22 19:23:28
15 2021-06-23 19:45:54 2021-06-23 19:52:26
16 2021-06-23 19:59:33 2021-06-23 20:00:43
17 2021-06-24 12:53:31 2021-06-25 13:25:30
18 2021-06-25 17:57:05 2021-06-25 19:32:37
19 2021-06-28 13:34:25 2021-06-28 14:00:04

DATA

{‘Date1’: {0: Timestamp(‘2021-06-15 00:25:29’), 1: Timestamp(‘2021-06-15 13:32:01’), 2: Timestamp(‘2021-06-15 17:59:37’), 3: Timestamp(‘2021-06-17 01:01:16’), 4: Timestamp(‘2021-06-17 14:07:11’), 5: Timestamp(‘2021-06-17 18:30:24’), 6: Timestamp(‘2021-06-17 19:42:28’), 7: Timestamp(‘2021-06-18 12:54:50’), 8: Timestamp(‘2021-06-18 16:59:40’), 9: Timestamp(‘2021-06-18 17:49:54’), 10: Timestamp(‘2021-06-18 19:57:39’), 11: Timestamp(‘2021-06-21 13:52:28’), 12: Timestamp(‘2021-06-21 15:44:46’), 13: Timestamp(‘2021-06-21 20:03:37’), 14: Timestamp(‘2021-06-22 18:39:22’), 15: Timestamp(‘2021-06-23 19:45:54’), 16: Timestamp(‘2021-06-23 19:59:33’), 17: Timestamp(‘2021-06-24 12:53:31’), 18: Timestamp(‘2021-06-25 17:57:05’), 19: Timestamp(‘2021-06-28 13:34:25’)}, ‘Date2’: {0: Timestamp(‘2021-06-15 10:05:50’), 1: Timestamp(‘2021-06-15 14:17:30’), 2: Timestamp(‘2021-06-15 18:12:30’), 3: Timestamp(‘2021-06-17 13:30:23’), 4: Timestamp(‘2021-06-17 14:34:45’), 5: Timestamp(‘2021-06-17 19:22:02’), 6: Timestamp(‘2021-06-18 10:11:04’), 7: Timestamp(‘2021-06-18 13:25:16’), 8: Timestamp(‘2021-06-18 17:22:23’), 9: Timestamp(‘2021-06-18 18:25:53’), 10: Timestamp(‘2021-06-18 20:43:11’), 11: Timestamp(‘2021-06-21 14:03:31’), 12: Timestamp(‘2021-06-21 18:31:21’), 13: Timestamp(‘2021-06-21 20:59:54’), 14: Timestamp(‘2021-06-22 19:23:28’), 15: Timestamp(‘2021-06-23 19:52:26’), 16: Timestamp(‘2021-06-23 20:00:43’), 17: Timestamp(‘2021-06-25 13:25:30’), 18: Timestamp(‘2021-06-25 19:32:37’), 19: Timestamp(‘2021-06-28 14:00:04’)}}

Asked By: Beck

||

Answers:

Setup

print(df)

                Date1               Date2
0 2021-06-14 00:25:29 2021-06-15 10:05:50
1 2021-06-09 13:32:01 2021-06-15 14:17:30
2 2021-06-12 17:59:37 2021-06-15 18:12:30
3 2021-06-17 01:01:16 2021-06-17 13:30:23
4 2021-06-14 14:07:11 2021-06-17 14:34:45
5 2021-06-12 18:30:24 2021-06-17 19:22:02
6 2021-06-11 19:42:28 2021-06-18 10:11:04
7 2021-06-17 12:54:50 2021-06-18 13:25:16
8 2021-06-18 16:59:40 2021-06-18 17:22:23
9 2021-06-15 17:49:54 2021-06-18 18:25:53

Numpy solution

x = df['Date1'].to_numpy()
y = df['Date2'].to_numpy()

m = y[:, None] >= x
df['count'] = (np.triu(m) & ~np.triu(m, 6)).sum(1)

Result

                Date1               Date2  count
0 2021-06-14 00:25:29 2021-06-15 10:05:50      5
1 2021-06-09 13:32:01 2021-06-15 14:17:30      5
2 2021-06-12 17:59:37 2021-06-15 18:12:30      4
3 2021-06-17 01:01:16 2021-06-17 13:30:23      5
4 2021-06-14 14:07:11 2021-06-17 14:34:45      5
5 2021-06-12 18:30:24 2021-06-17 19:22:02      4
6 2021-06-11 19:42:28 2021-06-18 10:11:04      3
7 2021-06-17 12:54:50 2021-06-18 13:25:16      2
8 2021-06-18 16:59:40 2021-06-18 17:22:23      2
9 2021-06-15 17:49:54 2021-06-18 18:25:53      1

Some details

  • y[:, None] >= x creates a boolean mask by comparing each value in Date2 to every value in Date1
  • np.triu(arr, k=0) is used to select an array with the elements below the k-th diagonal zeroed.
Answered By: Shubham Sharma