Aggregating a feature based on historical data, excluding the current record's date
Question:
I'm wondering if anyone here can help or point me in the right direction. I'm fairly new to programming, so any assistance is appreciated.
I'm currently doing some feature extraction for a project of mine, and I'm trying to create some aggregator features.
One particular feature I'm trying to create can only take into account historical records, excluding the date of the actual record. So far I've been using groupby and cumcount, but I'm struggling to get what I want. Please see below:
df['Cum Count'] = df.sort_values('Time').groupby(['ID']).cumcount()
Time | ID | Cum Count | desired result |
---|---|---|---|
03/04/2016 15:35 | 1234567 | 0 | 0 |
05/04/2016 14:40 | 1234567 | 1 | 1 |
05/04/2016 17:30 | 1234567 | 2 | 1 |
08/04/2016 17:05 | 1234567 | 3 | 3 |
08/04/2016 18:10 | 1234567 | 4 | 3 |
09/04/2016 17:45 | 1234567 | 5 | 5 |
15/04/2016 17:25 | 1234567 | 6 | 6 |
15/04/2016 19:55 | 1234567 | 7 | 6 |
20/04/2016 17:25 | 1234567 | 8 | 8 |
20/04/2016 19:25 | 1234567 | 9 | 8 |
22/04/2016 18:10 | 1234567 | 10 | 10 |
25/04/2016 14:15 | 1234567 | 11 | 11 |
25/04/2016 14:45 | 1234567 | 12 | 11 |
27/04/2016 18:40 | 1234567 | 13 | 13 |
28/04/2016 18:05 | 1234567 | 14 | 14 |
04/05/2016 14:45 | 1234567 | 15 | 15 |
04/05/2016 15:15 | 1234567 | 16 | 15 |
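For reference, a minimal reproducible sketch of the setup above (only the first few rows of the table are included, and dayfirst=True is assumed because the timestamps are day-first):
import pandas as pd

# a few rows from the table above, single ID
df = pd.DataFrame({
    'Time': ['03/04/2016 15:35', '05/04/2016 14:40', '05/04/2016 17:30',
             '08/04/2016 17:05', '08/04/2016 18:10'],
    'ID': [1234567] * 5,
})
df['Time'] = pd.to_datetime(df['Time'], dayfirst=True)  # day/month/year strings
df = df.sort_values('Time')
df['Cum Count'] = df.groupby('ID').cumcount()  # running count per ID, including same-day rows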
Answers:
Try:
import pandas as pd

# transform Time to datetime (if necessary):
df['Time'] = pd.to_datetime(df['Time'])
# take the first 'Cum Count' within each (ID, day) group, i.e. the count of that ID's records on earlier days
df['desired result 2'] = df.groupby(['ID', df['Time'].dt.date], sort=False)['Cum Count'].transform('first')
print(df)
Prints:
Time ID Cum Count desired result desired result 2
0 2016-03-04 15:35:00 1234567 0 0 0
1 2016-05-04 14:40:00 1234567 1 1 1
2 2016-05-04 17:30:00 1234567 2 1 1
3 2016-08-04 17:05:00 1234567 3 3 3
4 2016-08-04 18:10:00 1234567 4 3 3
5 2016-09-04 17:45:00 1234567 5 5 5
6 2016-04-15 17:25:00 1234567 6 6 6
7 2016-04-15 19:55:00 1234567 7 6 6
8 2016-04-20 17:25:00 1234567 8 8 8
9 2016-04-20 19:25:00 1234567 9 8 8
10 2016-04-22 18:10:00 1234567 10 10 10
11 2016-04-25 14:15:00 1234567 11 11 11
12 2016-04-25 14:45:00 1234567 12 11 11
13 2016-04-27 18:40:00 1234567 13 13 13
14 2016-04-28 18:05:00 1234567 14 14 14
15 2016-04-05 14:45:00 1234567 15 15 15
16 2016-04-05 15:15:00 1234567 16 15 15
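Note that the timestamps in the question are day-first (dd/mm/yyyy), so pd.to_datetime without dayfirst=True parses ambiguous values month-first, which is why 03/04/2016 appears as 2016-03-04 in the output above. If that matters for your data, a small adjustment keeps the dates, and therefore the daily groups, correct:
df['Time'] = pd.to_datetime(df['Time'], dayfirst=True)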
If you want just a group number, you can use .ngroup():
df['group number'] = df.groupby(['ID', df['Time'].dt.date], sort=False).ngroup()
print(df)
Prints:
Time ID Cum Count desired result group number
0 2016-03-04 15:35:00 1234567 0 0 0
1 2016-05-04 14:40:00 1234567 1 1 1
2 2016-05-04 17:30:00 1234567 2 1 1
3 2016-08-04 17:05:00 1234567 3 3 2
4 2016-08-04 18:10:00 1234567 4 3 2
5 2016-09-04 17:45:00 1234567 5 5 3
6 2016-04-15 17:25:00 1234567 6 6 4
7 2016-04-15 19:55:00 1234567 7 6 4
8 2016-04-20 17:25:00 1234567 8 8 5
9 2016-04-20 19:25:00 1234567 9 8 5
10 2016-04-22 18:10:00 1234567 10 10 6
11 2016-04-25 14:15:00 1234567 11 11 7
12 2016-04-25 14:45:00 1234567 12 11 7
13 2016-04-27 18:40:00 1234567 13 13 8
14 2016-04-28 18:05:00 1234567 14 14 9
15 2016-04-05 14:45:00 1234567 15 15 10
16 2016-04-05 15:15:00 1234567 16 15 10
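If you want the feature directly (the number of records for the same ID on strictly earlier dates) without building Cum Count first, one possible sketch is to count rows per (ID, day) and take a shifted cumulative sum per ID; the date, n and prior names below are just illustrative:
# assumes df['Time'] is already datetime and df is sorted by 'Time'
df['date'] = df['Time'].dt.date
per_day = df.groupby(['ID', 'date']).size().rename('n').reset_index()
# rows for the same ID on strictly earlier dates
per_day['prior'] = per_day.groupby('ID')['n'].cumsum() - per_day['n']
df = df.merge(per_day[['ID', 'date', 'prior']], on=['ID', 'date'], how='left')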