Aggregating a feature based on historical data, excluding the current record's date

Question:

I'm wondering if anyone here can help or point me in the right direction. I'm fairly new to programming, so any assistance is appreciated.

I'm currently doing some feature extraction for a project of mine, and I'm trying to create some aggregate features.

One particular feature I'm trying to create can only take historical records into account, excluding the date of the actual record. So far I've been using groupby and cumcount, but I'm struggling to get what I want. Please see below:

df['Cum Count'] = df.sort_values('Time').groupby(['ID']).cumcount()
Time ID Cum Count desired result
03/04/2016 15:35 1234567 0 0
05/04/2016 14:40 1234567 1 1
05/04/2016 17:30 1234567 2 1
08/04/2016 17:05 1234567 3 3
08/04/2016 18:10 1234567 4 3
09/04/2016 17:45 1234567 5 5
15/04/2016 17:25 1234567 6 6
15/04/2016 19:55 1234567 7 6
20/04/2016 17:25 1234567 8 8
20/04/2016 19:25 1234567 9 8
22/04/2016 18:10 1234567 10 10
25/04/2016 14:15 1234567 11 11
25/04/2016 14:45 1234567 12 11
27/04/2016 18:40 1234567 13 13
28/04/2016 18:05 1234567 14 14
04/05/2016 14:45 1234567 15 15
04/05/2016 15:15 1234567 16 15
Asked By: Ger Gleeson
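[Editor's note: for readers following along, a minimal sketch of what the `cumcount` line above produces, using a hypothetical three-row frame with the same shape as the question's data:]

```python
import pandas as pd

# Hypothetical three-row frame mirroring the question's setup
df = pd.DataFrame({
    'Time': pd.to_datetime(['2016-04-03 15:35', '2016-04-05 14:40',
                            '2016-04-05 17:30']),
    'ID': [1234567] * 3,
})

# cumcount numbers each row 0, 1, 2, ... within its ID group,
# so two records on the same date still get distinct values
df['Cum Count'] = df.sort_values('Time').groupby(['ID']).cumcount()
# → [0, 1, 2]; the question wants [0, 1, 1]
```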


Answers:

Try:

# parse Time as datetime; the dates are dd/mm/yyyy, so dayfirst=True is needed:
df['Time'] = pd.to_datetime(df['Time'], dayfirst=True)

df['desired result 2'] = df.groupby(['ID', df['Time'].dt.date], sort=False)['Cum Count'].transform('first')

print(df)

Prints:

                  Time       ID  Cum Count  desired result  desired result 2
0  2016-04-03 15:35:00  1234567          0               0                 0
1  2016-04-05 14:40:00  1234567          1               1                 1
2  2016-04-05 17:30:00  1234567          2               1                 1
3  2016-04-08 17:05:00  1234567          3               3                 3
4  2016-04-08 18:10:00  1234567          4               3                 3
5  2016-04-09 17:45:00  1234567          5               5                 5
6  2016-04-15 17:25:00  1234567          6               6                 6
7  2016-04-15 19:55:00  1234567          7               6                 6
8  2016-04-20 17:25:00  1234567          8               8                 8
9  2016-04-20 19:25:00  1234567          9               8                 8
10 2016-04-22 18:10:00  1234567         10              10                10
11 2016-04-25 14:15:00  1234567         11              11                11
12 2016-04-25 14:45:00  1234567         12              11                11
13 2016-04-27 18:40:00  1234567         13              13                13
14 2016-04-28 18:05:00  1234567         14              14                14
15 2016-05-04 14:45:00  1234567         15              15                15
16 2016-05-04 15:15:00  1234567         16              15                15

If you want just a group number, you can use .ngroup():

df['group number'] = df.groupby(['ID', df['Time'].dt.date], sort=False).ngroup()
print(df)

Prints:

                  Time       ID  Cum Count  desired result  group number
0  2016-04-03 15:35:00  1234567          0               0             0
1  2016-04-05 14:40:00  1234567          1               1             1
2  2016-04-05 17:30:00  1234567          2               1             1
3  2016-04-08 17:05:00  1234567          3               3             2
4  2016-04-08 18:10:00  1234567          4               3             2
5  2016-04-09 17:45:00  1234567          5               5             3
6  2016-04-15 17:25:00  1234567          6               6             4
7  2016-04-15 19:55:00  1234567          7               6             4
8  2016-04-20 17:25:00  1234567          8               8             5
9  2016-04-20 19:25:00  1234567          9               8             5
10 2016-04-22 18:10:00  1234567         10              10             6
11 2016-04-25 14:15:00  1234567         11              11             7
12 2016-04-25 14:45:00  1234567         12              11             7
13 2016-04-27 18:40:00  1234567         13              13             8
14 2016-04-28 18:05:00  1234567         14              14             9
15 2016-05-04 14:45:00  1234567         15              15            10
16 2016-05-04 15:15:00  1234567         16              15            10
Answered By: Andrej Kesely
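[Editor's note: the two techniques in this answer can be reproduced end to end. The sketch below uses a shortened, hypothetical four-row frame in the question's dd/mm/yyyy format; it also shows that the desired column can be derived in one step, without a precomputed `Cum Count`, as the overall per-ID running count minus the running count within each (ID, date) group:]

```python
import pandas as pd

# Hypothetical four-row sample in the question's dd/mm/yyyy format
df = pd.DataFrame({
    'Time': ['03/04/2016 15:35', '05/04/2016 14:40',
             '05/04/2016 17:30', '08/04/2016 17:05'],
    'ID': [1234567] * 4,
})
df['Time'] = pd.to_datetime(df['Time'], dayfirst=True)
df = df.sort_values('Time')

# One-step alternative: the overall running count per ID, minus the
# running count within each (ID, date), counts only earlier-dated rows
df['desired result'] = (df.groupby('ID').cumcount()
                        - df.groupby(['ID', df['Time'].dt.date]).cumcount())
# → [0, 1, 1, 3]

# Group label per (ID, date), numbered in order of first appearance
df['group number'] = df.groupby(['ID', df['Time'].dt.date], sort=False).ngroup()
# → [0, 1, 1, 2]
```

Because `sort=False` numbers groups in order of first appearance, the labels increase monotonically when the frame is already sorted by time.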