Aggregate hourly data and count duplicates in pandas python


I have a dataframe df that looks like this:

time Hit/Miss
2016-09-29 08:00:00 FN
2016-09-29 08:30:00 FN
2016-09-29 09:45:00 TP
2016-10-05 14:00:00 FP

time is imported straight from a file without being set as an index, and Hit/Miss contains three categorical values FN, TP, and FP.

Now, I’d like to group the data by hour, day, and month and count how many FN, TP, FP‘s occurred during the hour. As you can see from the brief example dataset. Now, every hour has Hit/Miss values.

This is what I have so far after grouping by day, which does give me the desired output:

df['time'] = pd.to_datetime(df['time'])
df['year'] = df['time'].dt.year
df['month'] = df['time'].dt.month
df['day'] = df['time']

HM_DD = pd.crosstab([df['year'],df['month'],df['day']], df['Hit/Miss'], dropna=True)
HM_MM = pd.crosstab([df['year'],df['month']], df['Hit/Miss'], dropna=True)
HM_YYMM = pd.crosstab(df['year'], df['Hit/Miss'], dropna=True)

HM_DD = HM_DD.reset_index()

And the output looks like the snip shot desired output here:

It does show what I need to see; however, I feel like there’s a better way to do it. After all, I need to visualize the data by day and year (extra qn if you have time: What would be the best way/chart to visualize it?), so I’d like to keep the time intact for better maneuver later on, not like how I separated them into the year, month, day. I have tried setting time as an index and resample(), but that didn’t work for me.

I would appreciate the help!

Asked By: confused student



You can use offsets to aggregate your data:

HM_DD = pd.crosstab(df['time'].dt.normalize(), df['Hit/Miss'])
HM_MM = pd.crosstab(df['time'], df['Hit/Miss'])
HM_YYMM = pd.crosstab(df['time'], df['Hit/Miss'])


>>> HM_DD.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-29   2   0   1
1 2016-10-05   0   1   0

>>> HM_MM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-09-01   2   0   1
1 2016-10-01   0   1   0

>>> HM_YYMM.reset_index().rename_axis(columns=None)
        time  FN  FP  TP
0 2016-01-01   2   1   1
Answered By: Corralien