Array Averages for "Hour of the Date in Year" in Python

Question:

I have two arrays:

  1. a 3D numpy array with shape (1, 87648, 100) and dtype float64
  2. a 1D array with shape (87648,) and type pandas DatetimeIndex

The values of the 3D array along axis=1 correspond to the hourly datetimes in the 1D array. The total duration is 10 years including 2 leap years (i.e. 8760 * 8 + 8784 * 2 = 87648 hours). There is no daylight saving time, so every day has exactly 24 corresponding values.

I would like to calculate the average for each hour of the year across the 10 years' worth of data. That is, across the 10 years, I want to average all values for hour 0 of the 1st of Jan, all values for hour 1 of the 1st of Jan, …, so that I end up with 8784 averages, each being the average over 10 data points, except for the 24 hours of Feb 29th, which would each be the average over 2 data points.

To be more precise, the desired outcome is a 3D array with shape (1, 8784, 100) and dtype float64.

Calling the 3D array "volume" and the 1D datetime array "datetime_array", my incomplete last attempt was going in this direction, but I'm really puzzled by this problem:

hour_of_year = np.array([dt.hour + (dt.dayofyear - 1) * 24 for dt in datetime_array])
volume_by_hour = np.reshape(volume, (volume.shape[0], volume.shape[1] // 24, volume.shape[2], 24))
profile = np.array([np.mean(group, axis=0) for i, group in np.ndenumerate(volume)]).reshape(???)

The problem, already in the first line, is that it doesn't distinguish between the dates: the values 1416 to 1439 produced by this formula correspond to 1st March in a regular year, but to 29th Feb in a leap year.
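
For example (two arbitrary timestamps, purely for illustration):

import pandas as pd

regular = pd.Timestamp("2013-03-01 00:00")  # 1st March in a regular year
leap = pd.Timestamp("2016-02-29 00:00")     # 29th Feb in a leap year
print(regular.hour + (regular.dayofyear - 1) * 24)  # 1416
print(leap.hour + (leap.dayofyear - 1) * 24)        # also 1416, so the two dates collide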

If the leap year distinction makes it significantly more complicated, it is not that important and can be neglected.

Asked By: Dattel Klauber


Answers:

Given that you're using a pd.DatetimeIndex, you might find pandas operations more useful in this case than numpy alone. Here is an attempt:

import numpy as np
import pandas as pd

volume = np.random.rand(1, 87648, 100)  # dummy data with the question's shape
index = pd.date_range("2013-01-01", "2023-01-01", freq="H", inclusive="left")  # 87648 hourly timestamps: 10 years incl. 2 leap years

df = pd.DataFrame(
    volume.squeeze(), # Squeeze to temporarily get rid of the leading single dimension
    index=index
)

out = df.groupby(df.index.strftime("%m-%d %H")).mean()  # average all rows sharing the same "month-day hour" label

Here I'm using pd.DatetimeIndex.strftime as a way to uniquely identify the rows that you want grouped together when taking the mean, but you could also group by [df.index.month, df.index.day, df.index.hour], as shown below.
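
A minimal sketch of that alternative (it produces the same groups, just keyed by a (month, day, hour) MultiIndex instead of the formatted strings):

out_alt = df.groupby([df.index.month, df.index.day, df.index.hour]).mean()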

The output looks like:

                0         1         2         3         4   ...        95        96        97        98        99
01-01 00  0.352494  0.616882  0.475246  0.543492  0.482271  ...  0.431965  0.292609  0.593101  0.465737  0.515728
01-01 01  0.602057  0.503248  0.496831  0.561276  0.476792  ...  0.446117  0.420354  0.494491  0.433746  0.588248
01-01 02  0.574717  0.474213  0.558099  0.598167  0.512984  ...  0.511152  0.438548  0.464368  0.598788  0.478550
01-01 03  0.380682  0.680109  0.662305  0.498367  0.659267  ...  0.537061  0.617603  0.545073  0.527590  0.599664
01-01 04  0.616761  0.456948  0.700690  0.564529  0.495705  ...  0.648317  0.393420  0.479093  0.512675  0.323712
...            ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
12-31 19  0.373228  0.471034  0.506665  0.444749  0.460461  ...  0.558895  0.538552  0.389275  0.418527  0.508002
12-31 20  0.435194  0.454427  0.506929  0.431770  0.391848  ...  0.363227  0.558908  0.607851  0.494579  0.473551
12-31 21  0.526382  0.558862  0.560605  0.357882  0.319049  ...  0.568854  0.443583  0.421765  0.475142  0.480418
12-31 22  0.628438  0.367111  0.629999  0.501194  0.499882  ...  0.391688  0.274963  0.417083  0.433642  0.554901
12-31 23  0.511908  0.570115  0.379889  0.492934  0.572257  ...  0.538664  0.675786  0.477229  0.535941  0.518781

[8784 rows x 100 columns]

You can get it back as a numpy array with a leading singleton dimension:

out = out.to_numpy()[None]
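
As a quick sanity check, the result now has the shape described in the question:

print(out.shape)  # (1, 8784, 100)
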
Answered By: Chrysophylaxs