Array Averages for "Hour of the Date in Year" in Python
Question:
I have two arrays:
- a 3D numpy array with shape (1, 87648, 100) and dtype float64
- a 1D array with shape (87648,) and type pandas DatetimeIndex
The values of the 3D array along axis=1 correspond to the hourly sequence of datetimes in the 1D array. The total duration is 10 years with 2 leap years (i.e. 8760 * 8 + 8784 * 2 = 87648). There is no daylight saving, so every day has exactly 24 corresponding values.
I would like to calculate the average for each hour of the year across the 10 years' worth of data. That is, across the 10 years, I want to average all hour 0 of the 1st of Jan, all hour 1 of the 1st of Jan, …, such that I have 8784 averages at the end, each being the average over 10 data points, except for the 24 hours of Feb 29th, which would each be the average over 2 data points.
To be precise, the desired outcome is a 3D array with shape (1, 8784, 100) and dtype float64.
Let the 3D array be called "volume" and the 1D datetime array "datetime_array", my incomplete last attempt was going in this direction, but I’m really puzzled with this problem:
hour_of_year = np.array([dt.hour + (dt.dayofyear - 1) * 24 for dt in datetime_array])
volume_by_hour = np.reshape(volume, (volume.shape[0], volume.shape[1] / 24, volume.shape[2], 24))
profile = np.array([np.mean(group, axis=0) for i, group in np.ndenumerate(volume)]).reshape(???)
The problem here in the first line already is that it doesn’t distinguish between the dates. So the hour 1417 to 1440 in a regular year corresponds to 1st March, whereas that is 29th Feb in a leap year.
If the leap year distinction makes it significantly more complicated, it is not that important and can be neglected.
Answers:
Given that you’re using a pd.DatetimeIndex, you might find pandas operations more useful in this case than using numpy only. Here is an attempt:
import numpy as np
import pandas as pd
volume = np.random.rand(1, 87648, 100)
index = pd.date_range("2013-01-01", "2023-01-01", freq="h", inclusive="left")  # "h" replaces the "H" alias deprecated in pandas 2.2
df = pd.DataFrame(
    volume.squeeze(),  # Squeeze to temporarily get rid of the leading single dimension
    index=index,
)
out = df.groupby(df.index.strftime("%m-%d %H")).mean()
Here I’m using pd.DatetimeIndex.strftime as a way to uniquely identify the rows that you want to be grouped together when taking the mean, but you could also use [df.index.month, df.index.day, df.index.hour].
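As a rough sketch of that alternative, the snippet below groups by the three integer fields instead of a formatted string and checks that both approaches give the same averages. The 4-year index, the 3-column random data, and the names by_string and by_fields are assumptions made only to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical small data set: 4 years of hourly values (2020 is a leap year), 3 columns.
index = pd.date_range("2019-01-01", "2022-12-31 23:00", freq="h")
df = pd.DataFrame(np.random.rand(len(index), 3), index=index)

# Grouping key as a formatted string, as in the answer above.
by_string = df.groupby(df.index.strftime("%m-%d %H")).mean()

# Grouping key as three integer fields; the result carries a 3-level MultiIndex.
by_fields = df.groupby([df.index.month, df.index.day, df.index.hour]).mean()

print(by_fields.shape)
print(np.allclose(by_string.to_numpy(), by_fields.to_numpy()))
```

Both keys sort in the same calendar order, so the rows line up one to one. The integer version avoids formatting every timestamp as a string, which may matter for long indexes.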
The output looks like:
                 0         1         2         3         4  ...        95        96        97        98        99
01-01 00  0.352494  0.616882  0.475246  0.543492  0.482271  ...  0.431965  0.292609  0.593101  0.465737  0.515728
01-01 01  0.602057  0.503248  0.496831  0.561276  0.476792  ...  0.446117  0.420354  0.494491  0.433746  0.588248
01-01 02  0.574717  0.474213  0.558099  0.598167  0.512984  ...  0.511152  0.438548  0.464368  0.598788  0.478550
01-01 03  0.380682  0.680109  0.662305  0.498367  0.659267  ...  0.537061  0.617603  0.545073  0.527590  0.599664
01-01 04  0.616761  0.456948  0.700690  0.564529  0.495705  ...  0.648317  0.393420  0.479093  0.512675  0.323712
...            ...       ...       ...       ...       ...  ...       ...       ...       ...       ...       ...
12-31 19  0.373228  0.471034  0.506665  0.444749  0.460461  ...  0.558895  0.538552  0.389275  0.418527  0.508002
12-31 20  0.435194  0.454427  0.506929  0.431770  0.391848  ...  0.363227  0.558908  0.607851  0.494579  0.473551
12-31 21  0.526382  0.558862  0.560605  0.357882  0.319049  ...  0.568854  0.443583  0.421765  0.475142  0.480418
12-31 22  0.628438  0.367111  0.629999  0.501194  0.499882  ...  0.391688  0.274963  0.417083  0.433642  0.554901
12-31 23  0.511908  0.570115  0.379889  0.492934  0.572257  ...  0.538664  0.675786  0.477229  0.535941  0.518781
[8784 rows x 100 columns]
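To check that the grouping matches what the question asks for (10 data points per hour, 2 for each hour of Feb 29th), one option is to look at the group sizes alongside the means. This is a minimal sketch; it uses 4 columns instead of 100 to keep it light, and the names keys and counts are illustrative.

```python
import numpy as np
import pandas as pd

# Same 10-year hourly span as in the question (2016 and 2020 are leap years).
index = pd.date_range("2013-01-01", "2022-12-31 23:00", freq="h")
df = pd.DataFrame(np.random.rand(len(index), 4), index=index)  # 4 columns, not 100

keys = df.index.strftime("%m-%d %H")
out = df.groupby(keys).mean()
counts = df.groupby(keys).size()  # number of rows averaged in each group

print(len(index))           # 87648
print(out.shape)            # (8784, 4)
print(counts["01-01 00"])   # 10
print(counts["02-29 00"])   # 2
```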
You can get it back as a numpy array with a leading singleton dimension:
out = out.to_numpy()[None]
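If you would rather do the averaging itself in numpy, here is one possible sketch. It is not part of the approach above: it builds an integer key per timestamp, then sums and counts the rows per key with np.unique and np.add.at. It assumes a single leading dimension, as in the question, and uses 4 columns instead of 100; the names key, sums and profile are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2013-01-01", "2022-12-31 23:00", freq="h")
volume = np.random.rand(1, len(index), 4)  # stand-in for the (1, 87648, 100) array

# One integer per timestamp (month, day, hour packed together), so that the
# same calendar hour in different years shares a key.
key = (
    np.asarray(index.month) * 10000
    + np.asarray(index.day) * 100
    + np.asarray(index.hour)
)
unique_keys, inverse, counts = np.unique(key, return_inverse=True, return_counts=True)

# Accumulate the rows of each group, then divide by the group size.
sums = np.zeros((unique_keys.size, volume.shape[2]))
np.add.at(sums, inverse, volume[0])
profile = (sums / counts[:, None])[None]  # restore the leading singleton dimension

print(profile.shape)  # (1, 8784, 4)
```

np.unique returns the keys sorted in ascending order, which here is calendar order, so the rows should match the pandas result row for row.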