Computing the mean for Python datetime
Question:
I have a datetime attribute:
import datetime
import numpy as np
import pandas as pd

d = {
    'DOB': pd.Series([
        datetime.datetime(2014, 7, 9),
        datetime.datetime(2014, 7, 15),
        np.datetime64('NaT')
    ], index=['a', 'b', 'c'])
}
df_test = pd.DataFrame(d)
I would like to compute the mean for that attribute. Running mean() causes an error:
TypeError: reduction operation 'mean' not allowed for this dtype
I also tried the solution proposed elsewhere. It doesn’t work as running the function proposed there causes
OverflowError: Python int too large to convert to C long
What would you propose? The result for the above dataframe should be equivalent to
datetime.datetime(2014, 7, 12).
Answers:
Datetime math supports some standard operations:
a = datetime.datetime(2014, 7, 9)
b = datetime.datetime(2014, 7, 15)
c = (b - a)/2
# here c will be datetime.timedelta(3)
a + c
Out[7]: datetime.datetime(2014, 7, 12, 0, 0)
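So you can write a function that, given two datetimes, subtracts the lesser from the greater and adds half of the difference to the lesser. Note that this midpoint trick only covers two values; for a whole column, average the differences from a reference date instead. Apply this function to your dataframe, and shazam!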
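A minimal sketch of how the same idea could extend to more than two values, using only the standard library. The name mean_datetime is made up for illustration, and skipping None entries (standing in for missing dates) is an assumption, not something the answer above specifies.

```python
import datetime

def mean_datetime(values):
    """Average datetimes by averaging their offsets from the earliest one."""
    # Assumption: missing dates are represented as None and are ignored.
    present = [v for v in values if v is not None]
    if not present:
        return None
    base = min(present)
    # Summing timedeltas needs an explicit timedelta start value.
    total = sum((v - base for v in present), datetime.timedelta(0))
    return base + total / len(present)

dates = [
    datetime.datetime(2014, 7, 9),
    datetime.datetime(2014, 7, 15),
    None,
]
result = mean_datetime(dates)
```

With the two dates from the question, result should be datetime.datetime(2014, 7, 12, 0, 0).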
You could work with Unix time if you want. This is defined as the total number of seconds (for instance) since 1970-01-01. With that, all of your times are simply floats, so it’s very easy to do simple math on the columns.
import pandas as pd
df_test['unix_time'] = (df_test.DOB - pd.to_datetime('1970-01-01')).dt.total_seconds()
df_test['unix_time'].mean()
#1405123200.0
# You want it in date, so just convert back
pd.to_datetime(df_test['unix_time'].mean(), origin='unix', unit='s')
#Timestamp('2014-07-12 00:00:00')
You can take the mean of Timedelta. So find the minimum value and subtract it from the series to get a series of Timedelta. Then take the mean and add it back to the minimum.
dob = df_test.DOB
m = dob.min()
(m + (dob - m).mean()).to_pydatetime()
datetime.datetime(2014, 7, 12, 0, 0)
As a one-liner:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(d.min())).to_pydatetime()
Here I use the epoch pd.Timestamp(0) instead of min:
df_test.DOB.pipe(lambda d: (lambda m: m + (d - m).mean())(pd.Timestamp(0))).to_pydatetime()
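If the lambdas are hard to read, the same Timedelta approach could be wrapped in a small helper, roughly like this. The function name datetime_mean and the choice to return pd.NaT for a column with no valid dates are assumptions made for this sketch.

```python
import pandas as pd

def datetime_mean(series):
    """Mean of a datetime Series via Timedelta offsets from its minimum."""
    valid = series.dropna()
    if valid.empty:
        # Assumption: an all-missing column yields NaT rather than an error.
        return pd.NaT
    base = valid.min()
    # (valid - base) is a Timedelta Series, which supports mean().
    return base + (valid - base).mean()

dob = pd.Series(
    [pd.Timestamp('2014-07-09'), pd.Timestamp('2014-07-15'), pd.NaT],
    index=['a', 'b', 'c'],
)
mean_dob = datetime_mean(dob)
```

Call mean_dob.to_pydatetime() if you need a plain datetime.datetime, as in the snippets above.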
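You can convert to epoch time (integer nanoseconds for a datetime64[ns] column) using astype with np.int64, take the mean, and convert back to datetime with pd.to_datetime. Drop the NaT values first so they do not distort the result: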
pd.to_datetime(df_test.DOB.dropna().astype(np.int64).mean())
Output:
Timestamp('2014-07-12 00:00:00')
As of pandas 0.25, it is possible to compute the mean of a datetime series directly.
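In [1]: import datetime
   ...: import pandas as pd
   ...: import numpy as np
In [2]: # pd.datetime was deprecated in pandas 1.0 and later removed; use datetime.datetime
   ...: s = pd.Series([
   ...:     datetime.datetime(2014, 7, 9),
   ...:     datetime.datetime(2014, 7, 15),
   ...:     np.datetime64('NaT')])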
In [3]: s.mean()
Out[3]: Timestamp('2014-07-12 00:00:00')
However, note that at the time of writing, calling mean on a whole DataFrame skips datetime columns, so take the mean of the column (Series) itself.
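One way to get the means of all datetime columns without relying on what DataFrame.mean does in your pandas version is to select those columns and call Series.mean on each. This is only a sketch: the extra score column and the datetime_means name are invented for the example, and it assumes pandas 0.25 or later.

```python
import pandas as pd

df_test = pd.DataFrame({
    'DOB': pd.Series(
        [pd.Timestamp('2014-07-09'), pd.Timestamp('2014-07-15'), pd.NaT],
        index=['a', 'b', 'c'],
    ),
    # Hypothetical numeric column, only here to show it is left out.
    'score': pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c']),
})

# Pick out the datetime columns, then average each one as a Series.
datetime_cols = df_test.select_dtypes(include=['datetime64']).columns
datetime_means = {col: df_test[col].mean() for col in datetime_cols}
```

Here datetime_means maps each datetime column name to a Timestamp, and NaT entries are skipped by Series.mean.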