python pandas extract unique dates from time series
Question:
I have a DataFrame that contains a lot of intraday data. It covers several days, and the dates are not continuous.
2012-10-08 07:12:22 0.0 0 0 2315.6 0 0.0 0
2012-10-08 09:14:00 2306.4 20 326586240 2306.4 472 2306.8 4
2012-10-08 09:15:00 2306.8 34 249805440 2306.8 361 2308.0 26
2012-10-08 09:15:01 2308.0 1 53309040 2307.4 77 2308.6 9
2012-10-08 09:15:01.500000 2308.2 1 124630140 2307.0 180 2308.4 1
2012-10-08 09:15:02 2307.0 5 85846260 2308.2 124 2308.0 9
2012-10-08 09:15:02.500000 2307.0 3 128073540 2307.0 185 2307.6 11
......
2012-10-10 07:19:30 0.0 0 0 2276.6 0 0.0 0
2012-10-10 09:14:00 2283.2 80 98634240 2283.2 144 2283.4 1
2012-10-10 09:15:00 2285.2 18 126814260 2285.2 185 2285.6 3
2012-10-10 09:15:01 2285.8 6 98719560 2286.8 144 2287.0 25
2012-10-10 09:15:01.500000 2287.0 36 144759420 2288.8 211 2289.0 4
2012-10-10 09:15:02 2287.4 6 109829280 2287.4 160 2288.6 5
......
How can I extract the unique dates, in datetime format, from the above DataFrame? The result should look like [2012-10-08, 2012-10-10].
Answers:
Using regex:
(\d{4}-\d{2}-\d{2})
Run it with the re.findall function to get all matches:
result = re.findall(r"(\d{4}-\d{2}-\d{2})", subject)
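A minimal sketch of the regex approach, assuming `subject` is the raw text of the data (the sample string below is just a few rows copied from the question). Note that the digit class must be written `\d`. The set comprehension and the conversion with `strptime` are my additions, to get unique `datetime.date` objects rather than repeated strings:

```python
import re
from datetime import datetime

# Hypothetical stand-in for `subject`: a few rows from the question as raw text
subject = """2012-10-08 07:12:22 0.0 0 0 2315.6 0 0.0 0
2012-10-08 09:14:00 2306.4 20 326586240 2306.4 472 2306.8 4
2012-10-10 07:19:30 0.0 0 0 2276.6 0 0.0 0"""

# Every YYYY-MM-DD substring, duplicates included
matches = re.findall(r"(\d{4}-\d{2}-\d{2})", subject)

# Deduplicate with a set, convert to datetime.date, and sort
unique_dates = sorted({datetime.strptime(m, "%Y-%m-%d").date() for m in matches})
print(unique_dates)
```

This only makes sense if the data is still text. If it is already in a DataFrame, the pandas methods are the more natural fit.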
If you have a Series like:
In [116]: df["Date"]
Out[116]:
0 2012-10-08 07:12:22
1 2012-10-08 09:14:00
2 2012-10-08 09:15:00
3 2012-10-08 09:15:01
4 2012-10-08 09:15:01.500000
5 2012-10-08 09:15:02
6 2012-10-08 09:15:02.500000
7 2012-10-10 07:19:30
8 2012-10-10 09:14:00
9 2012-10-10 09:15:00
10 2012-10-10 09:15:01
11 2012-10-10 09:15:01.500000
12 2012-10-10 09:15:02
Name: Date
where each object is a Timestamp:
In [117]: df["Date"][0]
Out[117]: <Timestamp: 2012-10-08 07:12:22>
you can get only the date by calling .date():
In [118]: df["Date"][0].date()
Out[118]: datetime.date(2012, 10, 8)
and Series have a .unique() method. So you can use map and a lambda:
In [126]: df["Date"].map(lambda t: t.date()).unique()
Out[126]: array([2012-10-08, 2012-10-10], dtype=object)
or use the Timestamp.date method:
In [127]: df["Date"].map(pd.Timestamp.date).unique()
Out[127]: array([2012-10-08, 2012-10-10], dtype=object)
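For anyone who wants to reproduce this outside the IPython session, here is a small self-contained sketch. The DataFrame is a made-up four-row subset of the question's timestamps, and the column name "Date" follows the session above:

```python
import datetime
import pandas as pd

# Hypothetical subset of the question's data, parsed into Timestamps
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2012-10-08 07:12:22",
        "2012-10-08 09:14:00",
        "2012-10-10 07:19:30",
        "2012-10-10 09:15:02",
    ])
})

# Map each Timestamp to its calendar date, then drop duplicates
unique_dates = df["Date"].map(lambda t: t.date()).unique()
print(list(unique_dates))
```

The result is an object-dtype NumPy array of datetime.date values, in order of first appearance.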
Just to give an alternative to @DSM's answer, look at this other answer from @Psidom.
It would be something like:
pd.to_datetime(df['DateTime']).dt.date.unique()
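As a rough sketch of that one-liner in context, assuming the 'DateTime' column still holds plain strings (the three sample rows are made up from the question's data), pd.to_datetime parses them first and the .dt accessor then exposes the date part:

```python
import datetime
import pandas as pd

# Hypothetical frame where the timestamps are still strings
df = pd.DataFrame({
    "DateTime": [
        "2012-10-08 07:12:22",
        "2012-10-08 09:14:00",
        "2012-10-10 07:19:30",
    ]
})

# Parse to datetime64, take the date part of each value, deduplicate
unique_dates = pd.to_datetime(df["DateTime"]).dt.date.unique()
print(list(unique_dates))
```

If the column is already datetime64, the pd.to_datetime call is harmless and df['DateTime'].dt.date.unique() alone is enough.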
It seems to me that it performs better: roughly four times faster than the map version in the timings below.
This is what I get on Python 3.6.8 and pandas 1.1.5:
%timeit df['date'].map(lambda d: d.date()).unique()
2.06 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['date'].dt.date.unique()
535 µs ± 79.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df['date'].dt.normalize().unique()
1.33 ms ± 229 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Output of normalize().unique():
array(['2021-04-08T00:00:00.000000000', '2021-04-07T00:00:00.000000000',
'2021-04-06T00:00:00.000000000', '2021-04-05T00:00:00.000000000',
'2021-04-04T00:00:00.000000000', '2021-04-03T00:00:00.000000000',
'2021-04-02T00:00:00.000000000', '2021-04-01T00:00:00.000000000',
..., dtype='datetime64[ns]')
Versus the output of the other two:
array([datetime.date(2021, 4, 8), datetime.date(2021, 4, 7),
datetime.date(2021, 4, 6), datetime.date(2021, 4, 5),
datetime.date(2021, 4, 4), datetime.date(2021, 4, 3),
datetime.date(2021, 4, 2), datetime.date(2021, 4, 1),
datetime.date(2021, 3, 31), datetime.date(2021, 3, 30),
..., dtype=object)
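To make the difference between the two result types concrete, here is a small sketch on made-up timestamps. The exact container and time unit that normalize().unique() returns may differ between pandas versions, so the code only relies on the dtype being some datetime64 variant:

```python
import pandas as pd

# Hypothetical sample: two timestamps on one day, one on another
s = pd.Series(pd.to_datetime([
    "2021-04-08 10:00:00",
    "2021-04-08 15:30:00",
    "2021-04-07 09:00:00",
]))

# Python datetime.date objects in an object-dtype array
as_dates = s.dt.date.unique()

# Midnight timestamps that keep a datetime64 dtype
as_midnights = s.dt.normalize().unique()

print(as_dates.dtype)
print(as_midnights.dtype)

# The midnight values can still be turned into datetime.date objects later
converted = list(pd.DatetimeIndex(as_midnights).date)
print(converted == list(as_dates))
```

Keeping the datetime64 dtype is handy if the dates feed back into pandas operations such as indexing or merging. The datetime.date objects are more convenient for display or for plain Python code.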