pandas filtering and comparing dates
Question:
I have a SQL file containing the data below, which I read into pandas:
df = pandas.read_sql('Database count details', con=engine,
                     index_col='id', parse_dates=['newest_date_available'])
Output
id code newest_date_available
9793708 3514 2015-12-24
9792282 2399 2015-12-25
9797602 7452 2015-12-25
9804367 9736 2016-01-20
9804438 9870 2016-01-20
The next line of code is to get last week’s date
date_before = datetime.date.today() - datetime.timedelta(days=7) # Which is 2016-01-20
What I am trying to do is compare date_before with df and print out all rows whose date is earlier than date_before:
if (df['newest_date_available'] < date_before):
    print(#all rows)
As expected, this raises an error:
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
How should I do this?
Answers:
I would use a boolean mask:
a = df[df['newest_date_available'] < date_before]
If date_before = datetime.date(2016, 1, 19), this returns:
id code newest_date_available
0 9793708 3514 2015-12-24
1 9792282 2399 2015-12-25
2 9797602 7452 2015-12-25
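To see why the mask works where the if statement fails, here is a minimal self-contained sketch. It assumes a hand-built copy of the question's table in place of the read_sql call, and it uses a pd.Timestamp cutoff (my choice here, so that both sides of the comparison are datetimes):

```python
import pandas as pd

# Hand-built copy of the question's data (stands in for the read_sql call).
df = pd.DataFrame(
    {
        "id": [9793708, 9792282, 9797602, 9804367, 9804438],
        "code": [3514, 2399, 7452, 9736, 9870],
        "newest_date_available": pd.to_datetime(
            ["2015-12-24", "2015-12-25", "2015-12-25", "2016-01-20", "2016-01-20"]
        ),
    }
)

date_before = pd.Timestamp(2016, 1, 19)

# The comparison is element-wise: it yields one True/False per row.
mask = df["newest_date_available"] < date_before

# A Series with several values has no single truth value, so `if mask:` fails.
try:
    bool(mask)
except ValueError as err:
    print("ValueError:", err)

# Indexing with the boolean Series keeps only the rows marked True.
older_rows = df[mask]
print(older_rows)
```

The point is that the filtering happens through indexing with the mask, not through a Python-level if.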
Using datetime.date(2019, 1, 10) works because pandas coerces the date to a datetime under the hood. However, this will no longer be the case in future versions of pandas. From version 0.24 onward, it issues a warning:
FutureWarning: Comparing Series of datetimes with 'datetime.date'.
Currently, the 'datetime.date' is coerced to a datetime. In the future
pandas will not coerce, and a TypeError will be raised.
The better solution is to use pd.Timestamp, which the official documentation describes as pandas' replacement for Python's datetime.datetime object.
To provide an example referencing OP's initial dataset, this is how you would use it:
import pandas as pd
# Build the boolean condition against a Timestamp rather than a datetime.date
cond1 = df.newest_date_available < pd.Timestamp(2016, 1, 10)
df.loc[cond1]
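If the cutoff starts out as a datetime.date, as in the question, one option is to convert it explicitly before comparing. The sketch below is only an illustration: week_before is a hypothetical helper name, the two-row frame is made up from the question's dates, and the "today" value is fixed so the result is reproducible.

```python
import datetime

import pandas as pd

# Two rows taken from the question's data, enough to show the filter.
df = pd.DataFrame(
    {"newest_date_available": pd.to_datetime(["2015-12-24", "2016-01-20"])}
)


def week_before(day: datetime.date) -> pd.Timestamp:
    """Hypothetical helper: midnight seven days before `day` as a Timestamp."""
    return pd.Timestamp(day) - pd.Timedelta(days=7)


# Fixed date for reproducibility; pass datetime.date.today() in real use.
cutoff = week_before(datetime.date(2016, 1, 27))

# Both sides are now datetime-like, so no date-to-datetime coercion is needed.
older = df.loc[df["newest_date_available"] < cutoff]
print(older)
```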
A bit late to the party but I think it is worth mentioning. If you are looking for a solution which dynamically considers the date a week ago, this might be helpful:
In [3]: df = pd.DataFrame({'alpha': list('ABCDE'), 'num': range(5), 'date': pd.date_range('2022-06-30', '2022-07-04')})
In [4]: df
Out[4]:
alpha num date
0 A 0 2022-06-30
1 B 1 2022-07-01
2 C 2 2022-07-02
3 D 3 2022-07-03
4 E 4 2022-07-04
In [5]: df.query('date < "%s"' % (pd.Timestamp.now().normalize() - pd.Timedelta(7, 'd')))
Out[5]:
alpha num date
0 A 0 2022-06-30
1 B 1 2022-07-01
Explanation:
I created a new df with newer dates. Today is 2022-07-09 (pd.Timestamp.now().normalize()) and seven days ago it was 2022-07-02 (pd.Timestamp.now().normalize() - pd.Timedelta(7, 'd')). The string formatting operator % inserts that cutoff into the expression, so query() returns only those observations where the dates in column date are earlier than 2022-07-02.
normalize() is important here because it resets the time to midnight. Otherwise query() would also return observations equal to 2022-07-02, because:
# Timestamp('2022-07-09 17:53:03.078172') > Timestamp('2022-07-09 00:00:00')
In [6]: pd.Timestamp.now() > pd.Timestamp.now().normalize()
Out[6]: True
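As a variation on the same idea, query() can also refer to a local Python variable with the @ prefix, which avoids building the expression with string formatting. This is a sketch rather than part of the answer above: the variable names today and cutoff are my own, and "today" is pinned to the timestamp used in the explanation so the result is reproducible.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "alpha": list("ABCDE"),
        "num": range(5),
        "date": pd.date_range("2022-06-30", "2022-07-04"),
    }
)

# Pinned "today"; replace with pd.Timestamp.now() for a live cutoff.
today = pd.Timestamp("2022-07-09 17:53:03")

# normalize() resets the time to midnight before subtracting seven days.
cutoff = today.normalize() - pd.Timedelta(days=7)

# `@cutoff` tells query() to look up the local variable named cutoff.
result = df.query("date < @cutoff")
print(result)
```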