determine if datetime index is within a list of date ranges

Question:

i have the following code data…

import pandas as pd, numpy as np
from datetime import datetime
end_dt = datetime.today()
st_dt = (end_dt + pd.DateOffset(-10)).date()
df_index = pd.date_range(st_dt, end_dt)
df = pd.DataFrame(index=df_index, columns=['in_range'])

data = [pd.to_datetime(['2022-11-08','2022-11-10']), pd.to_datetime(['2022-11-13','2022-11-15'])]
dt_ranges = pd.DataFrame(data,columns={'st_dt':'datetimens[64]', 'end_dt': 'datetimens[64]'})

this produces the following two dataframes:
df:

            in_range
2022-11-08  NaN
2022-11-09  NaN
2022-11-10  NaN
2022-11-11  NaN
2022-11-12  NaN
2022-11-13  NaN
2022-11-14  NaN
2022-11-15  NaN
2022-11-16  NaN
2022-11-17  NaN
2022-11-18  NaN

and date_ranges:

    st_dt       end_dt
0   2022-11-08  2022-11-10
1   2022-11-13  2022-11-15

I would like to update the ‘in_range’ column to indicate if the index falls within any of the pairs of start and end dates of the 2nd dataframe. so i should end up with this:

            in_range
2022-11-08  True
2022-11-09  True
2022-11-10  True
2022-11-11  NaN
2022-11-12  NaN
2022-11-13  True
2022-11-14  True
2022-11-15  True
2022-11-16  NaN
2022-11-17  NaN
2022-11-18  NaN

I’ve gone down the path of trying to do this with using lambda and iteration. but to me that seems in efficient.

    def in_range(index_date, date_ranges):
        for r in date_ranges.values:
            if (r[0] >= index_date) & (r[1] <= index_date):
                return True
        return False

     df['in_range'] = df.reset_index().apply(lambda x: in_range(x.date, dt_ranges), axis=1)

the above sets in_range to NaNs always, despite the code returning the correct values. i suspect it’s because i am resetting the index and so it can not align. Also, as mentioned – this solution probably is pretty inefficient

is there a more pythonic/pandemic way of doing this?

Asked By: mike01010

||

Answers:

Use merge_asof and boolean indexing:

s = df.index.to_series()
m = (pd.merge_asof(s.rename('st_dt'), dt_ranges)
     ['end_dt'].ge(s.to_numpy()).to_numpy()
     )

df.loc[m, 'in_range'] = True

NB. The intervals in dt_ranges should be non-overlapping.

Output:

           in_range
2022-11-08     True
2022-11-09     True
2022-11-10     True
2022-11-11      NaN
2022-11-12      NaN
2022-11-13     True
2022-11-14     True
2022-11-15     True
2022-11-16      NaN
2022-11-17      NaN
2022-11-18      NaN
Answered By: mozway

One option is to compute the non-equi join is with conditional_join, which can handle overlaps:

# pip install pyjanitor
import pandas as pd
import janitor
(
df
.reset_index()
.conditional_join(
    dt_ranges, 
    ('index', 'st_dt', '>='), 
    ('index', 'end_dt', '<='), 
    # depending on your data size
    # setting use_numba to True
    # can improve performance
    # of course, this requires numba installed
    use_numba = False,
    how = 'left', 
    # performance is better when
    # sort_by_appearance is False
    sort_by_appearance=True)
.assign(in_range = lambda df: df.in_range.mask(df.st_dt.notna(), True))
.iloc[:, :2]
.set_index('index')
)

           in_range
index              
2022-11-08     True
2022-11-09     True
2022-11-10     True
2022-11-11      NaN
2022-11-12      NaN
2022-11-13     True
2022-11-14     True
2022-11-15     True
2022-11-16      NaN
2022-11-17      NaN
2022-11-18      NaN
Answered By: sammywemmy
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.