Count of the total number of rows meeting a specific condition between two different dates of another DataFrame

Question:

Be the following python pandas DataFrame:

| num_ID | start_date  | end_date   | time              |
| ------ | ----------- | ---------- | ----------------- |
| 1      | 2022-02-10  | 2022-02-11 | 0 days 09:23:00   |
| 1      | 2022-02-12  | 2022-02-15 | 2 days 12:23:00   |
| 2      | 2022-02-12  | 2022-02-15 | 2 days 10:23:00   |
| 2      | 2022-02-05  | 2022-02-27 | 22 days 02:35:00  |
| 3      | 2022-02-04  | 2022-02-06 | 1 days 19:55:00   |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:21:00   |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:15:00   |

And the following DataFrame containing consecutive dates with their respective holiday values in the is_holiday column.

| date       | is_holiday | name | other |
| ---------- | ---------- | ---- | ----- |
| 2022-01-01 | True       | ABC  | red   |
| 2022-01-02 | False      | CNA  | blue  |
...
# we assume in this case that the omitted rows have the value False in column 
| 2022-02-15 | True       | OOO  | red   |
| 2022-02-16 | True       | POO  | red   |
| 2022-02-17 | False      | KTY  | blue  |
...
| 2023-12-30 | False      | TTE  | white |
| 2023-12-31 | True       | VVV  | red   |

I want to add a new column total_days to the initial DataFrame that indicates the total holidays marked True in second DataFrame that each row passes between the two dates (start_date and end_date).

Output result example:

| num_ID | start_date  | end_date   | time              | total_days     |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1      | 2022-02-10  | 2022-02-11 | 0 days 09:23:00   | 0              |
| 1      | 2022-02-12  | 2022-02-15 | 2 days 12:23:00   | 1              |
| 2      | 2022-02-12  | 2022-02-15 | 2 days 10:23:00   | 1              |
| 2      | 2022-02-05  | 2022-02-27 | 22 days 02:35:00  | 2              |
| 3      | 2022-02-04  | 2022-02-06 | 1 days 19:55:00   | 0              |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:21:00   | 1              |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:15:00   | 1              |

Edit: The solution offered by @jezrael adds more days by grouping by previous intervals. Wrong result:

| num_ID | start_date  | end_date   | time              | total_days     |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1      | 2022-02-10  | 2022-02-11 | 0 days 09:23:00   | 0              |
| 1      | 2022-02-12  | 2022-02-15 | 2 days 12:23:00   | 3              |
| 2      | 2022-02-12  | 2022-02-15 | 2 days 10:23:00   | 3              |
| 2      | 2022-02-05  | 2022-02-27 | 22 days 02:35:00  | 2              |
| 3      | 2022-02-04  | 2022-02-06 | 1 days 19:55:00   | 0              |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:21:00   | 3              |

New Edit: The new solution offered by @jezrael offers another error:

| num_ID | start_date  | end_date   | time              | total_days     |
| ------ | ----------- | ---------- | ----------------- | -------------- |
| 1      | 2022-02-10  | 2022-02-11 | 0 days 09:23:00   | 0              |
| 1      | 2022-02-12  | 2022-02-15 | 2 days 12:23:00   | 1              |
| 2      | 2022-02-12  | 2022-02-15 | 2 days 10:23:00   | 1              |
| 2      | 2022-02-05  | 2022-02-27 | 22 days 02:35:00  | 2              |
| 3      | 2022-02-04  | 2022-02-06 | 1 days 19:55:00   | 0              |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:21:00   | 2              |
| 3      | 2022-02-12  | 2022-02-15 | 2 days 05:15:00   | 2              |
Asked By: Carola

||

Answers:

EDIT: Because need count per rows separately matched dates create date_range and count matched values by Index.isin with sum:

L = df1.loc[df1['is_holiday'], 'date'].tolist()

df['total_holidays'] = [pd.date_range(s, e).isin(L).sum() 
                        for s, e in zip(df['start_date'], df['end_date'])]
print (df)
   num_ID start_date   end_date              time  total_holidays
0       1 2022-02-10 2022-02-11   0 days 09:23:00               0
1       1 2022-02-12 2022-02-15   2 days 12:23:00               1
2       2 2022-02-12 2022-02-15   2 days 10:23:00               1
3       2 2022-02-05 2022-02-27   2 days 02:35:00               2
4       3 2022-02-04 2022-02-06   1 days 19:55:00               0
5       3 2022-02-12 2022-02-15   2 days 05:21:00               1
6       3 2022-02-12 2022-02-15   2 days 05:21:00               1

Another idea with length of indices after Index.intersection:

L = df1.loc[df1['is_holiday'], 'date'].tolist()

df['total_holidays'] = [len(pd.date_range(s, e).intersection(L)) 
                        for s, e in zip(df['start_date'], df['end_date'])]
print (df)
   num_ID start_date   end_date              time  total_holidays
0       1 2022-02-10 2022-02-11   0 days 09:23:00               0
1       1 2022-02-12 2022-02-15   2 days 12:23:00               1
2       2 2022-02-12 2022-02-15   2 days 10:23:00               1
3       2 2022-02-05 2022-02-27   2 days 02:35:00               2
4       3 2022-02-04 2022-02-06   1 days 19:55:00               0
5       3 2022-02-12 2022-02-15   2 days 05:21:00               1
6       3 2022-02-12 2022-02-15   2 days 05:21:00               1

Or intersection of sets:

sets = set(df1.loc[df1['is_holiday'], 'date'])

df['total_holidays'] = [len(set(pd.date_range(s, e)) & sets)
                        for s, e in zip(df['start_date'], df['end_date'])]
print (df)
   num_ID start_date   end_date              time  total_holidays
0       1 2022-02-10 2022-02-11   0 days 09:23:00               0
1       1 2022-02-12 2022-02-15   2 days 12:23:00               1
2       2 2022-02-12 2022-02-15   2 days 10:23:00               1
3       2 2022-02-05 2022-02-27   2 days 02:35:00               2
4       3 2022-02-04 2022-02-06   1 days 19:55:00               0
5       3 2022-02-12 2022-02-15   2 days 05:21:00               1
6       3 2022-02-12 2022-02-15   2 days 05:21:00               1
Answered By: jezrael

If your data is small, a cartesian join is fine; as your data increases, it becomes inefficient, as you are comparing every row between both dataframes. A better way is to use some form of binary search, to get your matches – conditional_join from pyjanitor offers an efficient way for non-equi joins:

# pip install pyjanitor
# you can install the dev version for latest improvements
# pip install git + https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df2.date = pd.to_datetime(df2.date)
# relevant columns
cols = [*df.columns, 'is_holiday']

out = (df
       .conditional_join(
            df2.loc[df2.is_holiday == "True"], 
            ('start_date', 'date', '<='), 
            ('end_date', 'date', '>='), 
            how = 'inner')
       .loc(axis = 1)[cols]
       .groupby(cols[:-1])
       .size()
       .rename('total_days')
       )

Merge back to the original dataframe to get the final output

(df
.merge(out, how = 'left', on = cols[:-1])
# fillna is faster on a Series
.assign(total_days = lambda df: df.total_days.fillna(0, downcast = 'infer'))
) 
   num_ID start_date   end_date              time  total_days
0       1 2022-02-14 2022-02-15   0 days 09:23:00           1
1       2 2022-02-12 2022-02-15   2 days 10:23:00           1
2       2 2022-02-05 2022-02-27  22 days 02:35:00           2
3       3 2022-02-04 2022-02-06   1 days 19:55:00           0

With the dev version, you could preselect columns and also possibly avoid the merge back to the original dataframe. At any rate, for performance, if you can, avoid a cross join.

Answered By: sammywemmy

Let me simplify the example and drop uninvolved columns:

df = pd.DataFrame({
    'num_ID': [1,1,2,2,3,3,3],
    'start_date': ['2022-02-10', '2022-02-12', '2022-02-12', '2022-02-05', '2022-02-04', '2022-02-12', '2022-02-12'],
    'end_date': ['2022-02-11', '2022-02-15', '2022-02-15', '2022-02-27', '2022-02-06', '2022-02-15', '2022-02-15']
})
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

df_dates = pd.DataFrame({
    'date': pd.date_range('2022-01-01', '2023-12-31', freq='D'),
    'is_holiday': [False] * 730
})
df_dates.loc[
    df_dates['date'].isin(pd.to_datetime(['2022-01-01', '2022-02-15', '2022-02-16', '2023-12-31'])),
    'is_holiday'
] = True

Now we have df which is initial DataFrame and df_dates – the DataFrame with information about the holidays.

First step is to set the date as an index in the second DataFrame:

df_dates.set_index('date', inplace=True)
df_dates.sort_index(inplace=True) # just in case the dates are not ordered

When it’s done, we can get the desired metric applying simple function to each row:

def get_n_holidays(row):
    # The function is pretty simple, feel free to replace it with lambda 
    return df_dates.loc[row['start_date']:row['end_date'], 'is_holiday'].sum()

df['total_days'] = df.apply(get_n_holidays, axis=1)

The result:

num_ID  start_date  end_date    total_days
0   1   2022-02-10  2022-02-11  0
1   1   2022-02-12  2022-02-15  1
2   2   2022-02-12  2022-02-15  1
3   2   2022-02-05  2022-02-27  2
4   3   2022-02-04  2022-02-06  0
5   3   2022-02-12  2022-02-15  1
6   3   2022-02-12  2022-02-15  1

NOTE: the solution assumes that all dates within the given date range are presented in df_dates

Answered By: Boris Silantev

Option 1

There are various ways to achieve OP’s goal. One would be using .apply() with a custom lambda function as follows

df['total_days'] = df.apply(lambda x: df2[(df2['date'] >= x['start_date']) & (df2['date'] <= x['end_date']) & (df2['is_holiday'] == True)].shape[0], axis=1)

[Out]:

   num_ID  start_date    end_date              time  total_days
0       1  2022-02-10  2022-02-11   0 days 09:23:00           0
1       1  2022-02-12  2022-02-15   2 days 12:23:00           1
2       2  2022-02-12  2022-02-15   2 days 10:23:00           1
3       2  2022-02-05  2022-02-27  22 days 02:35:00           2
4       3  2022-02-04  2022-02-06   1 days 19:55:00           0
5       3  2022-02-12  2022-02-15   2 days 05:21:00           1
6       3  2022-02-12  2022-02-15   2 days 05:15:00           1

Option 2

Assuming one doesn’t want to use .apply() (see the last note below), one can use a list comprehension with .query() and .itertuples()

df['total_days'] = [len([i for i in df2.query('date >= @x.start_date and date <= @x.end_date and is_holiday == True').index]) for x in df.itertuples()]

[Out]:

   num_ID start_date   end_date              time  total_days
0       1 2022-02-10 2022-02-11   0 days 09:23:00           0
1       1 2022-02-12 2022-02-15   2 days 12:23:00           1
2       2 2022-02-12 2022-02-15   2 days 10:23:00           1
3       2 2022-02-05 2022-02-27  22 days 02:35:00           2
4       3 2022-02-04 2022-02-06   1 days 19:55:00           0
5       3 2022-02-12 2022-02-15   2 days 05:21:00           1
6       3 2022-02-12 2022-02-15   2 days 05:15:00           1

Option 3

Another option, would be with .apply() and numpy.sum as follows

import numpy as np

df['total_days'] = df.apply(lambda x: np.sum((df2['date'] >= x['start_date']) & (df2['date'] <= x['end_date']) & (df2['is_holiday'] == True)), axis=1)

[Out]: 

   num_ID  start_date    end_date              time  total_days
0       1  2022-02-10  2022-02-11   0 days 09:23:00           0
1       1  2022-02-12  2022-02-15   2 days 12:23:00           1
2       2  2022-02-12  2022-02-15   2 days 10:23:00           1
3       2  2022-02-05  2022-02-27  22 days 02:35:00           2
4       3  2022-02-04  2022-02-06   1 days 19:55:00           0
5       3  2022-02-12  2022-02-15   2 days 05:21:00           1
6       3  2022-02-12  2022-02-15   2 days 05:15:00           1

Notes:

  • that each row passes between the two dates

    For that in Options 1 and 3, one is doing the following:

    • (df2['date'] >= x['start_date']) – indicates that the date in df2 is greater than or equal to the start_date in df.

    • (df2['date'] <= x['end_date']) – means that the date in df2 is less than or equal to the end_date in df.

  • holidays marked True in second DataFrame

    And for that, in option 1 and 3, one is using:

    • (df2['is_holiday'] == True)
  • pandas.DataFrame.shape returns a tuple representing the dimensionality of a given DataFrame. For this specific case one wants only the first element of it, so one uses .shape[0].

  • There are strong opinions on the usage of .apply(). Therefore, would suggest one to read this: When should I (not) want to use pandas apply() in my code?

Answered By: Gonçalo Peres
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.