Calculating working days and holidays from (overlapping) date ranges in PySpark

Question:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create PySpark dataframe
columns = ["user","hiring_date","termination_date"]
data = [("A", "1995-09-08", "1997-09-09"), ("A", "2003-05-08", "2006-11-09"),
        ("A", "2000-05-06", "2003-05-09"), ("B", "2007-06-27", "2008-05-27"),
        ("C", "2003-01-20", "2006-01-19"), ("C", "2011-04-03", "2011-04-04")]

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

df = spark \
    .createDataFrame(rdd) \
    .toDF(*columns) \
    .withColumn('hiring_date', F.expr('CAST(hiring_date AS DATE)')) \
    .withColumn('termination_date', F.expr('CAST(termination_date AS DATE)'))
    
df.show()

+----+-----------+----------------+
|user|hiring_date|termination_date|
+----+-----------+----------------+
|   A| 1995-09-08|      1997-09-09|
|   A| 2003-05-08|      2006-11-09|
|   A| 2000-05-06|      2003-05-09|
|   B| 2007-06-27|      2008-05-27|
|   C| 2003-01-20|      2006-01-19|
|   C| 2011-04-03|      2011-04-04|
+----+-----------+----------------+

In the above example, I have multiple users, each with a start date hiring_date and an end date termination_date. A user can have a single row or multiple rows. In addition, users can have multiple jobs at the same time (overlapping hiring and termination date ranges).

For each user, I need to calculate the following:

  • The number of days the user was working. Overlapping dates should not be counted multiple times.
  • The number of days the user was not working (i.e., was on vacation).
Asked By: JDDS


Answers:

Full code (this is implemented in Scala, but the PySpark equivalent is very similar, if not identical):

// Required imports (in spark-shell, spark.implicits._ is already in scope)
import org.apache.spark.sql.functions._
import spark.implicits._

var ds = spark.sparkContext.parallelize(Seq(
  ("A", "1995-09-08", "1997-09-09"),
  ("A", "2003-05-08", "2006-11-09"),
  ("A", "2000-05-06", "2003-05-09"),
  ("B", "2007-06-27", "2008-05-27"),
  ("C", "2003-01-20", "2006-01-19"),
  ("C", "2011-04-03", "2011-04-04")
)).toDF("user", "hiring_date", "termination_date")

// Convert the strings to date first
ds = ds
  .withColumn("hiring_date", to_date(col("hiring_date"), "yyyy-MM-dd"))
  .withColumn("termination_date", to_date(col("termination_date"), "yyyy-MM-dd"))

// Find the working days for each employee, where we generate dates from start to end for intervals
val workDays = ds
  .withColumn("grouped", sequence(col("hiring_date"), col("termination_date")))
  .withColumn("grouped", explode(col("grouped")))
  // We drop duplicates because of the overlapping dates
  .select("user", "grouped").dropDuplicates()
  // We create an indicator, so we know later which date is holiday and which is not
  .withColumn("ind", lit(1))

// We generate a full history of the first and last date the user was working, for all jobs
val fullDays = ds
  .groupBy("user").agg(min("hiring_date").as("min"), max("termination_date").as("max"))
  .withColumn("grouped", sequence(col("min"), col("max")).as("grouped"))
  .withColumn("grouped", explode(col("grouped")))
  .select("user", "grouped")

// We join fullDays with workDays, wherever 'ind' is 1, we have workdays, otherwise non workdays
val result = fullDays.join(workDays, Seq("user", "grouped"), "left")

// We filter working days, we group by user and we count
val workingDays = result.filter(col("ind").equalTo(1)).groupBy("user").count()
// We filter non working days, we group by user and we count
val nonWorkingDays = result.filter(col("ind").isNull).groupBy("user").count()

workingDays.show(10)
+----+-----+
|user|count|
+----+-----+
|   B|  336|
|   C| 1098|
|   A| 3112|
+----+-----+

nonWorkingDays.show(10)
+----+-----+
|user|count|
+----+-----+
|   C| 1899|
|   A|  969|
+----+-----+

I hope this is what you need, good luck!

Answered By: vilalabinot

If by working days you mean excluding the weekly holidays (Sat, Sun), we can do that by generating an array of dates and then retaining only those that fall within the work week (using dayofweek).

# Required imports; `data_sdf` is the input dataframe from the question
from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

data_sdf. \
    withColumn('prev_tdt',
               func.lag('termination_date').over(wd.partitionBy('user').orderBy('hiring_date'))
               ). \
    withColumn('new_hiredt',
               func.when(func.col('prev_tdt') >= func.col('hiring_date'), func.date_add('prev_tdt', 1)).
               otherwise(func.col('hiring_date'))
               ). \
    withColumn('date_seq',
               func.expr('sequence(new_hiredt, termination_date, interval 1 day)')
               ). \
    withColumn('num_workday',
               func.size(func.expr('filter(date_seq, x -> dayofweek(x) not in (1, 7))'))
               ). \
    withColumn('tot_days', func.size('date_seq')). \
    withColumn('num_nonworkday',
               func.coalesce(func.datediff('new_hiredt', 'prev_tdt') - 1, func.lit(0))
               ). \
    groupBy('user'). \
    agg(func.sum('num_workday').alias('num_workday'),
        func.sum('num_nonworkday').alias('num_nonworkday')
        ). \
    orderBy('user'). \
    show()

# +----+-----------+--------------+
# |user|num_workday|num_nonworkday|
# +----+-----------+--------------+
# |   A|       2222|           969|
# |   B|        240|             0|
# |   C|        785|          1899|
# +----+-----------+--------------+

If you don’t want to exclude the weekly holidays, you can use the tot_days field as the number of work days. The new_hiredt column is created to get the start date for records that overlap with the previous record’s termination date.
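
For instance, a minimal sketch of that variant (assuming the chain above is assigned, right after the num_nonworkday column is added, to an intermediate dataframe named days_sdf, a name introduced here only for illustration):

# Sketch: `days_sdf` is a hypothetical name for the dataframe produced by the
# chain above, just before the groupBy. Summing tot_days counts every calendar
# day, weekends included.
days_sdf. \
    groupBy('user'). \
    agg(func.sum('tot_days').alias('num_workday'),
        func.sum('num_nonworkday').alias('num_nonworkday')
        ). \
    orderBy('user'). \
    show()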

Answered By: samkart

In case anyone is interested in the PySpark solution based on vilalabinot’s post:

# Required imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import col, count, explode, lit, max, min, sequence

# Create PySpark dataframe
columns = ["user","hiring_date","termination_date"]
data = [("A", "1995-09-08", "1997-09-09"), ("A", "2003-05-08", "2006-11-09"),
        ("A", "2000-05-06", "2003-05-09"), ("B", "2007-06-27", "2008-05-27"),
        ("C", "2003-01-20", "2006-01-19"), ("C", "2011-04-03", "2011-04-04")]

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)

df = spark \
    .createDataFrame(rdd) \
    .toDF(*columns) \
    .withColumn('hiring_date', F.expr('CAST(hiring_date AS DATE)')) \
    .withColumn('termination_date', F.expr('CAST(termination_date AS DATE)'))

# Find the working days for each employee, where we generate dates from start to end for intervals
# We drop duplicates because of the overlapping dates
# We create an indicator, so we know later which date is a holiday and which is not
work_days = df \
    .withColumn("grouped", sequence(col("hiring_date"), col("termination_date"))) \
    .withColumn("grouped", explode(col("grouped"))) \
    .select("user", "grouped").dropDuplicates() \
    .withColumn("ind", lit(1))

# We generate a full history of the first and last
# date the user was working, for all jobs
full_days = df \
    .groupBy("user") \
    .agg(min("hiring_date").alias("min"), max("termination_date").alias("max")) \
    .withColumn("grouped", sequence(col("min"), col("max")).alias("grouped")) \
    .withColumn("grouped", explode(col("grouped"))) \
    .select("user", "grouped")

# We join fullDays with workDays, wherever 'ind'
# is 1, we have workdays, otherwise non workdays
result = full_days.join(work_days, ["user", "grouped"], "left")

# We filter working days, we group by user and we count
working_days = result.filter(col("ind") == 1).groupBy("user").agg(count('user').alias('working_days'))

# We filter non working days, we group by user and we count
nonworking_days = result.filter(col("ind").isNull()).groupBy("user").agg(count('user').alias('nonworking_days'))

# Return original dataframe with new values
df_final = df \
    .select('user') \
    .dropDuplicates() \
    .join(working_days, 'user', 'left') \
    .join(nonworking_days, 'user', 'left')

df_final.show()

+----+------------+---------------+
|user|working_days|nonworking_days|
+----+------------+---------------+
|   B|           2|           null|
|   A|        3112|            969|
|   C|        1098|           1899|
+----+------------+---------------+
Answered By: JDDS

There is an easier way to solve this problem using lists of date strings. First, we need to define a function that takes a start date and an end date as parameters and returns a list of dates as strings.

#
# 0 - Create utility function
#

# required library
import pandas as pd

# define function
def expand_date_range_to_list(start_dte, end_dte):
  return pd.date_range(start=start_dte, end=end_dte).strftime("%Y-%m-%d").tolist()

# required libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *

# register df function
udf_expand_date_range_to_list = udf(expand_date_range_to_list, ArrayType(StringType()))

# register sql function
spark.udf.register("sql_expand_date_range_to_list", udf_expand_date_range_to_list)

# test function
out = expand_date_range_to_list("2022-09-01", "2022-09-05")
type(out)
out

The test call returns a Python list of date strings, one entry per day from 2022-09-01 through 2022-09-05 (five elements in this example).

The next task is to create a dataframe using the sample data. We will call the Spark user-defined function to add a new column ("date_list") to it.

#
# 1 - Create sample dataframe + view
#

# required library
from pyspark.sql.functions import *

# array of tuples - data
dat1 = [
  ("A", "1995-09-08", "1997-09-09"),
  ("A", "2003-05-08", "2006-11-09"),
  ("A", "2000-05-06", "2003-05-09"),
  ("B", "2007-06-27", "2008-05-27"),
  ("C", "2003-01-20", "2006-01-19"),
  ("C", "2011-04-03", "2011-04-04"),

]

# array of names - columns
col1 = ["user", "hiring_date", "termination_date"]

# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)

# expand date range into list of dates
df1 = df1.withColumn("date_list",
                     udf_expand_date_range_to_list(col("hiring_date"), col("termination_date")))


# make temp hive view
df1.createOrReplaceTempView("employee_data1")

# show schema
df1.printSchema()

# show data
display(df1)

(screenshot: the schema and data of df1, now including the date_list array column)

Now that we have our data, we can use Spark SQL to solve the problem. Please note, I turned the dataframe into a temporary view above.

%sql
with cte as
(
select 
  user, 
  explode(date_list) as dates 
from 
  employee_data1
)
select 
  user, 
  datediff(max(dates), min(dates)) as total_days,
  count(distinct dates) as work_days,
  datediff(max(dates), min(dates)) - count(distinct dates) + 1 as unworked_days
from cte
group by user

The explode function takes that array and creates one row per user and date. Then we can use the min, max, count distinct, and datediff functions to calculate the answer.

(screenshot: the query output with total_days, work_days, and unworked_days per user)
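
For readers who prefer the DataFrame API, a rough PySpark equivalent of the query above might look like this (a sketch, assuming the df1 with the date_list column created in the previous step):

from pyspark.sql import functions as F

# Sketch: same aggregation via the DataFrame API, using the df1 created above.
exploded = df1.select("user", F.explode("date_list").alias("dates"))

summary = exploded.groupBy("user").agg(
    F.datediff(F.max("dates"), F.min("dates")).alias("total_days"),
    F.countDistinct("dates").alias("work_days"),
    (F.datediff(F.max("dates"), F.min("dates")) - F.countDistinct("dates") + 1).alias("unworked_days")
)

summary.show()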

The hard part about holidays is that they are specific to each company. If you save the dates as a CSV file with a description and a date on each line, you can load it into a dataframe and create another temporary view from it. Then you can join that view to the result above to figure out the count of holidays.
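
As a rough sketch (the file path, column names, and view name below are assumptions for illustration, not part of the original answer):

# Hypothetical holiday file with one `holiday_date` and `description` per line.
holidays_df = spark.read.option("header", True).csv("/path/to/company_holidays.csv")
holidays_df.createOrReplaceTempView("company_holidays")

# Count how many of each user's worked dates fall on a company holiday.
spark.sql("""
    with cte as
    (
    select user, explode(date_list) as dates
    from employee_data1
    )
    select c.user, count(distinct h.holiday_date) as holiday_count
    from cte c
    join company_holidays h on c.dates = h.holiday_date
    group by c.user
""").show()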

In short, your problem is solved using an array of date strings and Spark SQL!

Answered By: CRAFTY DBA

I would check the numbers above using dataframes. They seem to be off.

(screenshot: the recalculated counts per user)

I think my solution is more elegant since you are working with sets of date strings.

Filtering for weekends is trivial using the dayofweek() function.

If you take total days (excluding weekends), then calculate distinct work days and unworked days, the first column should equal the sum of the last two columns. My answer shows that the math works!
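
For instance, a minimal sketch of that weekend filter on the exploded dates (reusing df1 and its date_list column from the earlier answer):

from pyspark.sql import functions as F

# Sketch: drop Sundays (dayofweek = 1) and Saturdays (dayofweek = 7)
# before counting each user's distinct worked dates.
weekday_counts = (
    df1.select("user", F.explode("date_list").alias("dates"))
       .where(~F.dayofweek("dates").isin(1, 7))
       .groupBy("user")
       .agg(F.countDistinct("dates").alias("work_days_excl_weekends"))
)

weekday_counts.show()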

Answered By: CRAFTY DBA