Calculating working days and holidays from (overlapping) date ranges in PySpark
Question:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
# Create PySpark dataframe
columns = ["user","hiring_date","termination_date"]
data = [("A", "1995-09-08", "1997-09-09"), ("A", "2003-05-08", "2006-11-09"),
("A", "2000-05-06", "2003-05-09"), ("B", "2007-06-27", "2008-05-27"),
("C", "2003-01-20", "2006-01-19"), ("C", "2011-04-03", "2011-04-04")]
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df = (
    spark
    .createDataFrame(rdd)
    .toDF(*columns)
    .withColumn('hiring_date', F.expr('CAST(hiring_date AS DATE)'))
    .withColumn('termination_date', F.expr('CAST(termination_date AS DATE)'))
)
df.show()
+----+-----------+----------------+
|user|hiring_date|termination_date|
+----+-----------+----------------+
| A| 1995-09-08| 1997-09-09|
| A| 2003-05-08| 2006-11-09|
| A| 2000-05-06| 2003-05-09|
| B| 2007-06-27| 2008-05-27|
| C| 2003-01-20| 2006-01-19|
| C| 2011-04-03| 2011-04-04|
+----+-----------+----------------+
In the above example, I have multiple users with a start date hiring_date and an end date termination_date. Per user, there can be a single row or multiple rows. In addition, users can have multiple jobs at the same time (overlapping termination and hiring dates).
For each user, I need to calculate the following:
- The number of days the user was working. Overlapping dates should not be counted multiple times.
- The number of days the user was not working (i.e., was on vacation).
Answers:
Full code (this is implemented in Scala but it is very similar if not identical to Python):
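// Imports needed when running outside spark-shell
import org.apache.spark.sql.functions._
import spark.implicits._
var ds = spark.sparkContext.parallelize(Seq(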
("A", "1995-09-08", "1997-09-09"),
("A", "2003-05-08", "2006-11-09"),
("A", "2000-05-06", "2003-05-09"),
("B", "2007-06-27", "2008-05-27"),
("C", "2003-01-20", "2006-01-19"),
("C", "2011-04-03", "2011-04-04"),
)).toDF("user", "hiring_date", "termination_date")
// Convert the strings to date first
ds = ds
.withColumn("hiring_date", to_date(col("hiring_date"), "yyyy-MM-dd"))
.withColumn("termination_date", to_date(col("termination_date"), "yyyy-MM-dd"))
// Find the working days for each employee, where we generate dates from start to end for intervals
val workDays = ds
.withColumn("grouped", sequence(col("hiring_date"), col("termination_date")))
.withColumn("grouped", explode(col("grouped")))
// We drop duplicates because of the overlapping dates
.select("user", "grouped").dropDuplicates()
// We create an indicator, so we know later which date is holiday and which is not
.withColumn("ind", lit(1))
// We generate a full history of the first and last date the user was working, for all jobs
val fullDays = ds
.groupBy("user").agg(min("hiring_date").as("min"), max("termination_date").as("max"))
.withColumn("grouped", sequence(col("min"), col("max")).as("grouped"))
.withColumn("grouped", explode(col("grouped")))
.select("user", "grouped")
// We join fullDays with workDays, wherever 'ind' is 1, we have workdays, otherwise non workdays
val result = fullDays.join(workDays, Seq("user", "grouped"), "left")
// We filter working days, we group by user and we count
val workingDays = result.filter(col("ind").equalTo(1)).groupBy("user").count()
// We filter non working days, we group by user and we count
val nonWorkingDays = result.filter(col("ind").isNull).groupBy("user").count()
workingDays.show(10)
+----+-----+
|user|count|
+----+-----+
| B| 336|
| C| 1098|
| A| 3112|
+----+-----+
nonWorkingDays.show(10)
+----+-----+
|user|count|
+----+-----+
| C| 1899|
| A| 969|
+----+-----+
I hope this is what you need, good luck!
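To sanity-check the counts without a cluster, the same idea (expand every job into calendar days, deduplicate, then compare against the full span between the first and last day) can be sketched in plain Python. This is only an illustration of the logic, not Spark code; the helper names `days_between` and `summarize` are made up for this example.

```python
from datetime import date, timedelta

# Sample rows from the question: (user, hiring_date, termination_date)
rows = [
    ("A", date(1995, 9, 8), date(1997, 9, 9)),
    ("A", date(2003, 5, 8), date(2006, 11, 9)),
    ("A", date(2000, 5, 6), date(2003, 5, 9)),
    ("B", date(2007, 6, 27), date(2008, 5, 27)),
    ("C", date(2003, 1, 20), date(2006, 1, 19)),
    ("C", date(2011, 4, 3), date(2011, 4, 4)),
]


def days_between(start, end):
    """Return every calendar date from start to end, both inclusive."""
    return {start + timedelta(days=i) for i in range((end - start).days + 1)}


def summarize(rows):
    """Map each user to (working_days, non_working_days)."""
    worked = {}
    for user, start, end in rows:
        # A set drops the days that overlapping jobs have in common
        worked.setdefault(user, set()).update(days_between(start, end))
    summary = {}
    for user, days in worked.items():
        # Full calendar from the first to the last day the user worked
        full_span = days_between(min(days), max(days))
        summary[user] = (len(days), len(full_span - days))
    return summary


summary = summarize(rows)
print(summary)
# {'A': (3112, 969), 'B': (336, 0), 'C': (1098, 1899)}
```

Note that user B gets 0 non-working days here, whereas the filtered Spark result above simply has no row for B.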
If by working days you mean excluding the weekly holidays (Sat, Sun), we can do that by building an array of dates and then retaining only the dates that fall in the work week (using dayofweek).
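from pyspark.sql import functions as func
from pyspark.sql.window import Window as wd

# data_sdf is assumed to be the question's dataframe with both date columns cast to DATE
(
    data_sdf
    # termination date of the user's previous job (ordered by hiring date)
    .withColumn('prev_tdt',
                func.lag('termination_date').over(wd.partitionBy('user').orderBy('hiring_date')))
    # if this job overlaps the previous one, start counting the day after the previous job ended
    .withColumn('new_hiredt',
                func.when(func.col('prev_tdt') >= func.col('hiring_date'), func.date_add('prev_tdt', 1))
                .otherwise(func.col('hiring_date')))
    # one array element per calendar day of the (adjusted) job
    .withColumn('date_seq', func.expr('sequence(new_hiredt, termination_date, interval 1 day)'))
    # dayofweek: 1 = Sunday, 7 = Saturday
    .withColumn('num_workday',
                func.size(func.expr('filter(date_seq, x -> dayofweek(x) not in (1, 7))')))
    .withColumn('tot_days', func.size('date_seq'))
    # days between the previous termination and this start; 0 for a user's first job
    .withColumn('num_nonworkday',
                func.coalesce(func.datediff('new_hiredt', 'prev_tdt') - 1, func.lit(0)))
    .groupBy('user')
    .agg(func.sum('num_workday').alias('num_workday'),
         func.sum('num_nonworkday').alias('num_nonworkday'))
    .orderBy('user')
    .show()
)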
# +----+-----------+--------------+
# |user|num_workday|num_nonworkday|
# +----+-----------+--------------+
# | A| 2222| 969|
# | B| 240| 0|
# | C| 785| 1899|
# +----+-----------+--------------+
If you don’t want to exclude the weekly holidays, you can use the tot_days field as the number of work days. The new_hiredt column holds the adjusted start date for records that overlap with the previous record’s termination date.
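The overlap handling can also be checked in plain Python. The sketch below is a variation, not a translation: lag only looks at the immediately preceding row, while this version tracks the latest termination date seen so far (`covered_until`, a name made up here), so a job nested entirely inside an earlier one adds nothing. All helper names are illustrative.

```python
from datetime import date, timedelta

# Sample rows from the question: (user, hiring_date, termination_date)
rows = [
    ("A", date(1995, 9, 8), date(1997, 9, 9)),
    ("A", date(2003, 5, 8), date(2006, 11, 9)),
    ("A", date(2000, 5, 6), date(2003, 5, 9)),
    ("B", date(2007, 6, 27), date(2008, 5, 27)),
    ("C", date(2003, 1, 20), date(2006, 1, 19)),
    ("C", date(2011, 4, 3), date(2011, 4, 4)),
]


def weekday_count(start, end):
    """Count Monday-to-Friday dates from start to end, both inclusive."""
    total = (end - start).days + 1
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
    return sum((start + timedelta(days=i)).weekday() < 5 for i in range(total))


def summarize(rows):
    """Map each user to (weekdays_worked, calendar_days_between_jobs)."""
    by_user = {}
    for user, start, end in sorted(rows):  # sorted by user, then hiring date
        by_user.setdefault(user, []).append((start, end))
    summary = {}
    for user, intervals in by_user.items():
        workdays = gap_days = 0
        covered_until = None  # latest termination date seen so far
        for start, end in intervals:
            if covered_until is None:
                new_start = start
            else:
                # Days strictly between the covered period and this job
                gap_days += max((start - covered_until).days - 1, 0)
                # Skip the part of this job that is already covered
                new_start = max(start, covered_until + timedelta(days=1))
            if new_start <= end:
                workdays += weekday_count(new_start, end)
                covered_until = end
        summary[user] = (workdays, gap_days)
    return summary


summary = summarize(rows)
print(summary)
# {'A': (2222, 969), 'B': (240, 0), 'C': (785, 1899)}
```

On the sample data this agrees with the table above.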
In case anyone is interested in the PySpark solution based on vilalabinot’s post:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create PySpark dataframe
columns = ["user", "hiring_date", "termination_date"]
data = [("A", "1995-09-08", "1997-09-09"), ("A", "2003-05-08", "2006-11-09"),
        ("A", "2000-05-06", "2003-05-09"), ("B", "2007-06-27", "2008-05-27"),
        ("C", "2003-01-20", "2006-01-19"), ("C", "2011-04-03", "2011-04-04")]
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
rdd = spark.sparkContext.parallelize(data)
df = (
    spark
    .createDataFrame(rdd)
    .toDF(*columns)
    .withColumn('hiring_date', F.expr('CAST(hiring_date AS DATE)'))
    .withColumn('termination_date', F.expr('CAST(termination_date AS DATE)'))
)

# Find the working days for each employee, where we generate dates from start to end for intervals
# We drop duplicates because of the overlapping dates
# We create an indicator, so we know later which date is holiday and which is not
work_days = (
    df
    .withColumn("grouped", F.sequence(F.col("hiring_date"), F.col("termination_date")))
    .withColumn("grouped", F.explode(F.col("grouped")))
    .select("user", "grouped").dropDuplicates()
    .withColumn("ind", F.lit(1))
)

# We generate a full history of the first and last
# date the user was working, for all jobs
full_days = (
    df
    .groupBy("user")
    .agg(F.min("hiring_date").alias("min"), F.max("termination_date").alias("max"))
    .withColumn("grouped", F.sequence(F.col("min"), F.col("max")))
    .withColumn("grouped", F.explode(F.col("grouped")))
    .select("user", "grouped")
)

# We join full_days with work_days, wherever 'ind'
# is 1, we have workdays, otherwise non workdays
result = full_days.join(work_days, ["user", "grouped"], "left")

# We filter working days, we group by user and we count
working_days = (
    result.filter(F.col("ind") == 1)
    .groupBy("user").agg(F.count('user').alias('working_days'))
)

# We filter non working days, we group by user and we count
nonworking_days = (
    result.filter(F.col("ind").isNull())
    .groupBy("user").agg(F.count('user').alias('nonworking_days'))
)

# Return original dataframe with new values
df_final = (
    df
    .select('user')
    .dropDuplicates()
    .join(working_days, 'user', 'left')
    .join(nonworking_days, 'user', 'left')
)
df_final.show()
+----+------------+---------------+
|user|working_days|nonworking_days|
+----+------------+---------------+
| B| 336| null|
| A| 3112| 969|
| C| 1098| 1899|
+----+------------+---------------+
There is an easier way to solve this problem using sets (lists). First, we need to define a function that takes a start date and end date as parameters. It returns a list of dates as strings.
#
# 0 - Create utility function
#
# required library
import pandas as pd
# define function
def expand_date_range_to_list(start_dte, end_dte):
    return pd.date_range(start=start_dte, end=end_dte).strftime("%Y-%m-%d").tolist()
# required libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
# register df function
udf_expand_date_range_to_list = udf(expand_date_range_to_list, ArrayType(StringType()))
# register sql function
spark.udf.register("sql_expand_date_range_to_list", udf_expand_date_range_to_list)
# test function
out = expand_date_range_to_list("2022-09-01", "2022-09-05")
type(out)
out
The test call returns a Python list of five date strings, from '2022-09-01' through '2022-09-05'.
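If pandas is not available, a helper with the same behavior can be sketched with the standard library only. The name `expand_date_range` is made up for this example, and it assumes both arguments are ISO formatted strings (yyyy-MM-dd).

```python
from datetime import date, timedelta


def expand_date_range(start_dte, end_dte):
    """Return every date from start_dte to end_dte (inclusive) as ISO strings."""
    start = date.fromisoformat(start_dte)
    end = date.fromisoformat(end_dte)
    return [(start + timedelta(days=i)).isoformat()
            for i in range((end - start).days + 1)]


out = expand_date_range("2022-09-01", "2022-09-05")
print(type(out))  # <class 'list'>
print(out)
# ['2022-09-01', '2022-09-02', '2022-09-03', '2022-09-04', '2022-09-05']
```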
The next task is to create a dataset using the sample data. We will call the Spark user-defined function to add a new column to the dataset ("date_list").
#
# 1 - Create sample dataframe + view
#
# required library
from pyspark.sql.functions import *
# array of tuples - data
dat1 = [
("A", "1995-09-08", "1997-09-09"),
("A", "2003-05-08", "2006-11-09"),
("A", "2000-05-06", "2003-05-09"),
("B", "2007-06-27", "2008-05-27"),
("C", "2003-01-20", "2006-01-19"),
("C", "2011-04-03", "2011-04-04"),
]
# array of names - columns
col1 = ["user", "hiring_date", "termination_date"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# expand date range into list of dates
df1 = df1.withColumn("date_list", udf_expand_date_range_to_list(col("hiring_date"),
col("termination_date") ) )
# make temp hive view
df1.createOrReplaceTempView("employee_data1")
# show schema
df1.printSchema()
# show data
display(df1)
Now that we have our data, we can use Spark SQL to solve our problem. Please note, I turned the dataframe into a temporary view.
%sql
with cte as
(
select
user,
explode(date_list) as dates
from
employee_data1
)
select
user,
datediff(max(dates), min(dates)) as total_days,
count(distinct dates) as work_days,
datediff(max(dates), min(dates)) - count(distinct dates) + 1 as unworked_days
from cte
group by user
The explode function takes that array and makes an entry per user and date. Then we can use the min, max, count distinct, and datediff functions to calculate our answer.
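To see what the three columns of that query contain, here is a plain-Python sketch of the same aggregation over the sample data. It is not Spark SQL; the names `expand`, `exploded` and `report` are made up for this example, and the string min and max work only because ISO dates sort lexicographically.

```python
from datetime import date, timedelta

# Sample rows from the question, dates kept as strings like in the view
rows = [
    ("A", "1995-09-08", "1997-09-09"),
    ("A", "2003-05-08", "2006-11-09"),
    ("A", "2000-05-06", "2003-05-09"),
    ("B", "2007-06-27", "2008-05-27"),
    ("C", "2003-01-20", "2006-01-19"),
    ("C", "2011-04-03", "2011-04-04"),
]


def expand(start, end):
    """Return every date from start to end (inclusive) as ISO strings."""
    s, e = date.fromisoformat(start), date.fromisoformat(end)
    return [(s + timedelta(days=i)).isoformat() for i in range((e - s).days + 1)]


# "explode": one (user, date) pair per day of every job
exploded = [(user, d) for user, start, end in rows for d in expand(start, end)]

report = {}
for user in sorted({u for u, _ in exploded}):
    dates = [d for u, d in exploded if u == user]
    first, last = date.fromisoformat(min(dates)), date.fromisoformat(max(dates))
    total_days = (last - first).days            # datediff(max(dates), min(dates))
    work_days = len(set(dates))                 # count(distinct dates)
    unworked_days = total_days - work_days + 1  # same expression as the query
    report[user] = (total_days, work_days, unworked_days)

print(report)
# {'A': (4080, 3112, 969), 'B': (335, 336, 0), 'C': (2996, 1098, 1899)}
```

Because datediff does not count the first day, total_days as written is one less than work_days plus unworked_days; add 1 to total_days if you want the three columns to add up.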
The hard part about holidays is that they are specific to each company. If you save the dates as a csv file with a description and date on each line, you can create another temporary view out of the dataframe. Then you can join this dataframe to the result to figure out the count of holidays.
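As a rough illustration of that join, here is a plain-Python sketch for user B. The holiday list is entirely hypothetical (it stands in for rows loaded from your CSV file), and the set intersection plays the role of an inner join on the date column.

```python
from datetime import date, timedelta

# Hypothetical company holiday calendar: (description, date)
holidays = [
    ("Christmas Day", date(2007, 12, 25)),
    ("New Year's Day", date(2008, 1, 1)),
    ("Christmas Day", date(2008, 12, 25)),
]

# User B's single job from the question
start, end = date(2007, 6, 27), date(2008, 5, 27)
worked = {start + timedelta(days=i) for i in range((end - start).days + 1)}

# Equivalent of an inner join between worked dates and holiday dates
holiday_dates = {day for _, day in holidays}
holidays_in_range = worked & holiday_dates

print(len(worked))                  # 336
print(len(holidays_in_range))       # 2
print(len(worked - holiday_dates))  # 334
```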
In short, your problem is solved using an array of date strings and Spark SQL!
I would check the numbers above using dataframes. They seem to be off.
I think my solution is more elegant since you are working with sets of date strings.
Filtering for weekends is trivial using the dayofweek() function.
If you take total days (excluding weekends), then calculate distinct work days and unworked days, the first column should equal the sum of the last two columns. My answer shows that the math works!