Calendarized cost by year and month in Spark

Question:

I am fairly new to PySpark and looking for the best way to perform the following calculations:
I have the following data frame:

+-------------+------------+--------------+------------+------------+-----+
|invoice_month|invoice_year|start_date_key|end_date_key|invoice_days| cost|
+-------------+------------+--------------+------------+------------+-----+
|           11|        2007|      20071022|    20071120|          30|  100|
|           12|        2007|      20071121|    20071220|          30|  160|
|            5|        2014|      20140423|    20140522|          30|  600|
|            5|        2005|      20050503|    20050602|          31|  470|
|            7|        2012|      20120702|    20120801|          31|  200|
|            7|        2013|      20130712|    20130812|          32|  300|
|            2|        2010|      20100212|    20100316|          33|  640|
|           12|        2013|      20130619|    20130828|          71|  820|
+-------------+------------+--------------+------------+------------+-----+

What I am trying to calculate is the calendarized cost by invoice month and year. For example, the first invoice spans two months (October and November), so the prorated cost of the first invoice that falls in November should be 20/30 * 100 = 66.67 (20 of its 30 days fall in November). The prorated cost of the second invoice that falls in November should be 10/30 * 160 = 53.33 (10 days, from 11-21 to 11-30). So the calendarized cost for November 2007 should be 66.67 + 53.33 = 120.
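Spelled out in plain Python, the November 2007 figure I am expecting is simply:

nov_2007 = 20/30 * 100 + 10/30 * 160   # first invoice share + second invoice share
print(round(nov_2007, 2))              # 120.0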

My initial thought was a brute-force approach: create a separate data frame with the unique (invoice month, invoice year) tuples, go through it row by row, join back to the original data frame to select all the invoices that fall within range based on start_date_key and end_date_key, and calculate the cost for each. The calculation gets even trickier when an invoice spans more than two months, like the last invoice. Would there be a way to expand the existing data frame and create additional weighted columns based on start_date_key and end_date_key? For example, I would create 201306, 201307 and 201308 columns for the last invoice so that I can calculate the weighted cost for each and then aggregate.

I am not sure if there is a more efficient way since I am fairly new to PySpark in general. Any hints would be much appreciated!

Asked By: yuen23


Answers:

The idea is to use a UDF to split each invoice into monthly intervals and then assign each month its share of the invoice's cost.

We create a new column (intervals) that contains an array of structs. There is one entry in the array for each month that belongs to the invoice, and each struct within the array contains three values: year, month and the share of the cost. Finally the array column is exploded, the result is grouped by year and month, and the costs are summed up:

from pyspark.sql import functions as F
from pyspark.sql import types as T

calc_intervals_udf=F.udf(calc_intervals, returnType = T.ArrayType(
  T.StructType([T.StructField("year", T.IntegerType()),
                T.StructField("month", T.IntegerType()), 
                T.StructField("cost", T.FloatType())])))

df.withColumn("intervals", calc_intervals_udf("start_date_key", "end_date_key", "cost")) 
  .select("intervals") 
  .withColumn("intervals", F.explode("intervals")) 
  .select("intervals.*") 
  .groupBy("year", "month") 
  .agg(F.sum("cost")) 
  .orderBy("year", "month") 
  .show()

Finally, the logic for the UDF; note that this function needs to be defined before the UDF above is registered. This Python code is completely independent of Spark:

def calc_intervals(start, end, cost):
  import datetime
  from dateutil import parser

  # last calendar day of the month that any_day falls in
  def last_day_of_month(any_day):
    next_month = any_day.replace(day=28) + datetime.timedelta(days=4)
    return next_month - datetime.timedelta(days=next_month.day)

  # split [begin, end] into one [interval_start, interval_end] pair per calendar month
  def monthlist(begin, end):
    result = []
    while True:
      if begin.month == 12:
        next_month = begin.replace(year=begin.year+1, month=1, day=1)
      else:
        next_month = begin.replace(month=begin.month+1, day=1)
      if next_month > end:
        break
      result.append([begin, last_day_of_month(begin)])
      begin = next_month
    result.append([begin, end])
    return result

  # (year, month, share of the cost) for a single monthly interval,
  # weighted by the interval's days relative to the whole invoice
  def cost_per_interval(invoice_start, invoice_end, interval_start, interval_end, cost):
    return (interval_start.year, interval_start.month,
      ((interval_end - interval_start).days + 1) / ((invoice_end - invoice_start).days + 1) * cost)

  start_dt = parser.isoparse(str(start))
  end_dt = parser.isoparse(str(end))
  intervals = monthlist(start_dt, end_dt)
  return [cost_per_interval(start_dt, end_dt, i[0], i[1], cost) for i in intervals]
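As a quick sanity check, the function can be called on its own, outside of Spark; for the last invoice (71 days, 2013-06-19 to 2013-08-28) it returns roughly:

calc_intervals(20130619, 20130828, 820)
# -> [(2013, 6, 138.59...), (2013, 7, 358.02...), (2013, 8, 323.38...)]

The three monthly shares add up to the original cost of 820 again.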

Most parts of this function are taken from this answer.

The logic ignores the columns invoice_month, invoice_year and invoice_days and only uses start_date_key and end_date_key to calculate the intervals. My results differ a bit from the numbers in the question. I believe this is due to an off-by-one error either in the question or in the answer.

+----+-----+------------------+
|year|month|sum(cost)         |
+----+-----+------------------+
|2005|5    |439.67742919921875|
|2005|6    |30.322580337524414|
|2007|10   |33.33333206176758 |
|2007|11   |119.99999618530273|
|2007|12   |106.66666412353516|
|2010|2    |329.69696044921875|
|2010|3    |310.30303955078125|
|2012|7    |193.5483856201172 |
|2012|8    |6.451612949371338 |
|2013|6    |138.591552734375  |
|2013|7    |545.5281677246094 |
|2013|8    |435.8802795410156 |
|2014|4    |160.0             |
|2014|5    |440.0             |
+----+-----+------------------+
Answered By: werner

In PySpark, you could try the following. It creates the sequence of months that every invoice intersects and explodes it, so that you can group on the month. Then a sequence of all the invoice's days is created and the number of days that fall into every month is counted. Finally, the prorated costs are aggregated.

Input:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(11, 2007, 20071022, 20071120, 30, 100),
     (12, 2007, 20071121, 20071220, 30, 160),
     ( 5, 2014, 20140423, 20140522, 30, 600),
     ( 5, 2005, 20050503, 20050602, 31, 470),
     ( 7, 2012, 20120702, 20120801, 31, 200),
     ( 7, 2013, 20130712, 20130812, 32, 300),
     ( 2, 2010, 20100212, 20100316, 33, 640),
     (12, 2013, 20130619, 20130828, 71, 820)],
    ['invoice_month', 'invoice_year', 'start_date_key', 'end_date_key', 'invoice_days', 'cost'])

Script:

start = "to_date(start_date_key, 'yyyyMMdd')"
end = "to_date(end_date_key, 'yyyyMMdd')"
month = F.expr(f"sequence(trunc({start}, 'MM'), trunc({end}, 'MM'), interval 1 month)")
df = df.withColumn('month', F.explode(month))

range_days = F.expr(f"sequence({start}, {end})")
intersect_days = F.array_intersect(range_days, F.expr("sequence(month, last_day(month))"))
df = df.withColumn('days', F.size(intersect_days))
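# For the first invoice, the two exploded rows at this point look roughly like
# this (other columns omitted): 10 of its days fall in October, 20 in November.
# +--------------+------------+----+----------+----+
# |start_date_key|end_date_key|cost|     month|days|
# +--------------+------------+----+----------+----+
# |      20071022|    20071120| 100|2007-10-01|  10|
# |      20071022|    20071120| 100|2007-11-01|  20|
# +--------------+------------+----+----------+----+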

df = (df
    .groupBy(F.date_format('month', 'yyyyMM').alias('year_month'))
    .agg(F.round(F.sum(F.col('days') / F.col('invoice_days') * F.col('cost')), 5).alias('cost'))
    .sort('year_month')
)
df.show()
# +----------+---------+
# |year_month|     cost|
# +----------+---------+
# |    200505|439.67742|
# |    200506| 30.32258|
# |    200710| 33.33333|
# |    200711|    120.0|
# |    200712|106.66667|
# |    201002|329.69697|
# |    201003|310.30303|
# |    201207|193.54839|
# |    201208|  6.45161|
# |    201306|138.59155|
# |    201307|545.52817|
# |    201308|435.88028|
# |    201404|    160.0|
# |    201405|    440.0|
# +----------+---------+
Answered By: ZygD