PySpark: efficiently create patterns within each window

Question:

I want to create a base dataframe from an existing one that does not contain everything I need. For example, I have a dataframe collecting the number of candies each person (tracked by "id") bought each year-month (but in this case each person did not buy candies every month):

| id | year_month | num_of_candies_bought |
|  1 | 2022-01    |                     5 |
|  1 | 2022-03    |                    10 |
|  1 | 2022-04    |                     2 |
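
For reproducibility, the example dataframe can be built like this (a minimal sketch, assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, "2022-01", 5), (1, "2022-03", 10), (1, "2022-04", 2)],
    ["id", "year_month", "num_of_candies_bought"],
)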

What I want is to track them by fixing the year-months I'm interested in, like this (for the first 5 months of this year):

| id | year_month | num_of_candies_bought |
|  1 | 2022-01    |                     5 |
|  1 | 2022-02    |                     0 |
|  1 | 2022-03    |                    10 |
|  1 | 2022-04    |                     2 |
|  1 | 2022-05    |                     0 |

I think one way to do this is to use crossJoin, but it turns out that this takes a long time to process. Is there any way to do this without any join? In my work the first dataframe is very, very large (a million rows, for instance) while the second is fixed (like in this case, only 5 rows) and much, much smaller. If a crossJoin is needed after all, is it possible to improve its performance drastically?
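
For reference, the crossJoin version I have in mind looks roughly like this (a sketch, with months_df being the small fixed frame and spark an existing SparkSession):

months_df = spark.createDataFrame(
    [(m,) for m in ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]],
    ["year_month"],
)

filled = (
    df.select("id").distinct()
    .crossJoin(months_df)                    # every (id, year_month) pair
    .join(df, ["id", "year_month"], "left")  # attach the actual purchases
    .na.fill(0, ["num_of_candies_bought"])   # months without purchases get 0
)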

P.S. I want this done separately for each person (so I expect to need something like Window.partitionBy).

Answers:

I’d simply add a 0 (zero) line for each id and each year_month.
Let’s assume df is your dataframe.

from pyspark.sql import functions as F

# generate a list of all year_month values you need
year_month = ["2022-01", "2022-02", "2022-03", "2022-04", "2022-05"]

# one zero-count row per (id, year_month) pair
df_id = (
    df.select("id")
    .distinct()
    .withColumn("num_of_candies_bought", F.lit(0))
    .withColumn("year_month", F.explode(F.array(*map(F.lit, year_month))))
)

# union the zero rows with the real data, then sum per (id, year_month):
# months with actual purchases keep their count, missing months end up as 0
df = (
    df.unionByName(df_id)
    .groupBy("id", "year_month")
    .agg(F.sum("num_of_candies_bought").alias("num_of_candies_bought"))
)
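
On the sample data this should give exactly the series asked for; a quick sanity check (exact show() spacing may vary):

df.orderBy("id", "year_month").show()
# +---+----------+---------------------+
# | id|year_month|num_of_candies_bought|
# +---+----------+---------------------+
# |  1|   2022-01|                    5|
# |  1|   2022-02|                    0|
# |  1|   2022-03|                   10|
# |  1|   2022-04|                    2|
# |  1|   2022-05|                    0|
# +---+----------+---------------------+
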
Answered By: Steven