How do I append new rows to a PySpark DataFrame guaranteeing a unique ID?

Question:

I have two PySpark DataFrame objects that I wish to concatenate. One of the DataFrames, df_a, has a column unique_id derived using pyspark.sql.functions.monotonically_increasing_id(); the other, df_b, does not. I want to append the rows of df_b to df_a, but I need to generate values for df_b's unique_id column that do not coincide with any of the values in df_a.unique_id.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame(
    [
        (1, "a", 42949672960),
        (2, "b", 85899345920),
        (3, "c", 128849018880)
    ],
    ["number", "letter", "unique_id"]
)

df_b = spark.createDataFrame(
    [
        (3, "c"),
        (4, "c"),
        (5, "d")
    ],
    ["number", "letter"]
)
df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id())

df = df_a.union(df_b)
df.show()

I looked to see whether pyspark.sql.functions.monotonically_increasing_id() takes a parameter enforcing a minimum value, but it does not.
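For what it's worth, the docs note that the current implementation of monotonically_increasing_id() puts the partition ID in the upper 31 bits of the 64-bit result and the record number within each partition in the lower 33 bits, which explains why the ids in df_a are large multiples of 2**33. A quick check (a sketch, assuming the session and DataFrames above; exact values depend on partitioning):

# The ids above decode as partition * 2**33: 42949672960 is partition 5,
# 85899345920 is partition 10, 128849018880 is partition 15.
df_a.select(
    "unique_id",
    F.floor(F.col("unique_id") / F.lit(2 ** 33)).alias("partition")
).show()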

One final thing to note: df_a is a massive DataFrame that needs to be appended to regularly. If I needed to reassign unique ids to df_a using a function other than pyspark.sql.functions.monotonically_increasing_id() to make a potential solution work long-term, I could do that once, but not every time I append new data.

Any direction would be appreciated—thank you!

Asked By: Clade


Answers:

You can always add a constant offset to monotonically_increasing_id(). Take the current maximum unique_id in df_a and shift every generated id past it:

# Largest id already present in df_a; monotonically_increasing_id() is >= 0,
# so offsetting by n + 1 guarantees no collision with existing ids.
n = df_a.select(F.max('unique_id').alias('max_n')).first().max_n
df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id() + F.lit(n + 1))
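One edge case worth guarding against: on an empty df_a, F.max('unique_id') comes back as None, so n + 1 would raise a TypeError. A defensive variant (a sketch; the -1 fallback is my assumption, chosen so ids start at 0 when df_a is empty):

# Guard against an empty df_a: max('unique_id') returns None there.
row = df_a.select(F.max('unique_id').alias('max_n')).first()
n = row.max_n if row.max_n is not None else -1

df_b = df_b.withColumn("unique_id", F.monotonically_increasing_id() + F.lit(n + 1))
df = df_a.union(df_b)  # all new ids are >= n + 1, so they cannot collide

Because the same offset is added to every row of df_b, the new ids also stay distinct among themselves.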
Answered By: bzu