How to compare row values in PySpark using lead/lag?

Question:

I have a DataFrame with a column named ‘Year’. I want to check whether a row’s value matches the previous row’s value and, if it does, update another column ‘FLAG’ with the value 100.

df_prod

Year  FLAG      
2020  None
2020  None
2019  None 
2021  None
2021  None
2022  None

Expected Output

    Year  FLAG      
    2019  None
    2020  None
    2020  100
    2021  None
    2021  100
    2022  None

Asked By: BigData Lover


Answers:

The following snippet, which uses a window function together with lag, should do that for you:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, lag, when

df = spark.createDataFrame(
    [(2020, None), (2020, None), (2019, None), (2021, None), (2021, None), (2022, None)],
    "Year: int, FLAG: int",
)

# No partition key: the whole DataFrame is ordered as a single partition.
# That is fine for small data, but it will not scale to large DataFrames.
window = Window.partitionBy().orderBy("Year")

# lag("Year") looks one row back within the ordered window; when the current
# value equals the previous one, FLAG becomes 100, otherwise it stays null.
df.withColumn("FLAG", when(col("Year") == lag(col("Year")).over(window), 100)).show()

+----+----+
|Year|FLAG|
+----+----+
|2019|null|
|2020|null|
|2020| 100|
|2021|null|
|2021| 100|
|2022|null|
+----+----+
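If you want to sanity-check the lag logic without a running Spark session, the same comparison can be sketched in plain Python. The helper name `flag_alternate` is hypothetical, not part of any API; it sorts by year (mirroring `orderBy("Year")`) and compares each row to its predecessor (mirroring `lag`):

```python
def flag_alternate(years):
    # Mimic Window.orderBy("Year"): process rows in ascending Year order.
    ordered = sorted(years)
    # Mimic lag("Year").over(window): compare each row to the previous one,
    # emitting 100 on a match and None otherwise (the first row has no lag).
    return [
        100 if i > 0 and ordered[i] == ordered[i - 1] else None
        for i in range(len(ordered))
    ]

print(flag_alternate([2020, 2020, 2019, 2021, 2021, 2022]))
# [None, None, 100, None, 100, None]
```

This matches the Spark output above: the second occurrence of each duplicated year gets 100, everything else stays null.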
Answered By: Bartosz Gajda