Need to add sequential numbering as per the grouping in Pyspark

Question:

I am working on one code where I need to add sequential number as per the grouping on the basis of column A & column B. Below is the table/dataframe I have. The data is sorted by colA & Date.

colA colB Date
A 1 01-01-2014
A 1 01-02-2014
A 3 30-04-2014
A 3 05-05-2014
A 2 25-05-2014
A 1 06-06-2014
A 1 21-07-2014
B 1 04-09-2014
B 1 19-10-2014
B 1 03-12-2014
C 3 17-01-2015
C 2 03-03-2015
C 2 17-04-2015

Following is the expected result:

colA colB Date ROWNUM
A 1 01-01-2014 1
A 1 01-02-2014 2
A 3 30-04-2014 1
A 3 05-05-2014 2
A 2 25-05-2014 1
A 1 06-06-2014 1
A 1 21-07-2014 2
B 1 04-09-2014 1
B 1 19-10-2014 2
B 1 03-12-2014 3
C 3 17-01-2015 1
C 2 03-03-2015 1
C 2 17-04-2015 2

I am trying to use row_number here and getting the below result where for A(colA) the next occurrence of 1(colB) has rownumber added as 3 instead of 1:

colA colB Date ROWNUM
A 1 01-01-2014 1
A 1 01-02-2014 2
A 3 30-04-2014 1
A 3 05-05-2014 2
A 2 25-05-2014 1
A 1 06-06-2014 3
A 1 21-07-2014 4
B 1 04-09-2014 1
B 1 19-10-2014 2
B 1 03-12-2014 3
C 3 17-01-2015 1
C 2 03-03-2015 1
C 2 17-04-2015 2
Asked By: Minal

||

Answers:

This would work:

w=Window.orderBy(F.asc("colA"), F.asc("Date"))

df
.withColumn("eq", F.when(F.lag("colB").over(w)==F.col("colB"), 0).otherwise(1))
.withColumn("groups", F.sum("eq").over(w))
.withColumn("row_num", F.row_number().over(w.partitionBy(F.col("colA"), F.col("groups"))))
.show()

Thanks to this answer: Window functions: PARTITION BY one column after ORDER BY another

Input:

+----+----+----------+
|colA|colB|      Date|
+----+----+----------+
|   A|   1|2014-01-01|
|   A|   1|2014-02-01|
|   A|   3|2014-04-30|
|   A|   3|2014-05-05|
|   A|   2|2014-05-25|
|   A|   1|2014-06-06|
|   A|   1|2014-07-21|
|   B|   1|2014-09-04|
|   B|   1|2014-10-19|
|   B|   1|2014-12-03|
|   C|   3|2015-01-17|
|   C|   2|2015-03-03|
|   C|   2|2015-04-17|
+----+----+----------+

Output (Keeping the intermediate columns for understanding, you can drop them):

+----+----+----------+---+------+-------+
|colA|colB|      Date| eq|groups|row_num|
+----+----+----------+---+------+-------+
|   A|   1|2014-01-01|  1|     1|      1|
|   A|   1|2014-02-01|  0|     1|      2|
|   A|   3|2014-04-30|  1|     2|      1|
|   A|   3|2014-05-05|  0|     2|      2|
|   A|   2|2014-05-25|  1|     3|      1|
|   A|   1|2014-06-06|  1|     4|      1|
|   A|   1|2014-07-21|  0|     4|      2|
|   B|   1|2014-09-04|  0|     4|      1|
|   B|   1|2014-10-19|  0|     4|      2|
|   B|   1|2014-12-03|  0|     4|      3|
|   C|   3|2015-01-17|  1|     5|      1|
|   C|   2|2015-03-03|  1|     6|      1|
|   C|   2|2015-04-17|  0|     6|      2|
+----+----+----------+---+------+-------+
Answered By: Ronak Jain