PySpark – assigning group id based on group member count

Question:

I have a dataframe where I want to assign an id for each window partition and for every 5 rows within it. Meaning, the id should increase/change when the partition value changes or when the number of rows in a partition exceeds 5.

Input:

id | group |
1  |   A   |
2  |   A   |
3  |   A   |
4  |   A   |
5  |   A   |
6  |   A   |
7  |   A   |
8  |   A   |
9  |   B   |
10 |   B   |
11 |   C   |
12 |   C   |

Expected output:

id | group | group_id 
1  |   A   | 1
2  |   A   | 1
3  |   A   | 1
4  |   A   | 1
5  |   A   | 1
6  |   A   | 2
7  |   A   | 2
8  |   A   | 2
9  |   B   | 3
10 |   B   | 3
11 |   C   | 4
12 |   C   | 4

I tried a dense_rank approach followed by a udf. However, I cannot figure out how to carry over the previous rank value when I have to change the rank because of the "for each 5 rows" constraint.

import pyspark.sql.functions as F
from pyspark.sql.window import Window
from pyspark.sql.types import IntegerType

window = Window.partitionBy('group').orderBy('id')
df_rank = df.withColumn('group_rank', F.dense_rank().over(window))

group_size = 5
# +1 in group_size to include 5, and +1 after the division to avoid rank 0
map_func = F.udf(lambda x: int(x / (group_size+1)) + 1, IntegerType())
df_rank_map = df_rank.withColumn('group_id', map_func(df_rank.group_rank))

Output:

id | group | group_rank | group_id
1  |   A   | 1          | 1
2  |   A   | 2          | 1
3  |   A   | 3          | 1
4  |   A   | 4          | 1
5  |   A   | 5          | 1
6  |   A   | 6          | 2
7  |   A   | 7          | 2
8  |   A   | 8          | 2
9  |   B   | 1          | 1
10 |   B   | 2          | 1
11 |   C   | 1          | 1
12 |   C   | 2          | 1

There are two problems with my code. The obvious one is that the group_id is nowhere near what I want; I have yet to figure out the logic to carry the previous rank over to the next group.
The second problem is that, because of the udf, this logic is really slow. It would be really nice if there were a way that also improves the performance.

I have been pulling my hair out trying to figure out a way to do this, but to no avail. Any idea or hint on how to achieve the expected output?

Asked By: ArVe


Answers:

You can assign each distinct value of group a number, multiply this number by a sufficiently large constant, and then add the integer part of (group_rank - 1) / 5 so that the first 5 ranks fall into the same block.

First assign each element of the group column a unique number:

group_idx = df.select('group').distinct().withColumn('group_idx', F.dense_rank().over(Window.orderBy('group')))

Result:

+-----+---------+
|group|group_idx|
+-----+---------+
|    A|        1|
|    B|        2|
|    C|        3|
+-----+---------+

Now calculate the group id:

window = Window.partitionBy('group').orderBy('id')
df.withColumn('group_rank', F.dense_rank().over(window)) \
  .withColumn('idx_within_group', ((F.col('group_rank') - 1) / F.lit(5)).cast('int')) \
  .join(group_idx, 'group') \
  .withColumn('group_id', F.col('group_idx') * 100 + F.col('idx_within_group')) \
  .select('id', 'group', 'group_id') \
  .orderBy('id') \
  .show()

Output:

+---+-----+--------+
| id|group|group_id|
+---+-----+--------+
|  1|    A|     100|
|  2|    A|     100|
|  3|    A|     100|
|  4|    A|     100|
|  5|    A|     100|
|  6|    A|     101|
|  7|    A|     101|
|  8|    A|     101|
|  9|    B|     200|
| 10|    B|     200|
| 11|    C|     300|
| 12|    C|     300|
+---+-----+--------+

Using 100 as the multiplier is a bit arbitrary. This number has to be larger than (number of rows in the largest group) / 5.
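For example, a rough sketch along these lines could derive the constant from the data instead of hard-coding it (max_group_size and multiplier are illustrative names, and df is assumed to be the input dataframe from the question):

import pyspark.sql.functions as F

# Number of rows in the largest group.
max_group_size = (df.groupBy('group')
                    .count()
                    .agg(F.max('count').alias('max_count'))
                    .first()['max_count'])

# Any value larger than the largest possible idx_within_group keeps the
# composed group_ids unique across groups.
multiplier = max_group_size // 5 + 1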
If you need strictly consecutive numbers as group_ids, this approach won’t work.
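One possible variation (just a sketch, assuming the same df and 5-row blocks as above) is to drop the multiplier entirely and rank the (group, block) pairs with a second dense_rank; note that the unpartitioned window pulls all rows into a single partition:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('group').orderBy('id')

df_consecutive = (
    df.withColumn('group_rank', F.dense_rank().over(w))
      # 0-based block index within each group: ranks 1-5 -> 0, ranks 6-10 -> 1, ...
      .withColumn('block', ((F.col('group_rank') - 1) / F.lit(5)).cast('int'))
      # dense_rank over an unpartitioned window numbers the (group, block) pairs
      # 1, 2, 3, ... without gaps; Spark warns because all rows move to one partition.
      .withColumn('group_id', F.dense_rank().over(Window.orderBy('group', 'block')))
      .select('id', 'group', 'group_id')
)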

Answered By: werner

Create a dataframe with the sample values (assuming you have already initialized spark):

data  = [(9,"B"), (10,"B"), (11,"C"), (12,"C"), (1,"A"), (2,"A"), (3,"A"), (4,"A"), (5,"A"), (6,"A") , (7,"A"),(8,"A")]

columns = ["id","group"]

df_org = spark.createDataFrame(data=data, schema=columns)

Import all the required libraries

from pyspark.sql.window import Window
from pyspark.sql.functions import *

Create a reference dataframe to calculate the count

windowspec1 = Window.partitionBy('group')
df1 = df_org.withColumn('group_count', count('group').over(windowspec1))

Then take out distinct values from that and apply monotonically_increasing_id()

df1_temp = df1.select('group', 'group_count').distinct()
# monotonically_increasing_id() yields increasing, unique values (not guaranteed consecutive)
df1_temp = df1_temp.withColumn('rank', monotonically_increasing_id() + 1)

Finally, join the reference dataframe with the main dataframe that you had.

df_final = df_org.join(df1_temp,on = ['group'],how ='left')
df_final.show()

The joined output contains the original id and group columns along with group_count and the generated rank for each group.

Answered By: Banu