Compute maximum number of consecutive identical integers in array column

Question:

Consider the following:

df = spark.createDataFrame([
    [0, [1, 1, 4, 4, 4]],
    [1, [3, 2, 2, -4]],
    [2, [1, 1, 5, 5]],
    [3, [-1, -9, -9, -9, -9]]]
    ,
    ['id', 'array_col']
)

df.show()
'''
+---+--------------------+
| id|           array_col|
+---+--------------------+
|  0|     [1, 1, 4, 4, 4]|
|  1|       [3, 2, 2, -4]|
|  2|        [1, 1, 5, 5]|
|  3|[-1, -9, -9, -9, -9]|
+---+--------------------+
'''

The desired result would be:

'''
+---+--------------------+-------------------------+
| id|           array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
|  0|     [1, 1, 4, 4, 4]|                        3|
|  1|       [3, 2, 2, -4]|                        2|
|  2|        [1, 1, 5, 5]|                        2|
|  3|[-1, -9, -9, -9, -9]|                        4|
+---+--------------------+-------------------------+
'''

I tried solving it by joining the array into a string and then applying regexp_extract_all with a regex I found in the question "RegExp match repeated characters".

from pyspark.sql.functions import col, concat_ws, expr

df = df.withColumn('joined_str', concat_ws('', col('array_col')))
df.show()
'''
+---+--------------------+-----------+
| id|           array_col| joined_str|
+---+--------------------+-----------+
|  0|     [1, 1, 4, 4, 4]|      11444|
|  1|       [3, 2, 2, -4]|      322-4|
|  2|        [1, 1, 5, 5]|       1155|
|  3|[-1, -9, -9, -9, -9]| -1-9-9-9-9|
+---+--------------------+-----------+
'''

df = df.withColumn('regexp_extracted', expr('regexp_extract_all(joined_str, "([0-9])1*", 1)'))
df.show()
'''
+---+--------------------+----------+----------+----------------+
| id|           array_col| concat_ws|joined_str|regexp_extracted|
+---+--------------------+----------+----------+----------------+
|  0|     [1, 1, 4, 4, 4]|     11444|     11444| [1, 1, 4, 4, 4]|
|  1|       [3, 2, 2, -4]|     322-4|     322-4|    [3, 2, 2, 4]|
|  2|        [1, 1, 5, 5]|      1155|      1155|    [1, 1, 5, 5]|
|  3|[-1, -9, -9, -9, -9]|-1-9-9-9-9|-1-9-9-9-9| [1, 9, 9, 9, 9]|
+---+--------------------+----------+----------+----------------+
'''

But then I got stuck because of 3 problems:

  • Negative numbers would be a problem to match
  • Numbers with more than 1 digit would be a problem to match
  • Even if all numbers were between 0-9, the regex doesn’t seem to be working
Asked By: L. B.


Answers:

I would suggest this function instead of regular expressions:

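def max_consequtive(arr):
    # Lengths of all completed runs of identical adjacent values
    retval = []
    running_value = None
    running_count = 0
    for running_item in arr:
        if running_value != running_item:
            # A new run starts: store the finished run, if there was one
            if running_count != 0:
                retval.append(running_count)
            running_value = running_item
            running_count = 1
        else:
            running_count += 1
    # Store the last run (0 for an empty array)
    retval.append(running_count)
    return max(retval)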

## Test

>>> max_consequtive([-1, -9, -9, -9, -9])
4
>>> max_consequtive([3, 2, 2, -4])
2
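
As a possible alternative, the same run lengths can be computed with itertools.groupby from the standard library. This is only a sketch and not part of the answer above: the name max_run_length is hypothetical, and returning 0 for an empty array is an assumption about the desired behaviour.

```python
from itertools import groupby


def max_run_length(arr):
    """Return the length of the longest run of equal adjacent items.

    Hypothetical helper; returns 0 for an empty list (an assumption).
    """
    # groupby yields one group per run of equal adjacent values,
    # so counting the items of each group gives the run lengths.
    return max((sum(1 for _ in run) for _, run in groupby(arr)), default=0)


print(max_run_length([1, 1, 4, 4, 4]))       # 3
print(max_run_length([-1, -9, -9, -9, -9]))  # 4
```

To apply a plain Python function like this to array_col, it would have to be wrapped as a UDF, for example with pyspark.sql.functions.udf and an IntegerType return type. Python UDFs are typically slower than built-in Spark SQL functions, so this trades speed for readability.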
Answered By: Stefanov.sm

You can use the higher-order function aggregate to iterate over each array and collect the required information:

from pyspark.sql import functions as F

df.withColumn('max_consecutive_identical', F.expr("""
    aggregate(
        array_col,
        (cast(1 as bigint), cast(1 as bigint), cast(null as bigint)),
        (acc, x) -> (
            if(acc.col3 = x, acc.col1 + 1, cast(1 as bigint)),
            if(acc.col3 = x and acc.col1 + 1 > acc.col2, acc.col1 + 1, acc.col2),
            x)
    ).col2""")).show()

The idea is to use a struct of 3 bigints (with fields named col1, col2 and col3) as the accumulator acc.

  • col3: the array element seen in the previous iteration (null before the first element)
  • col1: the length of the current run of identical values. It is reset to 1 if the current value and the previous value do not match
  • col2: the maximum of all col1 values seen so far

After the last iteration, col2 contains the maximum number of consecutive identical integers in the array.
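
To make the accumulator logic easier to follow, here is a minimal pure-Python sketch of the same fold using functools.reduce. It only imitates the Spark expression and is not Spark code; the names step, run, best, prev and max_consecutive_identical are hypothetical, and a tuple stands in for the struct (col1, col2, col3).

```python
from functools import reduce


def step(acc, x):
    # run, best and prev play the roles of col1, col2 and col3
    run, best, prev = acc
    # Extend the current run if x repeats the previous element, else restart at 1
    run = run + 1 if prev == x else 1
    # Keep the longest run seen so far and remember x for the next iteration
    return (run, max(best, run), x)


def max_consecutive_identical(arr):
    # Same initial accumulator as in the SQL expression: (1, 1, null).
    # As a consequence, an empty list yields 1, not 0.
    return reduce(step, arr, (1, 1, None))[1]


for arr in ([1, 1, 4, 4, 4], [3, 2, 2, -4], [1, 1, 5, 5], [-1, -9, -9, -9, -9]):
    print(arr, max_consecutive_identical(arr))
```

Because best never drops below 1, taking max(best, run) gives the same value as the second if(...) in the SQL expression, which only replaces col2 when the extended run is longer.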

Output:

+---+--------------------+-------------------------+
| id|           array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
|  0|     [1, 1, 4, 4, 4]|                        3|
|  1|       [3, 2, 2, -4]|                        2|
|  2|        [1, 1, 5, 5]|                        2|
|  3|[-1, -9, -9, -9, -9]|                        4|
+---+--------------------+-------------------------+
Answered By: werner