# Compute maximum number of consecutive identical integers in array column

## Question:

Consider the following:

``````
df = spark.createDataFrame(
    [
        [0, [1, 1, 4, 4, 4]],
        [1, [3, 2, 2, -4]],
        [2, [1, 1, 5, 5]],
        [3, [-1, -9, -9, -9, -9]],
    ],
    ['id', 'array_col']
)

df.show()
'''
+---+--------------------+
| id|           array_col|
+---+--------------------+
|  0|     [1, 1, 4, 4, 4]|
|  1|       [3, 2, 2, -4]|
|  2|        [1, 1, 5, 5]|
|  3|[-1, -9, -9, -9, -9]|
+---+--------------------+
'''
``````

The desired result would be:

``````
'''
+---+--------------------+-------------------------+
| id|           array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
|  0|     [1, 1, 4, 4, 4]|                        3|
|  1|       [3, 2, 2, -4]|                        2|
|  2|        [1, 1, 5, 5]|                        2|
|  3|[-1, -9, -9, -9, -9]|                        4|
+---+--------------------+-------------------------+
'''
``````

I’ve tried solving it by joining the array into a string and then applying `regexp_extract_all` with the regex I found in the question "RegExp match repeated characters".

``````
from pyspark.sql.functions import col, concat_ws, expr

df = df.withColumn('joined_str', concat_ws('', col('array_col')))
df.show()
'''
+---+--------------------+-----------+
| id|           array_col| joined_str|
+---+--------------------+-----------+
|  0|     [1, 1, 4, 4, 4]|      11444|
|  1|       [3, 2, 2, -4]|      322-4|
|  2|        [1, 1, 5, 5]|       1155|
|  3|[-1, -9, -9, -9, -9]| -1-9-9-9-9|
+---+--------------------+-----------+
'''

df = df.withColumn('regexp_extracted', expr('regexp_extract_all(joined_str, "([0-9])1*", 1)'))
df.show()
'''
+---+--------------------+----------+----------+----------------+
| id|           array_col| concat_ws|joined_str|regexp_extracted|
+---+--------------------+----------+----------+----------------+
|  0|     [1, 1, 4, 4, 4]|     11444|     11444| [1, 1, 4, 4, 4]|
|  1|       [3, 2, 2, -4]|     322-4|     322-4|    [3, 2, 2, 4]|
|  2|        [1, 1, 5, 5]|      1155|      1155|    [1, 1, 5, 5]|
|  3|[-1, -9, -9, -9, -9]|-1-9-9-9-9|-1-9-9-9-9| [1, 9, 9, 9, 9]|
+---+--------------------+----------+----------+----------------+
'''
``````

But then I got stuck because of 3 problems:

• Negative numbers would be a problem to match
• Numbers with more than 1 digit would be a problem to match
• Even if all numbers were between 0-9, the regex doesn’t seem to be working
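
To illustrate the second point, here is a minimal sketch in plain Python (not Spark) of how joining without a separator loses the element boundaries. The helper `join_like_concat_ws` is a hypothetical name that only mimics the effect of the `concat_ws` call above; it is not part of any library.

```python
def join_like_concat_ws(arr, sep=""):
    # Hypothetical helper: joins the integers the way the joined_str column was built
    return sep.join(str(x) for x in arr)


a = [1, 11]
b = [11, 1]

# Without a separator, two different arrays produce the same string,
# so no regex on the string can tell them apart.
joined_a = join_like_concat_ws(a)
joined_b = join_like_concat_ws(b)
print(joined_a, joined_b)  # 111 111

# With a separator the boundaries survive, but the regex would then
# have to match whole tokens rather than single characters.
print(join_like_concat_ws(a, ","), join_like_concat_ws(b, ","))  # 1,11 11,1
```

This suggests that working on the array elements directly is more robust than working on a joined string.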

## Answers:

I would suggest this function instead of regular expressions:

``````
def max_consecutive(arr):
    # Collect the length of every run of identical consecutive items
    retval = []
    running_value = None
    running_count = 0
    for running_item in arr:
        if running_value != running_item:
            # A new run starts: store the length of the previous one
            if running_count != 0:
                retval.append(running_count)
            running_value = running_item
            running_count = 1
        else:
            running_count += 1
    # Store the length of the final run
    retval.append(running_count)
    return max(retval)

# Test
print(max_consecutive([-1, -9, -9, -9, -9]))  # 4
print(max_consecutive([3, 2, 2, -4]))  # 2
``````
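
The same run-length idea can also be written more compactly with `itertools.groupby` from the standard library, which yields one group per run of equal adjacent items. This is only a sketch; the name `max_run_length` is made up for illustration, and returning 0 for an empty list is an assumption about the desired behaviour.

```python
from itertools import groupby


def max_run_length(arr):
    # groupby (without a key) groups equal adjacent items,
    # so counting each group gives the length of each run
    run_lengths = (sum(1 for _ in group) for _, group in groupby(arr))
    # default=0 is an assumption: an empty array has no runs
    return max(run_lengths, default=0)


print(max_run_length([-1, -9, -9, -9, -9]))  # 4
print(max_run_length([3, 2, 2, -4]))  # 2
```

To apply a plain Python function like this to a DataFrame column, it would presumably have to be wrapped as a UDF (for example with `pyspark.sql.functions.udf` and an `IntegerType()` return type), which is typically slower than a native Spark expression.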

Spark's higher-order function `aggregate` can be used to iterate over the arrays and collect the required information:

``````
from pyspark.sql import functions as F

df.withColumn('max_consecutive_identical', F.expr("""
    aggregate(
        array_col,
        (cast(1 as bigint), cast(1 as bigint), cast(null as bigint)),
        (acc, x) -> (
            if(acc.col3 = x, acc.col1 + 1, cast(1 as bigint)),
            if(acc.col3 = x and acc.col1 + 1 > acc.col2, acc.col1 + 1, acc.col2),
            x)
    ).col2""")).show()
``````

The idea is to use a struct of three bigints (with fields named `col1`, `col2` and `col3`) as the accumulator `acc`:

• `col3`: the array element seen in the previous iteration
• `col1`: the length of the current run, i.e. how many consecutive times the current value `x` has been seen so far. It is reset to 1 when the current value and the previous value do not match
• `col2`: the maximum of all `col1` values seen so far

After the last iteration, `col2` contains the maximum number of consecutive identical integers in the array.

Output:

``````
+---+--------------------+-------------------------+
| id|           array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
|  0|     [1, 1, 4, 4, 4]|                        3|
|  1|       [3, 2, 2, -4]|                        2|
|  2|        [1, 1, 5, 5]|                        2|
|  3|[-1, -9, -9, -9, -9]|                        4|
+---+--------------------+-------------------------+
``````
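
To make the accumulator logic easier to follow, here is a rough plain-Python imitation of the same fold using `functools.reduce`. It is a sketch for illustration only, not Spark code: the names `step` and `max_consecutive_fold` are invented here, and the tuple positions stand in for the struct fields `col1`, `col2` and `col3`.

```python
from functools import reduce


def step(acc, x):
    # acc mirrors the struct: (col1, col2, col3)
    col1, col2, col3 = acc
    # In the SQL version, comparing against the initial null is not true,
    # so the first element always takes the "no match" branch
    same = col3 is not None and col3 == x
    new_col1 = col1 + 1 if same else 1
    new_col2 = new_col1 if same and new_col1 > col2 else col2
    return (new_col1, new_col2, x)


def max_consecutive_fold(arr):
    # Start value mirrors (1, 1, null); index 1 corresponds to .col2
    return reduce(step, arr, (1, 1, None))[1]


for arr in ([1, 1, 4, 4, 4], [3, 2, 2, -4], [1, 1, 5, 5], [-1, -9, -9, -9, -9]):
    print(arr, max_consecutive_fold(arr))
```

Note that this sketch returns 1 for an empty list because the start value already has a maximum of 1; the SQL expression starts from the same value, so empty arrays may need separate handling if 0 is expected.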