Compute maximum number of consecutive identical integers in array column
Question:
Consider the following:
df = spark.createDataFrame([
[0, [1, 1, 4, 4, 4]],
[1, [3, 2, 2, -4]],
[2, [1, 1, 5, 5]],
[3, [-1, -9, -9, -9, -9]]]
,
['id', 'array_col']
)
df.show()
'''
+---+--------------------+
| id| array_col|
+---+--------------------+
| 0| [1, 1, 4, 4, 4]|
| 1| [3, 2, 2, -4]|
| 2| [1, 1, 5, 5]|
| 3|[-1, -9, -9, -9, -9]|
+---+--------------------+
'''
The desired result would be:
'''
+---+--------------------+-------------------------+
| id| array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
| 0| [1, 1, 4, 4, 4]| 3|
| 1| [3, 2, 2, -4]| 2|
| 2| [1, 1, 5, 5]| 2|
| 3|[-1, -9, -9, -9, -9]| 4|
+---+--------------------+-------------------------+
'''
I tried solving it by joining the array into a string and then applying regexp_extract_all
with a regex I found in the question "RegExp match repeated characters".
from pyspark.sql.functions import col, concat_ws, expr
df = df.withColumn('joined_str', concat_ws('', col('array_col')))
df.show()
'''
+---+--------------------+-----------+
| id| array_col| joined_str|
+---+--------------------+-----------+
| 0| [1, 1, 4, 4, 4]| 11444|
| 1| [3, 2, 2, -4]| 322-4|
| 2| [1, 1, 5, 5]| 1155|
| 3|[-1, -9, -9, -9, -9]| -1-9-9-9-9|
+---+--------------------+-----------+
'''
df = df.withColumn('regexp_extracted', expr('regexp_extract_all(joined_str, "([0-9])1*", 1)'))
df.show()
'''
+---+--------------------+----------+----------+----------------+
| id| array_col| concat_ws|joined_str|regexp_extracted|
+---+--------------------+----------+----------+----------------+
| 0| [1, 1, 4, 4, 4]| 11444| 11444| [1, 1, 4, 4, 4]|
| 1| [3, 2, 2, -4]| 322-4| 322-4| [3, 2, 2, 4]|
| 2| [1, 1, 5, 5]| 1155| 1155| [1, 1, 5, 5]|
| 3|[-1, -9, -9, -9, -9]|-1-9-9-9-9|-1-9-9-9-9| [1, 9, 9, 9, 9]|
+---+--------------------+----------+----------+----------------+
'''
But then I got stuck because of 3 problems:
- Negative numbers would be a problem to match
- Numbers with more than 1 digit would be a problem to match
- Even if all numbers were between 0-9, the regex doesn’t seem to be working
Answers:
I would suggest this function instead of regular expressions:
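def max_consequtive(arr):
    retval = []            # lengths of all completed runs
    running_value = None   # value of the run currently being counted
    running_count = 0      # length of the run currently being counted
    for running_item in arr:
        if running_value != running_item:
            # A new run starts: store the length of the previous one (if any)
            if running_count != 0:
                retval.append(running_count)
            running_value = running_item
            running_count = 1
        else:
            running_count += 1
    # Store the length of the last run
    retval.append(running_count)
    return max(retval)
## Test
max_consequtive([-1, -9, -9, -9, -9])
4
max_consequtive([3, 2, 2, -4])
2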
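The loop above can also be expressed with itertools.groupby, which groups equal adjacent items. This is only a sketch of the same idea, not part of the original answer; the name max_run_length is made up for illustration, and returning 0 for an empty array is an assumption about the desired behaviour.

```python
from itertools import groupby


def max_run_length(values):
    """Return the length of the longest run of equal adjacent items.

    Returns 0 for an empty input (an assumption, adjust as needed).
    """
    # groupby yields one group per run of equal adjacent items;
    # counting the items of each group gives the run length.
    return max((sum(1 for _ in group) for _, group in groupby(values)), default=0)


print(max_run_length([-1, -9, -9, -9, -9]))  # 4
print(max_run_length([3, 2, 2, -4]))         # 2
```

To apply a plain Python function like this to a DataFrame column, it would have to be wrapped as a UDF (for example with pyspark.sql.functions.udf and an IntegerType return type), which is usually slower than a built-in SQL expression.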
Using aggregate to iterate over the arrays and collect the required information:
from pyspark.sql import functions as F

df.withColumn('max_consecutive_identical', F.expr("""
    aggregate(
        array_col,
        (cast(1 as bigint), cast(1 as bigint), cast(null as bigint)),
        (acc, x) -> (
            if(acc.col3 = x, acc.col1 + 1, cast(1 as bigint)),
            if(acc.col3 = x and acc.col1 + 1 > acc.col2, acc.col1 + 1, acc.col2),
            x)
    ).col2""")).show()
The idea is to use a struct of 3 bigints (with fields named col1, col2 and col3) as the acc variable:
- col3: the array element seen in the previous iteration
- col1: the length of the current run, i.e. how many consecutive values equal to x we have seen so far. It is reset to 1 if the current value and the previous value do not match
- col2: the maximum of all col1 values seen so far
After the last iteration, col2 contains the maximum number of consecutive identical integers in the array.
Output:
+---+--------------------+-------------------------+
| id| array_col|max_consecutive_identical|
+---+--------------------+-------------------------+
| 0| [1, 1, 4, 4, 4]| 3|
| 1| [3, 2, 2, -4]| 2|
| 2| [1, 1, 5, 5]| 2|
| 3|[-1, -9, -9, -9, -9]| 4|
+---+--------------------+-------------------------+
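To make the accumulator logic easier to follow, here is a plain-Python sketch of the same fold using functools.reduce. It mirrors the three struct fields as a tuple (col1, col2, col3); the names step and max_consecutive_fold are hypothetical and exist only for this illustration, and Python's None stands in for the SQL null.

```python
from functools import reduce


def step(acc, x):
    """One iteration of the fold: acc is (col1, col2, col3)."""
    col1, col2, col3 = acc
    same_as_previous = col3 == x
    # col1: extend the current run, or restart it at 1
    new_col1 = col1 + 1 if same_as_previous else 1
    # col2: keep the longest run seen so far
    new_col2 = col1 + 1 if same_as_previous and col1 + 1 > col2 else col2
    # col3: remember the current element for the next iteration
    return (new_col1, new_col2, x)


def max_consecutive_fold(values):
    # Same start value as in the SQL expression: (1, 1, null)
    return reduce(step, values, (1, 1, None))[1]


for row in ([1, 1, 4, 4, 4], [3, 2, 2, -4], [1, 1, 5, 5], [-1, -9, -9, -9, -9]):
    print(row, max_consecutive_fold(row))
```

One consequence of starting with col2 = 1 is that this sketch returns 1 for an empty list. The SQL expression uses the same start value, so it likely behaves the same way for empty arrays; if 0 is wanted there, that case would need separate handling.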