Is there any function to locate all occurrences in a column of a PySpark dataframe?

Question:

I have the following PySpark dataframe:

A
1001
1110
1000

I want to find all occurrences of 1 in this column and put them into a new column, like this:

A Occurrences
1001 0,3
1110 0,1,2
1000 0

I tried the locate function as follows:

from pyspark.sql import functions as F
dfa_occ = dfa.withColumn('Occurrences', F.locate('1', F.col('A')) - 1)

However, this only extracts the first occurrence of 1, while I need to find all of them!
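Indeed, locate only ever returns the 1-based position of the first match (or 0 if there is none), so this call produces a single index per row rather than a list. A minimal check, reconstructing the sample data:

from pyspark.sql import functions as F
dfa = spark.createDataFrame([('1001',), ('1110',), ('1000',)], ['A'])
# locate('1', A) finds the first '1' only; subtracting 1 makes it 0-based
dfa.withColumn('first_occurrence', F.locate('1', F.col('A')) - 1).show()
# +----+----------------+
# |   A|first_occurrence|
# +----+----------------+
# |1001|               0|
# |1110|               0|
# |1000|               0|
# +----+----------------+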

Asked By: Abdessamad139


Answers:

You could try posexplode and then aggregate the result back into a list. Here, split with the pattern (?!$) breaks the string into individual characters (the lookahead avoids a trailing empty element).

from pyspark.sql import functions as F
df = spark.createDataFrame([('1001',), ('1110',), ('1000',)], ['A'])

df = (df
    .select('*', F.posexplode(F.split('A', '(?!$)')))  # one row per character: index in `pos`, value in `col`
    .filter('col = 1')                                 # keep only the characters equal to '1'
    .groupBy('A')
    .agg(F.collect_list('pos').alias('Occurrences'))   # collect the matching indices back into a list
)
df.show()
# +----+-----------+
# |   A|Occurrences|
# +----+-----------+
# |1001|     [0, 3]|
# |1000|        [0]|
# |1110|  [0, 1, 2]|
# +----+-----------+
Answered By: ZygD
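
Note that the explode-and-aggregate approach drops any row whose string contains no '1' at all, since the filter removes all of its exploded rows before the groupBy. If such rows should be kept with an empty list, one option (a sketch, not part of the original answer) is to aggregate into a separate dataframe and left-join it back:

from pyspark.sql import functions as F

occurrences = (df
    .select('A', F.posexplode(F.split('A', '(?!$)')))
    .filter('col = 1')
    .groupBy('A')
    .agg(F.collect_list('pos').alias('Occurrences'))
)
result = (df
    .join(occurrences, on='A', how='left')  # rows with no match get a null list
    .withColumn('Occurrences',
                F.coalesce('Occurrences', F.array().cast('array<int>')))  # replace null with []
)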

Here’s a way using the higher-order array functions transform and filter:

from pyspark.sql import functions as F

result = df.withColumn(
    "Occurrences",
    F.filter(
        # map each character to its index when it equals '1', otherwise to null
        F.transform(F.split("A", ""), lambda x, i: F.when(x == "1", i)),
        # then drop the nulls, keeping only the matching indices
        lambda x: x.isNotNull()
    )
)

result.show()
# +----+-----------+
# |   A|Occurrences|
# +----+-----------+
# |1001|     [0, 3]|
# |1110|  [0, 1, 2]|
# |1000|        [0]|
# +----+-----------+
Answered By: blackbishop
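
On Spark versions before 3.1, where transform and filter are not yet exposed as Python functions, the same logic can be written with F.expr and SQL lambda syntax (a rough equivalent, assuming Spark 2.4+ for higher-order functions):

from pyspark.sql import functions as F

result = df.withColumn(
    'Occurrences',
    F.expr("""
        filter(
            transform(split(A, '(?!$)'), (x, i) -> IF(x = '1', i, NULL)),
            x -> x IS NOT NULL
        )
    """)
)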