Map values in ArrayType column with Spark dataframe

Question

I have a Spark dataframe with ArrayType column:

|id|neighbors|
+--+---------+
|a |[b,c]    |
|b |[a,d]    |
|c |[a]      |
|d |[b]      |
+--+---------+

I need to map values in this ArrayType column with initial dataframe.
Desired output:

|id|neighbors    |
+--+-------------+
|a |[[a,d],[a]]  |
|b |[[b,c],[b]]  |
|c |[[b,c]]      |
|d |[[a,d]]      |
+--+-------------+

What is the best way to handle this problem? I have very large amount of data (about 100 million records).

Asked By: Егор Климов

||

Source

Answer 1

You would need to explode the column ‘neighbors’ and then just join. Also, since this is a self join, it is recommended to use alias on dataframes.

Initial df:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('a', ['b', 'c']),
     ('b', ['a', 'd']),
     ('c', ['a']),
     ('d', ['b'])],
    ['id', 'neighbors']
)

Transformation:

df = (
    df.withColumn('_neighbors', F.explode('neighbors')).alias('df1')
    .join(df.alias('df2'), F.col('df1._neighbors') == F.col('df2.id'))
    .groupBy('df1.id')
    .agg(F.collect_list('df2.neighbors').alias('neighbors'))
)
df.show()
# +---+-------------+
# | id|    neighbors|
# +---+-------------+
# |  d|     [[a, d]]|
# |  c|     [[b, c]]|
# |  b|[[b, c], [b]]|
# |  a|[[a, d], [a]]|
# +---+-------------+

Answered By: ZygD

Map values in ArrayType column with Spark dataframe

Question:

Answers: