Map values in ArrayType column with Spark dataframe


I have a Spark dataframe with an ArrayType column:

|id|neighbors|
|a |[b,c]    |
|b |[a,d]    |
|c |[a]      |
|d |[b]      |

I need to replace each value in this array with the array from the row of the original dataframe whose id matches that value.
Desired output:

|id|neighbors    |
|a |[[a,d],[a]]  |
|b |[[b,c],[b]]  |
|c |[[b,c]]      |
|d |[[a,d]]      |

What is the best way to handle this problem? I have a very large amount of data (about 100 million records).


You would need to explode the 'neighbors' column, join the result back to the original dataframe on id, and then collect the matched arrays per id. Since this is a self-join, it is recommended to alias the dataframes.

Initial df:

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('a', ['b', 'c']),
     ('b', ['a', 'd']),
     ('c', ['a']),
     ('d', ['b'])],
    ['id', 'neighbors']
)

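df = (
    # one row per (id, neighbor) pair
    df.withColumn('_neighbors', F.explode('neighbors')).alias('df1')
    # self-join: look up each neighbor's own array by its id
    .join(df.alias('df2'), F.col('df1._neighbors') == F.col('df2.id'))
    # gather the looked-up arrays back into one row per id
    .groupBy('df1.id')
    .agg(F.collect_list('df2.neighbors').alias('neighbors'))
)

df.show()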
# +---+-------------+
# | id|    neighbors|
# +---+-------------+
# |  d|     [[a, d]]|
# |  c|     [[b, c]]|
# |  b|[[b, c], [b]]|
# |  a|[[a, d], [a]]|
# +---+-------------+
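
For intuition, the explode / join / collect steps can be traced in plain Python on the same four sample rows. This is only an illustrative sketch, not Spark code and not something to run on 100 million records; the names `rows`, `exploded`, `lookup`, `joined` and `result` are made up for this example.

```python
from collections import defaultdict

# Same four sample rows as the dataframe above: (id, neighbors)
rows = [('a', ['b', 'c']), ('b', ['a', 'd']), ('c', ['a']), ('d', ['b'])]

# Step 1 (explode): one (id, neighbor) pair per array element
exploded = [(id_, n) for id_, neighbors in rows for n in neighbors]

# Step 2 (self-join on neighbor == id): look up each neighbor's own array.
# Like an inner join, neighbors with no matching id are dropped.
lookup = dict(rows)
joined = [(id_, lookup[n]) for id_, n in exploded if n in lookup]

# Step 3 (groupBy + collect_list): gather the looked-up arrays per id
result = defaultdict(list)
for id_, arr in joined:
    result[id_].append(arr)

for id_ in sorted(result):
    print(id_, result[id_])
```

The plain-Python version keeps the arrays in input order. In Spark, the order of elements returned by collect_list after a shuffle is not guaranteed, so the inner arrays may appear in a different order than shown.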
Answered By: ZygD