Calling a UDF is not working on a Spark DataFrame

Question:

I have a dictionary and a function that I defined, and I registered the function as a SQL UDF:

%%spark
from pyspark.sql.types import StringType

d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'}

def key_to_val(k):
    if k in d:
        return d[k]
    else:
        return "Null"

spark.udf.register('key_to_val', key_to_val, StringType())

And I have a Spark DataFrame that looks like this:

sdf = 
+----+------------+--------------+
|id  |date        |Num           |
+----+------------+--------------+
|I   |2012-01-03  |1             |
|C   |2013-01-11  |2             |
+----+------------+--------------+

I want to apply the registered function to sdf, replacing each value in "id" with its dictionary value when one exists. However, I keep getting an error:

An error was encountered:
'list' object is not callable
Traceback (most recent call last):
TypeError: 'list' object is not callable

The code I tried is:

%%spark
sdf.withColumn('id', key_to_val(sdf.id))

The expected output is:

+----+------------+--------------+
|id  |date        |Num           |
+----+------------+--------------+
|Ice |2012-01-03  |1             |
|Null|2013-01-11  |2             |
+----+------------+--------------+

I then tried the following code:

from pyspark.sql.functions import col, udf
key_to_val_udf = udf(key_to_val)
stocks_sdf.withColumn("org", key_to_val_udf(sdf.id)).show()
An error was encountered:

Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 200, in _batched
    for item in iterator:
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/util.py", line 87, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 2, in ticker_to_name
AttributeError: 'str' object has no attribute 'apply'
Asked By: reksapj


Answers:

You are not calling your UDF the right way. There are two options: either register the UDF and call it inside a spark.sql("...") query, or wrap your function with udf() and call the result inside .withColumn(). Here is your code, fixed:

from pyspark.sql.functions import udf

d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'}

def key_to_val(k):
    if k in d:
        return d[k]
    else:
        return "Null"

# Wrap the plain Python function so it can be used in the DataFrame API.
# The return type defaults to StringType; pass udf(key_to_val, StringType())
# to make it explicit.
key_to_val_udf = udf(key_to_val)

sdf = spark.createDataFrame([['I', '2012-01-03', 1], ['C', '2013-01-11', 2]], schema=['id', 'date', 'Num'])
sdf.withColumn('id', key_to_val_udf(sdf.id)).show()


+----+----------+---+
|  id|      date|Num|
+----+----------+---+
| Ice|2012-01-03|  1|
|Null|2013-01-11|  2|
+----+----------+---+
Answered By: Abdennacer Lachiheb