Calling a UDF is not working on a Spark DataFrame
Question:
I have a dictionary and a function I defined, and I registered the function as a SQL UDF:
%%spark
from pyspark.sql.types import StringType

d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'}

def key_to_val(k):
    if k in d:
        return d[k]
    else:
        return "Null"

spark.udf.register('key_to_val', key_to_val, StringType())
And I have a Spark DataFrame that looks like this:
sdf =
+----+------------+--------------+
|id |date |Num |
+----+------------+--------------+
|I |2012-01-03 |1 |
|C |2013-01-11 |2 |
+----+------------+--------------+
I want to apply the registered function to sdf so that the value in "id" is replaced with the dictionary value when the key exists. However, I keep getting an error.
An error was encountered:
'list' object is not callable
Traceback (most recent call last):
TypeError: 'list' object is not callable
The code I tried is
%%spark
sdf.withColumn('id', key_to_val(sdf.id))
Expected output is
+----+------------+--------------+
|id |date |Num |
+----+------------+--------------+
|Ice |2012-01-03 |1 |
|Null|2013-01-11 |2 |
+----+------------+--------------+
I then tried the following code:
from pyspark.sql.functions import col, udf
key_to_val_udf = udf(key_to_val)
stocks_sdf.withColumn("org", key_to_val_udf(sdf.id)).show()
An error was encountered:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 596, in process
serializer.dump_stream(out_iter, outfile)
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
for obj in iterator:
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/serializers.py", line 200, in _batched
for item in iterator:
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
return lambda *a: f(*a)
File "/mnt1/yarn/usercache/livy/appcache/application_1678821388138_0001/container_1678821388138_0001_01_000008/pyspark.zip/pyspark/util.py", line 87, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 2, in ticker_to_name
AttributeError: 'str' object has no attribute 'apply'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 485, in show
print(self._jdf.showString(n, 20, vertical))
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.PythonException: An exception was thrown from the Python worker (same worker traceback as above).
Answers:
You are not calling your UDF the right way. Either register it and call it by name inside a .sql("...") query, or wrap your function with udf() and call the wrapped version inside .withColumn(). Here is your code, fixed:
from pyspark.sql.functions import udf

d = {'I': 'Ice', 'U': 'UN', 'T': 'Tick'}

def key_to_val(k):
    if k in d:
        return d[k]
    else:
        return "Null"

key_to_val_udf = udf(key_to_val)

sdf = spark.createDataFrame([['I', '2012-01-03', 1], ['C', '2013-01-11', 2]], schema=['id', 'date', 'Num'])
sdf.withColumn('id', key_to_val_udf(sdf.id)).show()
+----+----------+---+
| id| date|Num|
+----+----------+---+
| Ice|2012-01-03| 1|
|Null|2013-01-11| 2|
+----+----------+---+