Pyspark cannot export large dataframe to csv. Session setup incorrect?

Question:

My session in pyspark 2.3:

spark = SparkSession
    .builder
    .appName("test_app")
    .config('spark.executor.instances','4')
    .config('spark.executor.cores', '4')
    .config('spark.executor.memory', '24g')
    .config('spark.driver.maxResultSize', '24g')
    .config('spark.rpc.message.maxSize', '512')
    .config('spark.yarn.executor.memoryOverhead', '10000')
    .enableHiveSupport()
    .getOrCreate()

I work on cloudera with a 32GB RAM session and handle dataframes containing approx. 30,000,000 rows and up to 20 columns. These dataframes consist of int, float and str data. My program is supposed to join several tables, format some data, describe the final result table and export it in csv format. I have issues exporting the data to csv. My approach throws the following error:

>>> final_dataframe.write.csv("export.csv")

22/11/30 15:08:50 216 ERROR TaskSetManager: Task 2 in stage 88.0 failed 4 times; aborting job
22/11/30 15:08:50 219 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 88.0 failed 4 times, most recent failure: Lost task 2.3 in stage 88.0 (TID 9514, bdwrkp124.cda.commerzbank.com, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 331, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 140, in dump_stream
    for obj in iterator:
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 320, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 76, in <lambda>
    return lambda *a: f(*a)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/sql/functions.py", line 42, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:66)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage14.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2039)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:154)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:664)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:664)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:664)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:652)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 253, in main
    process()
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 248, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 331, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 140, in dump_stream
    for obj in iterator:
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/serializers.py", line 320, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/worker.py", line 76, in <lambda>
    return lambda *a: f(*a)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "/hadoop/disk09/hadoop/yarn/local/usercache/cb2rtor/appcache/application_1667977333442_395910/container_e323_1667977333442_395910_01_000003/pyspark.zip/pyspark/sql/functions.py", line 42, in _
    jc = getattr(sc._jvm.functions, name)(col._jc if isinstance(col, Column) else col)
AttributeError: 'NoneType' object has no attribute '_jvm'

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:330)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:83)
    at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.read(PythonUDFRunner.scala:66)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage14.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

What is the problem? How can I fix it? I guess my session setup is not adequate, but I lack the experience to see the issue.

Asked By: Viktor

||

Answers:

I’m not experienced with pyspark, but maybe this SO post can help you in some way. Have you started up your pyspark environment before running this code? The error AttributeError: 'NoneType' object has no attribute '_jvm' seems to hint at something wrong in the setup.

Also, unless you have very specific (and strong) reasons to write this large amount of data to a CSV file I would advise against it. Try writing to a parquet file (I suspect it would be simply final_dataframe.write.parquet("export.parquet") in pyspark)

Answered By: Koedlt

The problem had nothing to do with the session or output format! Yet, my solution only worked with a session with far more RAM. So my guess was wrong, but at least not entirely useless;) I will comment the details of the session in the question collect() or toPandas() on a large DataFrame in pyspark/EMR in the future.

I solved the issue by removing all @udf (pyspark.sql.functions.udf) functions from the pyspark code, which apparently broke the spark dataframe and made any I/O operation impossible. The reason for that was either faulty udf implementation on my side, or an issue with how spark executes udf on its cluster. Either way it seemed that None values were not handled correctly within udf. I then used toPandas() (which needs a lot of memory) to get the data to csv.

Answered By: Viktor

The problem had nothing to do with the session or output format! Yet, my solution only worked with a session with far more RAM. So my guess was wrong, but at least not entirely useless;) I will comment the details of the session in the question collect() or toPandas() on a large DataFrame in pyspark/EMR in the future.

Answered By: Sandeepkumar Chappa