PySpark 3.3.0 DataFrame shows data but writing CSV creates empty file
Question:
Facing a very unusual issue. The DataFrame shows data when I run df.show(),
but when I try to write it as CSV, the operation completes without error yet produces a 0-byte empty file.
Is this a bug? Is there something I'm missing?
–PySpark version
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_352
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark
–Python Version
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
–DataFrame shows data when fetched
>>> result.show()
+--------------------+
| review_keywords|
+--------------------+
| [love, echo]|
| [loved, it]|
|[sometimes, playi...|
|[lot, fun, thing,...|
| [music]|
|[received, echo, ...|
|[without, cellpho...|
|[think, 5th, one,...|
| [looks, great]|
|[love, it, i’ve, ...|
|[sent, 85, year, ...|
|[love, it, learni...|
|[purchased, mothe...|
|[love, love, love, ]|
| [expected]|
|[love, it, wife, ...|
|[really, happy, p...|
|[using, alexa, co...|
|[love, size, 2nd,...|
|[liked, original,...|
+--------------------+
only showing top 20 rows
–however, the write operation creates a 0-byte empty file
>>> result.withColumn('review_keywords', col('review_keywords').cast('string')).write.option("header", "true").mode('overwrite').csv("hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt")
–HDFS file gets created but shows 0 bytes
$ hadoop fs -ls hdfs:///tmp/some_dir/some_other_dir/
Found 2 items
drwxr-xr-x - xyz supergroup 0 2023-03-30 09:18 hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
Answers:
What Spark really does when you call df.write.csv is write a directory, not a single file. As discussed in this SO question, hadoop fs -ls
displays a directory's own disk usage as 0 — the actual data lives in part-* files inside it, which you can see by listing the directory's contents: hadoop fs -ls hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
If you're interested in the total size of the data you've just written, try hadoop fs -du -s hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
(the older -dus form is deprecated). More info on that here.
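If downstream tooling needs one actual CSV file rather than a directory of part files, common options are result.coalesce(1).write.csv(...) (one partition, one part file) or merging the part files after the write — on HDFS typically with hadoop fs -getmerge. A minimal local-filesystem merge sketch (the function name, paths, and keep-one-header logic are illustrative assumptions, not from the question):

```python
import glob
import shutil


def merge_spark_csv_parts(parts_dir: str, out_file: str) -> None:
    """Concatenate Spark's part-*.csv files into a single CSV.

    Assumes each part file starts with the same header row (as produced
    by .option("header", "true")); the header is kept only once.
    """
    part_files = sorted(glob.glob(f"{parts_dir}/part-*.csv"))
    with open(out_file, "w") as out:
        for i, part in enumerate(part_files):
            with open(part) as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # write the header from the first part only
                shutil.copyfileobj(f, out)  # copy the remaining data rows
```

Note that coalesce(1) funnels the whole dataset through a single task, so for large data the post-write merge is usually the safer choice.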