PySpark 3.3.0 DataFrame shows data but writing CSV creates empty file
Question:
Facing a very unusual issue. The DataFrame shows data when I run df.show(),
but when I try to write it as CSV, the operation completes without error yet produces a 0-byte empty file.
Is this a bug? Is there something I'm missing?
–PySpark version
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.0
      /_/
Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_352
Branch HEAD
Compiled by user ubuntu on 2022-06-09T19:58:58Z
Revision f74867bddfbcdd4d08076db36851e88b15e66556
Url https://github.com/apache/spark
–Python Version
Python 3.9.13 (main, Aug 25 2022, 23:26:10)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
–DataFrame shows data when fetched
>>> result.show()
+--------------------+
| review_keywords|
+--------------------+
| [love, echo]|
| [loved, it]|
|[sometimes, playi...|
|[lot, fun, thing,...|
| [music]|
|[received, echo, ...|
|[without, cellpho...|
|[think, 5th, one,...|
| [looks, great]|
|[love, it, i’ve, ...|
|[sent, 85, year, ...|
|[love, it, learni...|
|[purchased, mothe...|
|[love, love, love, ]|
| [expected]|
|[love, it, wife, ...|
|[really, happy, p...|
|[using, alexa, co...|
|[love, size, 2nd,...|
|[liked, original,...|
+--------------------+
only showing top 20 rows
–however, the write operation creates a 0-byte empty file
>>> result.withColumn('review_keywords', col('review_keywords').cast('string')).write.option("header", "true").mode('overwrite').csv("hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt")
–HDFS file gets created but shows 0 bytes
$ hadoop fs -ls hdfs:///tmp/some_dir/some_other_dir/
Found 2 items
drwxr-xr-x - xyz supergroup 0 2023-03-30 09:18 hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
Answers:
What Spark really does when you call df.write.csv is write a directory, not a single file. As discussed in this SO question, hadoop fs -ls
displays a directory's own disk usage as 0 — the actual data lives in part-* files inside it, which you can see by listing the directory's contents: hadoop fs -ls hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
If you're interested in the total size of the data you've just written, try hadoop fs -du -s hdfs:///tmp/some_dir/some_other_dir/word_tokens.txt
(the older -dus form is deprecated). More info on that here.
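If downstream tooling needs one actual CSV file rather than a directory of part files, common options are result.coalesce(1).write.csv(...) (one partition, one part file) or merging the part files after the write — on HDFS typically with hadoop fs -getmerge. A minimal local-filesystem merge sketch (the function name, paths, and keep-one-header logic are illustrative assumptions, not from the question):

```python
import glob
import shutil


def merge_spark_csv_parts(parts_dir: str, out_file: str) -> None:
    """Concatenate Spark's part-*.csv files into a single CSV.

    Assumes each part file starts with the same header row (as produced
    by .option("header", "true")); the header is kept only once.
    """
    part_files = sorted(glob.glob(f"{parts_dir}/part-*.csv"))
    with open(out_file, "w") as out:
        for i, part in enumerate(part_files):
            with open(part) as f:
                header = f.readline()
                if i == 0:
                    out.write(header)  # write the header from the first part only
                shutil.copyfileobj(f, out)  # copy the remaining data rows
```

Note that coalesce(1) funnels the whole dataset through a single task, so for large data the post-write merge is usually the safer choice.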