Moving files from one directory to another directory in HDFS using Pyspark

Question:

I am trying to read all the JSON files from one directory and store them in a Spark DataFrame using the code below (it works fine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.json("hdfs:///user/temp/backup_data/st_in_*/*/*.json", multiLine=True)

But when I try to save the DataFrame back out as multiple files, using the code below:

df.write.json("hdfs:///user/another_dir/to_save_dir/")

it doesn't store the files as expected and throws an error saying that to_save_dir already exists.

I just want to save the files to the destination dir exactly as I read them from the source dir.

Edit:

The problem is: when I read multiple files and want to write them to a directory, what is the procedure in PySpark? I am asking because once Spark loads all the files it creates a single DataFrame, and each file becomes a row in this DataFrame. How should I proceed to create a new file for each of the rows in the DataFrame?

Asked By: Danial Shabbir


Answers:

The error you get is quite clear: the location you're trying to write to already exists. You can choose to overwrite it by specifying the mode:

df.write.mode("overwrite").json("hdfs:///user/another_dir/to_save_dir/")

However, if your intent is to only move files from one location to another in HDFS, you don’t need to read the files in Spark and then write them. Instead, try using Hadoop FS API:

# Access the Hadoop FileSystem API through the JVM gateway
sc = spark.sparkContext
conf = sc._jsc.hadoopConfiguration()
Path = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileUtil = sc._gateway.jvm.org.apache.hadoop.fs.FileUtil

src_path = Path(src_folder)
dest_path = Path(dest_folder)

# Copy from the source FileSystem to the destination FileSystem;
# the fifth argument (deleteSource=True) removes the source after copying,
# so this effectively moves the files.
FileUtil.copy(src_path.getFileSystem(conf),
              src_path,
              dest_path.getFileSystem(conf),
              dest_path,
              True,
              conf)
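For example, assuming the directories from the question (hypothetical values, adjust them to your own layout), you would set the two variables before running the snippet above:

src_folder = "hdfs:///user/temp/backup_data"
dest_folder = "hdfs:///user/another_dir/to_save_dir"

Since deleteSource is set to True, the source directory is removed after the copy, so the call behaves like a move rather than a copy; pass False instead if you want to keep the original files.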
Answered By: blackbishop