How to change csv file name while writing in spark?

Question:

I’m trying to rename the output file in my code:

from pyspark.sql import *
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:100% !important; }</style>"))

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
    
df = spark.read.csv("../work/data2/*.csv", inferSchema=True, header=False)

df.createOrReplaceTempView("iris")
result = spark.sql("select * from iris where _c1 =2 order by _c0 ")
summary=result.describe(['_c10'])
summary.show()
summary.coalesce(1).write.csv("202003/data1_0331.csv")

With .write.csv("202003/data1_0331.csv") in this code, Spark creates a whole folder instead of a single file.

Result

202003/data1_0331.csv/part-00000-3afd3298-a186-4289-8ba3-3bf55d27953f-c000.csv

The result i want is

202003/data1_0331.csv

How do I get the results I want?
I saw a similar solution that used write.csv(summary, file="data1_0331"),
but I got this error:

cannot resolve '`0`' given input columns
Asked By: powpow


Answers:

Spark uses parallelism to speed up computation, so it is normal for it to write multiple files for one CSV; splitting the output also speeds up reading it back.

So if you only use Spark, keep it that way: it will be faster.

However if you really want to save your data as a single CSV file, you can use pandas with something like this:

summary.toPandas().to_csv("202003/data1_0331.csv")
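Note that pandas writes the DataFrame index as an extra column by default; passing index=False avoids that. A minimal self-contained sketch (the frame below just stands in for summary.toPandas(), and its column names are illustrative):

```python
import pandas as pd

# Stand-in for summary.toPandas(); the column names are illustrative.
pdf = pd.DataFrame({"summary": ["count", "mean"], "_c10": ["150", "3.05"]})

# index=False keeps pandas from writing the row index as an extra column.
pdf.to_csv("data1_0331.csv", index=False)
```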

Answered By: Be Chiller Too

You cannot control the name of the output of a Spark write operation.

However, you can always rename it:

from py4j.java_gateway import java_import

java_import(spark._jvm, 'org.apache.hadoop.fs.Path')

fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())

# CSVPath is the output directory passed to write.csv(), ending with '/'
list_status = fs.listStatus(spark._jvm.Path(CSVPath))

file_name = [file.getPath().getName() for file in list_status if file.getPath().getName().startswith('part-')][0]

print(file_name)

fs.rename(spark._jvm.Path(CSVPath + file_name), spark._jvm.Path(CSVPath + "data1_0331.csv"))

This code lists all files in your output path, looks for the file whose name starts with part-, and renames it to the desired name.

Answered By: Haha

While writing a file with PySpark we cannot force the name of the file; the only way is to rename it after writing, with the help of a function like this:

source_path = "your source path"
destination_path = "your destination path"

def rename_file_with_location(source_path, destination_path, file_name):
    files = dbutils.fs.ls(source_path)
    csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
    dbutils.fs.mv(csv_file, destination_path + file_name)
    print("File has been renamed from " + csv_file + " to " + destination_path + file_name)

With the help of this function you can rename the PySpark partitioned CSV files.

Note: this function only works with one CSV file; you can alter it for multiple files by changing the second line of the function, or, if you don’t want to change the code, you can write to a single partition, though that has its own disadvantages.

Writing to a single partition can be done with the .coalesce(1) function.
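Putting coalesce(1) and the rename together on a local filesystem, a minimal sketch (directory and file names are illustrative, and the Spark output directory is simulated here so the snippet is self-contained):

```python
import glob
import os
import shutil

# Simulate what summary.coalesce(1).write.csv("202003/data1_0331_tmp") produces:
# a directory holding a single part-*.csv file (names are illustrative).
out_dir = "202003/data1_0331_tmp"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "part-00000-example-c000.csv"), "w") as f:
    f.write("summary,_c10\ncount,150\n")

# Move the lone part file to the name we actually want, then drop the directory.
part_file = glob.glob(os.path.join(out_dir, "part-*.csv"))[0]
shutil.move(part_file, "202003/data1_0331.csv")
shutil.rmtree(out_dir)
```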
Answered By: Tayyab Vohra