How can I write a parquet file using Spark (pyspark)?
Question:
I’m pretty new to Spark and I’ve been trying to convert a DataFrame to a Parquet file in Spark, but I haven’t had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: ‘RDD’ object has no attribute ‘write’
from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")
df = sc.textFile("/temp/proto_temp.csv")
df.write.parquet("/output/proto.parquet")
Do you know how to make this work?
The spark version that I’m using is Spark 2.0.1 built for Hadoop 2.7.3.
Answers:
The error was due to the fact that the textFile method of SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a Parquet file.
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# read csv
df = spark.read.csv("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.show()
df.write.parquet("output/proto.parquet")
You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax; Koalas is PySpark under the hood.

Here’s the Koalas code:
import databricks.koalas as ks
df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')