Is it possible to override just one column type when using PySpark to read in a CSV?
Question:
I’m trying to use PySpark to read in a CSV file with many columns. The inferSchema option does a great job of inferring the majority of the columns’ data types. If I want to override just one of the column types that was inferred incorrectly, what is the best way to do this?
I have this code working, but it makes PySpark import only the one column that is specified in the schema, which is not what I want.
schema = StructType() \
    .add("column_one_of_many", StringType(), True)

spark.read.format('com.databricks.spark.csv') \
    .option('delimiter', ',') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .schema(schema) \
    .load('dbfs:/FileStore/some.csv')
Is what I’m asking for even possible?
Thank you for your time and guidance 🙂
Answers:
The easier way would be to use .withColumn and cast column_one_of_many to a string.
Example
from pyspark.sql.functions import col
from pyspark.sql.types import *

spark.read.format('com.databricks.spark.csv') \
    .option('delimiter', ',') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('dbfs:/FileStore/some.csv') \
    .withColumn("column_one_of_many", col("column_one_of_many").cast("string"))
Another way would be to define all of the columns in the schema, drop the inferSchema option, and just use the .schema option to read the CSV file.
Or you could read the file in first with inferSchema turned on, modify the resulting schema, then load the CSV again:
from pyspark.sql.types import *

df = spark.read.format('com.databricks.spark.csv') \
    .option('delimiter', ',') \
    .option('header', 'true') \
    .option('inferSchema', 'true') \
    .load('dbfs:/FileStore/some.csv')
For example, let’s change the first column (at index 0) from IntegerType to StringType:
df.schema.fields[0].dataType = StringType()
schema = df.schema
Then, reload the csv with the modified schema:
df = spark.read.format('com.databricks.spark.csv') \
    .option('delimiter', ',') \
    .option('header', 'true') \
    .schema(schema) \
    .load('dbfs:/FileStore/some.csv')
Of course, you can modify any column’s data type this way by selecting the correct index. This method is better than casting in cases where the original type inference would lose information (e.g. ZIP codes that start with a zero).
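To see why casting after inference is lossy: if inferSchema parses a ZIP code column as integers, the leading zero is already gone before .cast("string") ever runs. A quick plain-Python illustration of the point (no Spark needed):

```python
# What inferSchema effectively does to a ZIP code like "02134":
raw_value = "02134"
inferred = int(raw_value)   # inference parses it as the integer 2134

# Casting the inferred value back to string cannot recover the zero:
cast_back = str(inferred)
print(cast_back)            # "2134" -- leading zero lost

# Reading with an explicit StringType schema keeps the original text:
print(raw_value)            # "02134"
```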