Is it possible to override just one column type when using PySpark to read in a CSV?

Question:

I’m trying to use PySpark to read in a CSV file with many columns. The inferSchema option does a good job of inferring the majority of the columns’ data types. If I want to override just one of the column types that was inferred incorrectly, what is the best way to do this?

I have this code working, but it makes PySpark import only the one column that is specified in the schema, which is not what I want.

from pyspark.sql.types import StructType, StringType

schema = StructType() \
    .add("column_one_of_many", StringType(), True)

spark.read.format('com.databricks.spark.csv') \
  .option('delimiter', ',') \
  .option('header', 'true') \
  .option('inferSchema', 'true') \
  .schema(schema) \
  .load('dbfs:/FileStore/some.csv')

Is what I’m asking for even possible?

Thank you for your time and guidance 🙂

Asked By: Zhao Li


Answers:

An easier way would be to use .withColumn and cast column_one_of_many to string.

Example

from pyspark.sql.types import *
from pyspark.sql.functions import col

df = spark.read.format('com.databricks.spark.csv') \
  .option('delimiter', ',') \
  .option('header', 'true') \
  .option('inferSchema', 'true') \
  .load('dbfs:/FileStore/some.csv') \
  .withColumn("column_one_of_many", col("column_one_of_many").cast("string"))

Another way would be to define all of the columns in a schema, drop the inferSchema option, and just use the .schema method to read the csv file.

Answered By: notNull

Or you could read it in first with inferSchema turned on, modify the schema, then load the csv again:

from pyspark.sql.types import *

df = spark.read.format('com.databricks.spark.csv') \
  .option('delimiter', ',') \
  .option('header', 'true') \
  .option('inferSchema', 'true') \
  .load('dbfs:/FileStore/some.csv')

For example, let’s change the first column (indexed by 0) from IntegerType to StringType:

df.schema.fields[0].dataType = StringType()
schema = df.schema

Then, reload the csv with the modified schema:

df = spark.read.format('com.databricks.spark.csv') \
  .option('delimiter', ',') \
  .option('header', 'true') \
  .schema(schema) \
  .load('dbfs:/FileStore/some.csv')

Of course, you can modify any column’s data type this way by selecting the correct index. This method is better than casting in cases where the original type inference loses information (e.g. ZIP codes that start with a zero, which inference reads as integers and strips the leading zero).

Answered By: scottlittle