PySpark: How to judge column type of dataframe
Question:
Suppose we have a dataframe called df. I know there is a way of using df.dtypes. However, I would prefer something similar to

type(123) == int  # note that the int here is not a string

I wonder if there is something like:

type(df.select(<column_name>).collect()[0][1]) == IntegerType

Basically, I want to directly get an object of a class like IntegerType or StringType from the dataframe and then check it.
Thanks!
Answers:
TL;DR Use external data types (plain Python types) to test values, and internal data types (DataType subclasses) to test schema.

First and foremost – you should never use

type(123) == int

The correct way to check types in Python, which handles inheritance, is

isinstance(123, int)
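To see why this matters, note that bool is a subclass of int in Python, so the two checks can disagree:

```python
# bool is a subclass of int, so an equality check on type() misses it,
# while isinstance() handles the inheritance correctly.
flag = True

print(type(flag) == int)      # False – the exact type is bool, not int
print(isinstance(flag, int))  # True  – bool inherits from int
```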
With that out of the way, let's talk about

Basically I want to know the way to directly get the object of the class like IntegerType, StringType from the dataframe and then judge it.

This is not how it works. DataTypes describe the schema (internal representation), not the values. External types are plain Python objects, so if the internal type is IntegerType, the external type is int, and so on, according to the rules defined in the Spark SQL Programming Guide.
The only place where IntegerType (or other DataType) instances exist is in your schema:
from pyspark.sql.types import *

df = spark.createDataFrame([(1, "foo")])

# Internal types: inspect the schema
isinstance(df.schema["_1"].dataType, LongType)
# True
isinstance(df.schema["_2"].dataType, StringType)
# True

# External types: inspect the values
_1, _2 = df.first()
isinstance(_1, int)
# True
isinstance(_2, str)
# True
What about trying:

df.printSchema()

This will print something like:
root
|-- id: integer (nullable = true)
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: integer (nullable = true)
|-- col4: date (nullable = true)
|-- col5: long (nullable = true)
If you need to check the detailed structure under an ArrayType or StructType schema, I'd still prefer using df.dtypes, together with XXXType.simpleString() on the type object, to verify the complex schema more easily.
For example,
import pyspark.sql.types as T
df_dtypes = dict(df.dtypes)
# {'column1': 'array<string>',
# 'column2': 'array<struct<fieldA:string,fieldB:bigint>>'}
# build the expected types to verify the complex schema
column1_require_type = T.ArrayType(T.StringType())
column2_require_type = T.ArrayType(T.StructType([
T.StructField("fieldA", T.StringType()),
T.StructField("fieldB", T.LongType()),
]))
column1_type_string = column1_require_type.simpleString() # array<string>
column2_type_string = column2_require_type.simpleString() # array<struct<fieldA:string,fieldB:bigint>>
# easy verification for complex structure
assert df_dtypes['column1'] == column1_type_string # True
assert df_dtypes['column2'] == column2_type_string # True
I think this is helpful if you need to verify a complex schema. It works for me (I'm using PySpark 3.2).