How to change multiple columns' types in pyspark?
Question:
I am just studying pyspark. I want to change the column types like this:
df1=df.select(df.Date.cast('double'),df.Time.cast('double'),
df.NetValue.cast('double'),df.Units.cast('double'))
As you can see, df is a DataFrame; I select 4 columns and cast all of them to double. Because I use select, all the other columns are dropped.
But what if df has hundreds of columns and I only need to change those 4? I need to keep all the columns. How can I do that?
Answers:
for c in df.columns:
    # add a condition here to pick which columns get cast
    df = df.withColumn(c, df[c].cast('double'))
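The comment in the loop above leaves the condition open. A minimal sketch of one way to fill it in follows; the helper name cast_columns and its parameters are made up for illustration, and it only assumes that df is a PySpark DataFrame (it relies on df.columns, df[c].cast() and df.withColumn(), nothing else).

```python
def cast_columns(df, names, dtype="double"):
    """Cast only the columns listed in `names`; leave every other column alone.

    `cast_columns` is a hypothetical helper, not part of PySpark.
    `df` is assumed to be a pyspark.sql.DataFrame.
    """
    wanted = set(names)
    for c in df.columns:
        if c in wanted:  # the condition the loop's comment asks for
            # withColumn with an existing name replaces that column in place,
            # so the column order and all other columns are preserved
            df = df.withColumn(c, df[c].cast(dtype))
    return df
```

For the question's case this would be called as df1 = cast_columns(df, ["Date", "Time", "NetValue", "Units"]). Names that are not in df.columns are simply ignored. Each withColumn call adds a projection to the plan, so this is probably best kept to a handful of columns.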
Try this:
from pyspark.sql.functions import col
df = df.select([col(column).cast('double') for column in df.columns])
Another way, using selectExpr():
df1 = df.selectExpr("cast(Date as double) Date",
                    "cast(NetValue as string) NetValue")
df1.printSchema()
Using withColumn():
from pyspark.sql.types import DoubleType, StringType
# parentheses let the chained calls span several lines
df1 = (df.withColumn("Date", df["Date"].cast(DoubleType()))
         .withColumn("NetValue", df["NetValue"].cast(StringType())))
df1.printSchema()
See the pyspark.sql.types documentation for the available types.
I understand that you would like to have a non-for-loop answer that preserves the original set of columns whilst only updating a subset. The following should be the answer you were looking for:
from pyspark.sql.functions import col
df = df.select(*(col(c).cast("double").alias(c) for c in subset),*[x for x in df.columns if x not in subset])
where subset is a list of the column names you would like to update.
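Note that the select above puts the cast columns first and the untouched ones after them. If the original column order matters, one option is to build the expressions in order and pass them to selectExpr. The following is only a sketch: build_cast_exprs is a hypothetical helper, the column names in the example call are invented, and it assumes no column name contains a backtick.

```python
def build_cast_exprs(columns, subset, dtype="double"):
    """Return one SQL expression per column, keeping the original order.

    `build_cast_exprs` is a hypothetical helper, not part of PySpark.
    Assumes column names contain no backtick characters.
    """
    targets = set(subset)
    exprs = []
    for name in columns:
        quoted = f"`{name}`"  # backticks guard names with spaces or dots
        if name in targets:
            # cast and alias back to the same name
            exprs.append(f"cast({quoted} as {dtype}) as {quoted}")
        else:
            # pass the column through unchanged
            exprs.append(quoted)
    return exprs


# Invented column names, just to show the shape of the result
exprs = build_cast_exprs(["Id", "Date", "Time", "Label"], ["Date", "Time"])
print(exprs)
```

With a real DataFrame this would be used as df = df.selectExpr(*build_cast_exprs(df.columns, subset)).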