I have a date column in a pyspark dataframe that I want to change the title of and extract only the last 8 characters from while preserving its order

Question:

my dataframe looks like this:

| accountId | income | dateOfOrder 
| 123       | 60000  | 56347264327_01_20200110
| 321       | 52000  | 54346262452_01_20200218

I want to take the header dateOfOrder and change it to acct_order_dt and only use the last 8 characters which are dates in yyyymmdd. I want to preserve the order of this pyspark dataframe.

I am currently using this method but I dont think it’s preserving the order:

sample_data = sample_data.withColumn("acct_order_dt", to_date(substring(col("dateOfOrder"),-8,8), "yyyyMMdd")).drop("dateOfOrder")
Asked By: bmfy131

||

Answers:

To achieve this, you can use the withColumn method and the substring function. Here’s an example code in PySpark:

from pyspark.sql.functions import substring

df = df.withColumn("acct_order_dt", substring(df["dateOfOrder"], -8, 8))
df = df.drop("dateOfOrder")
df = df.selectExpr("accountId", "income", "acct_order_dt")

The first line uses withColumn to add a new column acct_order_dt to the DataFrame. The substring function is used to extract the last 8 characters of the dateOfOrder column and store them in acct_order_dt.

scala apache Spark snippet :
import org.apache.spark.sql.functions._

val df2 = df.withColumn("acct_order_dt", substring(col("dateOfOrder"), -8, 8))
  .drop("dateOfOrder")
  .select("accountId", "income", "acct_order_dt")

since you got the 8 characters you can apply any sql function of your choice to convert your required date format

Answered By: Ram Ghadiyaram