PySpark: change day in datetime column


What is wrong with this code, which tries to change the day of a datetime column?

import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as sparktypes
import datetime

sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)

rdd = sc.parallelize([('a',datetime.datetime(2014, 1, 9, 0, 0)),
                      ('b',datetime.datetime(2014, 1, 27, 0, 0)),
                      ('c',datetime.datetime(2014, 1, 31, 0, 0))])
testdf = sqlcontext.createDataFrame(rdd, ["id", "date"])


This gives a test dataframe:

| id|                date|
|  a|2014-01-09 00:00:...|
|  b|2014-01-27 00:00:...|
|  c|2014-01-31 00:00:...|

 |-- id: string (nullable = true)
 |-- date: timestamp (nullable = true)

Then I define a UDF to change the day of the date column:

def change_day_(date, day):
    return date.replace(day=day)

change_day = sf.udf(change_day_, sparktypes.TimestampType())
testdf.withColumn("PaidMonth", change_day(testdf.date, 1)).show(1)

This raises an error:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(
    at py4j.reflection.ReflectionEngine.getMethod(
    at py4j.Gateway.invoke(
    at py4j.commands.AbstractCommand.invokeMethod(
    at py4j.commands.CallCommand.execute(
Asked By: muon



Thanks to @ArthurTacca’s comment, the trick is to wrap the integer with the pyspark.sql.functions.lit() function, like this:

testdf.withColumn("PaidMonth", change_day(testdf.date, sf.lit(1))).show()

Alternate answers welcome!

Answered By: muon

A UDF that receives multiple arguments expects each of them to be a column. The literal 1 is a plain Python integer, not a column, which is why the call to functions.col fails.

This means you can do one of the following. Either make it a column as suggested in the comments:

testdf.withColumn("PaidMonth", change_day(testdf.date, sf.lit(1))).show(1)

sf.lit(1) creates a literal column whose value is 1 in every row.

Or turn the original function into a higher-order function that returns the function to apply:

def change_day_(day):
    return lambda date: date.replace(day=day)

change_day = sf.udf(change_day_(1), sparktypes.TimestampType())
testdf.withColumn("PaidMonth", change_day(testdf.date)).show(1)

This creates a function with the day fixed at 1, so the integer is passed as an ordinary Python argument when the UDF is built rather than as a column. The resulting UDF is then applied to a single column.
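The closure can be checked in plain Python before it is wrapped in a UDF. The sketch below is not part of the answer above: make_change_day is a hypothetical name, and the None check is an added assumption, since null cells reach a Python UDF as None and date.replace would fail on them.

```python
import datetime

def make_change_day(day):
    """Return a one-argument function that sets the day of a datetime to `day`."""
    def change_day(date):
        # Assumption: null cells arrive as None, so pass them through unchanged.
        if date is None:
            return None
        # Raises ValueError if `day` does not exist in that month (e.g. 31 in February).
        return date.replace(day=day)
    return change_day

first_of_month = make_change_day(1)
result = first_of_month(datetime.datetime(2014, 1, 27, 0, 0))
```

Because the returned function takes a single argument, it could be handed to sf.udf(make_change_day(1), sparktypes.TimestampType()) in the same way as change_day_(1).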

Answered By: Assaf Mendelson

I know it’s a bit late, but in case someone else comes across this: if you prefer built-in functions, you can use date_trunc() along with date_add() to change the day of a datetime column.

# change the day to 1, as in the original question
testdf.withColumn("PaidMonth", sf.date_trunc("mon", sf.col("date")))
# change to another day, e.g. 6 (note that date_add returns a date, not a timestamp)
testdf.withColumn("PaidMonth", sf.date_add(sf.date_trunc("mon", sf.col("date")), 5))

Also useful for date manipulations are last_day() and date_sub().
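To make the arithmetic concrete, here is a plain-Python sketch of what these built-ins compute for a single row. It only mirrors the date logic with the standard library and does not call Spark; the helper names month_start, day_of_month and month_end are hypothetical.

```python
import calendar
import datetime

def month_start(ts):
    # Mirrors date_trunc("mon", ts): first day of the month at midnight.
    return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

def day_of_month(ts, day):
    # Mirrors date_add(date_trunc("mon", ts), day - 1); the result is a date.
    return (month_start(ts) + datetime.timedelta(days=day - 1)).date()

def month_end(ts):
    # Mirrors last_day(ts): last calendar day of the month, as a date.
    days_in_month = calendar.monthrange(ts.year, ts.month)[1]
    return datetime.date(ts.year, ts.month, days_in_month)

ts = datetime.datetime(2014, 1, 27, 0, 0)
start = month_start(ts)        # 2014-01-01 00:00:00
sixth = day_of_month(ts, 6)    # 2014-01-06
end = month_end(ts)            # 2014-01-31
```

Unlike date.replace(day=...), adding an offset to the start of the month never raises for short months; a day that does not exist simply rolls over into the next month.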

Answered By: pmassie