How to extract date from datetime column in polars
Question:
I am trying to move from pandas to polars but I am running into the following issue.
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
Yields the following dataframe:
>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET │
│ 2 ┆ 2010-02-01 01:00:00 CET │
│ 3 ┆ 2010-02-01 02:00:00 CET │
└─────────┴────────────────────────────────┘
As you can see, I transformed the datetime string from UTC to CET succesfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:
df = df.with_columns(
[
pl.col("date").cast(pl.Date).alias("valueDay")
]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date ┆ valueDay │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ date │
╞═════════╪════════════════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 2010-01-31 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘
The valueDay
should be 2010-02-01 for all 3 values.
Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df
or is there a way to chain all of this?
Edit:
I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date
like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
df = df.with_columns(
pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
[
pl.col("datetime").cast(pl.Date).alias("valueDay")
]
)
Yields the following:
>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date ┆ day ┆ month ┆ year ┆ datetime ┆ valueDay │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 ┆ datetime[μs] ┆ date │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘
Answers:
There’s nothing wrong with the cast
function per se. It’s just not intended to be used for truncating the time out of a datetime.
As it happens the function you’re looking for is truncate so this is what you want to do (in the last with_columns
chunk)
pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
).with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
).with_columns(
[
pl.col("date").dt.truncate('1d').alias("valueDay")
]
)
For completeness, in polars 0.16.15+, dt
namespace methods .date()
, .time()
etc. return local datetime attributes. EX:
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00",
],
}
)
df = df.with_columns(pl.col("date").str.strptime(pl.Datetime, fmt="%+")
.dt.convert_time_zone("Europe/Amsterdam"))
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
print(df)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┐
│ integer ┆ date ┆ day ┆ month ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┘
I am trying to move from pandas to polars but I am running into the following issue.
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
Yields the following dataframe:
>>> df
shape: (3, 2)
┌─────────┬────────────────────────────────┐
│ integer ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] │
╞═════════╪════════════════════════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET │
│ 2 ┆ 2010-02-01 01:00:00 CET │
│ 3 ┆ 2010-02-01 02:00:00 CET │
└─────────┴────────────────────────────────┘
As you can see, I transformed the datetime string from UTC to CET succesfully. However, when I try to extract the date (using the accepted answer by the polars author in this thread: https://stackoverflow.com/a/73212748/16332690), it seems to extract the date from the UTC string even though it has been transformed, e.g.:
df = df.with_columns(
[
pl.col("date").cast(pl.Date).alias("valueDay")
]
)
>>> df
shape: (3, 3)
┌─────────┬────────────────────────────────┬────────────┐
│ integer ┆ date ┆ valueDay │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ date │
╞═════════╪════════════════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 2010-01-31 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴────────────┘
The valueDay
should be 2010-02-01 for all 3 values.
Can anyone help me fix this? By the way, what is the best way to optimize this code? Do I constantly have to assign everything to df
or is there a way to chain all of this?
Edit:
I managed to find a quick way around this but it would be nice if the issue above could be addressed. A pandas dt.date
like way to approach this would be nice, I noticed that it is missing over here: https://pola-rs.github.io/polars/py-polars/html/reference/series/timeseries.html
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
)
df = df.with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
)
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
df = df.with_columns(
pl.datetime(year=pl.col("year"), month=pl.col("month"), day=pl.col("day"))
)
df = df.with_columns(
[
pl.col("datetime").cast(pl.Date).alias("valueDay")
]
)
Yields the following:
>>> df
shape: (3, 7)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┬─────────────────────┬────────────┐
│ integer ┆ date ┆ day ┆ month ┆ year ┆ datetime ┆ valueDay │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 ┆ datetime[μs] ┆ date │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╪═════════════════════╪════════════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 ┆ 2010-02-01 00:00:00 ┆ 2010-02-01 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┴─────────────────────┴────────────┘
There’s nothing wrong with the cast
function per se. It’s just not intended to be used for truncating the time out of a datetime.
As it happens the function you’re looking for is truncate so this is what you want to do (in the last with_columns
chunk)
pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00"
]
}
).with_columns(
[
pl.col("date").str.strptime(pl.Datetime, fmt="%Y-%m-%dT%H:%M:%S%z").dt.with_time_zone("Europe/Amsterdam"),
]
).with_columns(
[
pl.col("date").dt.truncate('1d').alias("valueDay")
]
)
For completeness, in polars 0.16.15+, dt
namespace methods .date()
, .time()
etc. return local datetime attributes. EX:
import polars as pl
df = pl.DataFrame(
{
"integer": [1, 2, 3],
"date": [
"2010-01-31T23:00:00+00:00",
"2010-02-01T00:00:00+00:00",
"2010-02-01T01:00:00+00:00",
],
}
)
df = df.with_columns(pl.col("date").str.strptime(pl.Datetime, fmt="%+")
.dt.convert_time_zone("Europe/Amsterdam"))
df = df.with_columns(
[
pl.col("date").dt.day().alias("day"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.year().alias("year"),
]
)
print(df)
┌─────────┬────────────────────────────────┬─────┬───────┬──────┐
│ integer ┆ date ┆ day ┆ month ┆ year │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs, Europe/Amsterdam] ┆ u32 ┆ u32 ┆ i32 │
╞═════════╪════════════════════════════════╪═════╪═══════╪══════╡
│ 1 ┆ 2010-02-01 00:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
│ 2 ┆ 2010-02-01 01:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
│ 3 ┆ 2010-02-01 02:00:00 CET ┆ 1 ┆ 2 ┆ 2010 │
└─────────┴────────────────────────────────┴─────┴───────┴──────┘