Combine multiple datetime string columns into one column in Polars

Question:

I have the following Python code using pandas:

df['EVENT_DATE'] = df.apply(
        lambda row: datetime.date(year=row.iyear, month=row.imonth, day=row.iday).strftime("%Y-%m-%d"), axis=1)

and want to translate it into equivalent Polars code. Does anyone have an idea how to solve this?

Asked By: seb2704


Answers:

Polars' apply passes each row to the lambda as a tuple, so you need to use numerical indices instead of attribute access. Example:

import datetime
import polars as pl

df = pl.DataFrame({"iyear": [2020, 2021],
                   "imonth": [1, 2],
                   "iday": [3, 4]})

df['EVENT_DATE'] = df.apply(
        lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))

In case df contains more columns, or has them in a different order, you could use apply on df[["iyear", "imonth", "iday"]] rather than on df to ensure the indices refer to the right elements.

There may be better ways to achieve this, but this comes closest to the Pandas code.
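For illustration, here is a minimal sketch of that column-subset trick, using the same apply API as the snippet above; the event_id column is hypothetical, added only to make the frame wider:

# Hypothetical wider frame: positional indices into a full row would be off by one
df_wide = pl.DataFrame({"event_id": [10, 11],
                        "iyear": [2020, 2021],
                        "imonth": [1, 2],
                        "iday": [3, 4]})

# Select the date columns first so row[0], row[1], row[2] line up as expected
dates = df_wide[["iyear", "imonth", "iday"]].apply(
        lambda row: datetime.date(year=row[0], month=row[1], day=row[2]).strftime("%Y-%m-%d"))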

On a separate note, my guess is that you don't want to store the dates as strings, but rather as a proper pl.Date. You could modify the code like this:

def days_since_epoch(dt):
    return (dt - datetime.date(1970, 1, 1)).days


df['EVENT_DATE_dt'] = df.apply(
        lambda row: days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2])), return_dtype=pl.Date)

where we first convert the Python date to days since Jan 1, 1970, and then convert to a pl.Date using apply's return_dtype argument. The cast to pl.Date needs an int rather than a Python date, because a pl.Date ultimately stores its data as an int. This is most easily seen by simply accessing the dates:

print(type(df["EVENT_DATE_dt"][0]))  # >>> <class 'int'>
print(type(df["EVENT_DATE_dt"].dt[0]))  # >>> <class 'datetime.date'>

It would be nice if the cast could operate on the Python date directly.
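As a quick sanity check of the epoch arithmetic (the expected values follow directly from Python's datetime, so they are exact):

# 1970-01-01 is day 0 of the Unix epoch, so the next day is day 1
assert days_since_epoch(datetime.date(1970, 1, 2)) == 1
# 18262 days from the epoch to 2020-01-01, plus two more days
assert days_since_epoch(datetime.date(2020, 1, 3)) == 18264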

Edit: regarding the discussion on performance vs. pandas: for both pandas and Polars, you can speed this up further if you have many duplicate rows (same year/month/day) by using a cache to speed up the apply, e.g.:

from functools import lru_cache

@lru_cache
def row_to_date(row):
    return days_since_epoch(datetime.date(year=row[0], month=row[1], day=row[2]))

df['EVENT_DATE_dt'] = df.apply(row_to_date, return_dtype=pl.Date)

This will improve runtime when there are many duplicate entries, at the expense of some memory. If there are no duplicates, it will probably slow you down.
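If you want to see how much the cache actually helps on your data, the functools wrapper exposes hit/miss counters (standard functools behavior, shown here as a sketch):

# Many hits relative to misses means many duplicate (iyear, imonth, iday) rows
print(row_to_date.cache_info())  # CacheInfo(hits=..., misses=..., maxsize=128, currsize=...)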

Answered By: jvz

I will also answer your generic question, and not only your specific use case.

For your specific case, as of Polars version >= 0.10.18, the recommended method to create what you want is the pl.date or pl.datetime expression.

Given this dataframe, pl.date is used to format the date as requested.

import polars as pl

df = pl.DataFrame({
    "iyear": [2001, 2001],
    "imonth": [1, 2],
    "iday": [1, 1]
})


df.with_columns([
    pl.date("iyear", "imonth", "iday").dt.strftime("%Y-%m-%d").alias("fmt")
])

This outputs:

shape: (2, 4)
┌───────┬────────┬──────┬────────────┐
│ iyear ┆ imonth ┆ iday ┆ fmt        │
│ ---   ┆ ---    ┆ ---  ┆ ---        │
│ i64   ┆ i64    ┆ i64  ┆ str        │
╞═══════╪════════╪══════╪════════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-01-01 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-02-01 │
└───────┴────────┴──────┴────────────┘
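If, as discussed in the answer above, you would rather keep a proper pl.Date column than a string, you can simply drop the strftime call; a small sketch reusing the EVENT_DATE name from the question:

df.with_columns([
    # EVENT_DATE gets dtype Date; no Python-level apply is involved
    pl.date("iyear", "imonth", "iday").alias("EVENT_DATE")
])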

Other ways to collect other columns in a single expression

Below is a more generic answer to the main question. We can use a map to get multiple columns as Series, or, if we know we want to format a string column, we can use pl.format. The map offers the most utility.

df.with_columns([
    # string fmt over multiple expressions
    pl.format("{}-{}-{}", "iyear", "imonth", "iday").alias("date"),
    # columnar lambda over multiple expressions
    pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2"),
])

This outputs

shape: (2, 5)
┌───────┬────────┬──────┬──────────┬──────────┐
│ iyear ┆ imonth ┆ iday ┆ date     ┆ date2    │
│ ---   ┆ ---    ┆ ---  ┆ ---      ┆ ---      │
│ i64   ┆ i64    ┆ i64  ┆ str      ┆ str      │
╞═══════╪════════╪══════╪══════════╪══════════╡
│ 2001  ┆ 1      ┆ 1    ┆ 2001-1-1 ┆ 2001-1-1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2001  ┆ 2      ┆ 1    ┆ 2001-2-1 ┆ 2001-2-1 │
└───────┴────────┴──────┴──────────┴──────────┘
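Note that pl.format does not zero-pad, which is why the output above reads 2001-1-1 rather than 2001-01-01. If you need the padded form without a row-wise apply, one option is to pad the pieces yourself (assuming a Polars version that provides str.zfill; otherwise use the pl.date(...).dt.strftime approach shown earlier):

df.with_columns([
    pl.format(
        "{}-{}-{}",
        pl.col("iyear"),
        # cast to Utf8 and left-pad with zeros to width 2
        pl.col("imonth").cast(pl.Utf8).str.zfill(2),
        pl.col("iday").cast(pl.Utf8).str.zfill(2),
    ).alias("date_padded")
])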

Avoid row-wise operations

Though the accepted answer gives the correct result, it's not the recommended way to apply operations over multiple columns in Polars. Accessing rows is tremendously slow: it incurs a lot of cache misses, has to run slow Python bytecode, and kills all parallelization and query optimization.

Note

In this specific case, the map creating string data is not recommended:

pl.map(["iyear", "imonth", "iday"], lambda s: s[0] + "-" + s[1] + "-" + s[2]).alias("date2")

Because of the way memory is laid out, and because we create a new column per string operation, this is actually quite expensive (this applies only to string data). That is why pl.format and pl.concat_str exist.
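For completeness, a sketch of pl.concat_str; note that the separator keyword has been named differently across Polars releases (sep in older versions), so check the signature of your version:

df.with_columns([
    pl.concat_str(
        # explicit casts keep this working on versions that don't auto-cast to Utf8
        [pl.col("iyear").cast(pl.Utf8),
         pl.col("imonth").cast(pl.Utf8),
         pl.col("iday").cast(pl.Utf8)],
        separator="-",
    ).alias("date3")
])

Like pl.format, this builds the result in a single pass instead of materializing a new intermediate string column for every + operation.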

Answered By: ritchie46