Access newly created column in .with_columns() when using polars

Question:

I am new to polars and I am not sure whether I am using .with_columns() correctly.

Here’s a situation I encounter frequently:
There’s a dataframe and in .with_columns(), I apply some operation to a column. For example, I convert some dates from str to date type and then want to compute the duration between start and end date. I’d implement this as follows.

import polars as pl

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.strptime(pl.Date, format="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, format="%d.%m.%Y"),
    ]
).with_columns(
    [
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)

First, I convert the two columns, next I call .with_columns() again.

Something shorter like this does not work:

pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.strptime(pl.Date, format="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, format="%d.%m.%Y"),
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)

Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?

Asked By: Thomas


Answers:

The second .with_columns() call is needed.

From @DeanMacGregor

To elaborate, everything in a context (with_columns in this case) only knows about what’s in the dataframe before the context was called. Each expression in a context is unaware of every other expression in the context. This is by design because all the expressions run in parallel. If you need one expression to know the output of another expression, you need another context.

To shorten the code, you could pass multiple names to pl.col() and use keyword arguments instead of .alias():

(df
 .with_columns(
    pl.col("start", "end").str.strptime(pl.Date, format="%d.%m.%Y"))
 .with_columns(
    duration=pl.col("end") - pl.col("start")))

shape: (2, 3)
┌────────────┬────────────┬──────────────┐
│ start      ┆ end        ┆ duration     │
│ ---        ┆ ---        ┆ ---          │
│ date       ┆ date       ┆ duration[ms] │
╞════════════╪════════════╪══════════════╡
│ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
│ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
└────────────┴────────────┴──────────────┘
Answered By: jqurious