Access newly created column in .with_columns() when using polars
Question:
I am new to polars and I am not sure whether I am using .with_columns() correctly.
Here’s a situation I encounter frequently: there’s a dataframe, and in .with_columns() I apply some operation to a column. For example, I convert some dates from str to date type and then want to compute the duration between the start and end dates. I’d implement this as follows.
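import polars as pl

# Note: older polars releases spell the keyword fmt= instead of format=.
pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        # First context: parse both str columns into dates.
        pl.col("start").str.strptime(pl.Date, format="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, format="%d.%m.%Y"),
    ]
).with_columns(
    [
        # Second context: the columns are dates now, so they can be subtracted.
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)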
First, I convert the two columns, then I call .with_columns() again.
Something shorter like this does not work:
pl.DataFrame(
    {
        "start": ["01.01.2019", "01.01.2020"],
        "end": ["11.01.2019", "01.05.2020"],
    }
).with_columns(
    [
        pl.col("start").str.strptime(pl.Date, format="%d.%m.%Y"),
        pl.col("end").str.strptime(pl.Date, format="%d.%m.%Y"),
        # This expression still sees the original str columns.
        (pl.col("end") - pl.col("start")).alias("duration"),
    ]
)
Is there a way to avoid calling .with_columns() twice and to write this in a more compact way?
Answers:
The second .with_columns() is needed.
From @DeanMacGregor
To elaborate, everything in a context (with_columns in this case) only knows about what’s in the dataframe before the context was called. Each expression in a context is unaware of every other expression in the context. This is by design, because all the expressions run in parallel. If you need one expression to know the output of another expression, you need another context.
You could pass multiple names to .col() and use named args instead of .alias():
# df is the dataframe from the question ("start" and "end" still str)
(
    df
    .with_columns(pl.col("start", "end").str.strptime(pl.Date, format="%d.%m.%Y"))
    .with_columns(duration=pl.col("end") - pl.col("start"))
)
shape: (2, 3)
┌────────────┬────────────┬──────────────┐
│ start      ┆ end        ┆ duration     │
│ ---        ┆ ---        ┆ ---          │
│ date       ┆ date       ┆ duration[ms] │
╞════════════╪════════════╪══════════════╡
│ 2019-01-01 ┆ 2019-01-11 ┆ 10d          │
│ 2020-01-01 ┆ 2020-05-01 ┆ 121d         │
└────────────┴────────────┴──────────────┘