polars – Fill null over Groups
Question:
I am trying to fill null timestamps over groups, my dataframe looks like this
start
stop
group
0
2021-12-08 06:40:53.734941+01:00
2022-05-16 10:16:18.717146+02:00
1
1
2021-12-08 06:40:55.191598+01:00
null
1
2
2021-12-08 10:39:12.421402+01:00
2022-05-16 10:16:19.816922+02:00
2
3
2021-12-08 10:39:12.634873+01:00
2022-05-16 10:16:19.817304+02:00
1
4
2021-12-08 10:49:47.392815+01:00
2022-05-16 10:16:20.178050+02:00
5
The stop timestamps should be imputed from the next start.
My current implantation is:
df = df.sort("start").with_columns(
pl.col("start").fill_null(strategy="backward").over("group")
)
df = df.groupby("group").apply(
lambda df: df.with_columns(
pl.col("stop").fill_null(
pl.col("start").shift(-1)
)
),
schema=None,
)
This current implementation is slow. Is there a faster way to solve this ?
Thanks in advance
Answers:
You can use pl.coalesce()
.
Here is an example:
df.with_columns(
stop = pl.coalesce(
pl.col('stop'), pl.col('start').shift(-1).over('group')
)
)
shape: (5, 4)
┌─────┬──────────────────────────────────┬──────────────────────────────────┬───────┐
│ ┆ start ┆ stop ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════════════════════════════╪══════════════════════════════════╪═══════╡
│ 0 ┆ 2021-12-08 06:40:53.734941+01:00 ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1 │
│ 1 ┆ 2021-12-08 06:40:55.191598+01:00 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 1 │
│ 2 ┆ 2021-12-08 10:39:12.421402+01:00 ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2 │
│ 3 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1 │
│ 4 ┆ 2021-12-08 10:49:47.392815+01:00 ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5 │
└─────┴──────────────────────────────────┴──────────────────────────────────┴───────┘
In the example I used .over() so that it finds the next start within the same group.
I am trying to fill null timestamps over groups, my dataframe looks like this
start | stop | group | |
---|---|---|---|
0 | 2021-12-08 06:40:53.734941+01:00 | 2022-05-16 10:16:18.717146+02:00 | 1 |
1 | 2021-12-08 06:40:55.191598+01:00 | null | 1 |
2 | 2021-12-08 10:39:12.421402+01:00 | 2022-05-16 10:16:19.816922+02:00 | 2 |
3 | 2021-12-08 10:39:12.634873+01:00 | 2022-05-16 10:16:19.817304+02:00 | 1 |
4 | 2021-12-08 10:49:47.392815+01:00 | 2022-05-16 10:16:20.178050+02:00 | 5 |
The stop timestamps should be imputed from the next start.
My current implantation is:
df = df.sort("start").with_columns(
pl.col("start").fill_null(strategy="backward").over("group")
)
df = df.groupby("group").apply(
lambda df: df.with_columns(
pl.col("stop").fill_null(
pl.col("start").shift(-1)
)
),
schema=None,
)
This current implementation is slow. Is there a faster way to solve this ?
Thanks in advance
You can use pl.coalesce()
.
Here is an example:
df.with_columns(
stop = pl.coalesce(
pl.col('stop'), pl.col('start').shift(-1).over('group')
)
)
shape: (5, 4)
┌─────┬──────────────────────────────────┬──────────────────────────────────┬───────┐
│ ┆ start ┆ stop ┆ group │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════════════════════════════╪══════════════════════════════════╪═══════╡
│ 0 ┆ 2021-12-08 06:40:53.734941+01:00 ┆ 2022-05-16 10:16:18.717146+02:00 ┆ 1 │
│ 1 ┆ 2021-12-08 06:40:55.191598+01:00 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 1 │
│ 2 ┆ 2021-12-08 10:39:12.421402+01:00 ┆ 2022-05-16 10:16:19.816922+02:00 ┆ 2 │
│ 3 ┆ 2021-12-08 10:39:12.634873+01:00 ┆ 2022-05-16 10:16:19.817304+02:00 ┆ 1 │
│ 4 ┆ 2021-12-08 10:49:47.392815+01:00 ┆ 2022-05-16 10:16:20.178050+02:00 ┆ 5 │
└─────┴──────────────────────────────────┴──────────────────────────────────┴───────┘
In the example I used .over() so that it finds the next start within the same group.