Is there a way to utilize polars mapping to make this code more efficient?
Question:
I have some polars code that does what I want functionally, but I suspect it is an inefficient implementation at best. I feel there must be some way to achieve the same result with .map(), but I can't figure out how. Any thoughts or suggestions?
Specifically, my data are organized as follows: each column is a location, and each row is a datetime. What I'm trying to do is calculate the maximum count of consecutive non-zero values (which I converted to Booleans because I don't need the magnitude of the value; I just need to know whether it is zero or not). Example data and expected output below:
Example Dummy Data

| Date | Location 1 | Location 2 |
|---|---|---|
| 01-01-23 00:00 | 0 | 1 |
| 01-01-23 01:00 | 1 | 1 |
| 01-01-23 02:00 | 1 | 1 |
| 01-01-23 03:00 | 0 | 1 |
| 01-01-23 04:00 | 1 | 1 |
| 01-01-23 05:00 | 1 | 0 |
| 01-01-23 06:00 | 1 | 0 |

Expected Output:

| Location | Maximum Cumulative Count |
|---|---|
| Location 1 | 3 |
| Location 2 | 5 |
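For reference, here is the dummy data above as a polars frame (a minimal sketch; I keep the dates as plain strings since the timestamps don't matter for the counting):

import polars as pl

# Reconstruction of the wide ("pivoted") example frame.
pivoted_df = pl.DataFrame({
    "Date": [f"01-01-23 {h:02d}:00" for h in range(7)],
    "Location 1": [0, 1, 1, 0, 1, 1, 1],
    "Location 2": [1, 1, 1, 1, 1, 0, 0],
})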
Below is the code I have that is functional, but it feels like it could be improved by someone smarter and more well-versed in polars than I am.
for col in pivoted_df.drop("Date").columns:
    xy_cont_df_a = (
        pivoted_df.select(pl.col(col))
        .with_columns(
            # Only at the last row of each non-zero run ...
            pl.when(
                pl.col(col).cast(pl.Boolean)
                & pl.col(col).cast(pl.Boolean).shift_and_fill(-1, False).is_not()
            ).then(
                # ... record the run length: count rows per "run id", where a
                # run id is the cumulative number of zero/non-zero changes.
                pl.count().over(
                    (
                        pl.col(col).cast(pl.Boolean)
                        != pl.col(col).cast(pl.Boolean).shift()
                    ).cumsum()
                )
            )
        )
        .max()
    )
Answers:
You can do all the columns at once:
columns = pl.exclude("Date")

df.select(
    # Running count of non-zero rows, minus that same count "frozen" at the
    # most recent zero, yields the length of the streak at every row.
    (columns != 0).cumsum()
    - (pl.when(columns == 0)
       .then(columns.cumsum())
       .forward_fill()
       .fill_null(0))
).max()
shape: (1, 2)
┌────────────┬────────────┐
│ Location 1 ┆ Location 2 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞════════════╪════════════╡
│ 3 ┆ 5 │
└────────────┴────────────┘
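To see why this works, it helps to materialize the two pieces for a single column: the first cumsum keeps a running count of non-zero rows, while the when/forward_fill part freezes that count at the most recent zero, so the difference is the streak length at every row. A sketch for Location 1 (the names running and frozen are my own):

df.select([
    pl.col("Location 1"),
    (pl.col("Location 1") != 0).cumsum().alias("running"),  # non-zero rows so far
    pl.when(pl.col("Location 1") == 0)       # at each zero, capture the cumsum
      .then(pl.col("Location 1").cumsum())   # (equals the non-zero count for 0/1 data)
      .forward_fill()                        # carry it forward through the streak
      .fill_null(0)                          # no zeros seen yet -> baseline of 0
      .alias("frozen"),
])

The difference running - frozen is the current streak length at each row, and its maximum is the answer.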
Your example mentions pivoted_df, which suggests you may have used a .pivot to get to this point. If that is the case, there may be a simpler way to get these counts from an earlier step in your current process.
Edit:
If you had a "flat" frame:
>>> df
shape: (14, 3)
┌─────────────────────┬────────────┬───────┐
│ Date ┆ Location ┆ Value │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ str ┆ i64 │
╞═════════════════════╪════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ Location 1 ┆ 0 │
│ 2023-01-01 01:00:00 ┆ Location 1 ┆ 1 │
│ 2023-01-01 02:00:00 ┆ Location 1 ┆ 1 │
│ 2023-01-01 03:00:00 ┆ Location 1 ┆ 0 │
│ … ┆ … ┆ … │
│ 2023-01-01 03:00:00 ┆ Location 2 ┆ 1 │
│ 2023-01-01 04:00:00 ┆ Location 2 ┆ 1 │
│ 2023-01-01 05:00:00 ┆ Location 2 ┆ 0 │
│ 2023-01-01 06:00:00 ┆ Location 2 ┆ 0 │
└─────────────────────┴────────────┴───────┘
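If you only have the wide frame, a .melt gets you to this shape (a sketch; the Location/Value column names are my own choice):

df = (
    pivoted_df
    .melt(id_vars="Date", variable_name="Location", value_name="Value")
    .sort(["Location", "Date"])
)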
One possible approach:
consecutive = (
    # New "run id" each time the non-zero flag changes, counted per location.
    ((pl.col("Value") != 0) != (pl.col("Value") != 0).shift())
    .cumsum()
    .over("Location")
)

# Attach each run's length to its rows, then take the max per location.
# (Runs of zeros get a count as well; in this example the non-zero runs are longest.)
(df.with_columns(pl.count().over(pl.struct(["Location", consecutive])))
   .groupby("Location")
   .agg(pl.max("count"))
)
shape: (2, 2)
┌────────────┬───────┐
│ Location ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════════╪═══════╡
│ Location 1 ┆ 3 │
│ Location 2 ┆ 5 │
└────────────┴───────┘
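The streak arithmetic from the wide-frame answer above also works per group here, which avoids materializing the helper column and the struct window (a sketch; the output column name is my own):

nonzero = pl.col("Value") != 0

# Running non-zero count minus the count frozen at the last zero = streak length.
streak = (
    nonzero.cumsum()
    - pl.when(~nonzero).then(nonzero.cumsum()).forward_fill().fill_null(0)
)

(df.groupby("Location")
   .agg(streak.max().alias("Maximum Cumulative Count")))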