Is there a way to utilize polars mapping to make this code more efficient?

Question:

I have some polars code that does what I want functionally, but I feel it is an inefficient implementation at best. I also feel there must be some way to achieve the same result with .map(), but I can't figure out how. Any thoughts or suggestions?

Specifically, my data are organized as follows: each column is a location, and each row is a datetime. What I'm trying to do is calculate, for each location, the maximum count of consecutive non-zero values (I converted the values to Booleans because I don't need the magnitude, I just need to know whether the value is zero or not). Example data and expected output are below:

Example Dummy Data

Date            Location 1  Location 2
01-01-23 00:00  0           1
01-01-23 01:00  1           1
01-01-23 02:00  1           1
01-01-23 03:00  0           1
01-01-23 04:00  1           1
01-01-23 05:00  1           0
01-01-23 06:00  1           0

Expected Output:

Location    Maximum Cumulative Count
Location 1  3
Location 2  5

Below is the code I have that is functional, but it feels like it could be improved by someone smarter and more well-versed in polars than I am.

import polars as pl

for col in pivoted_df.drop("Date").columns:
    xy_cont_df_a = (
        pivoted_df.select(pl.col(col))
        .with_columns(
            # flag the last row of each non-zero run: the current value is
            # non-zero and the next value (shift by -1) is zero
            pl.when(
                pl.col(col).cast(pl.Boolean)
                & pl.col(col)
                .cast(pl.Boolean)
                .shift_and_fill(-1, False)
                .is_not()
            ).then(
                # length of the run this row belongs to; runs are identified
                # by cumulatively counting zero/non-zero transitions
                pl.count().over(
                    (
                        pl.col(col).cast(pl.Boolean)
                        != pl.col(col).cast(pl.Boolean).shift()
                    ).cumsum()
                )
            )
        )
        .max()
    )
Asked By: bdshoener


Answers:

You can do all the columns at once:

columns = pl.exclude("Date")

df.select(
    (columns != 0).cumsum()
    - (pl.when(columns == 0)
         .then(columns.cumsum())
         .forward_fill()
         .fill_null(0))
).max()
shape: (1, 2)
┌────────────┬────────────┐
│ Location 1 ┆ Location 2 │
│ ---        ┆ ---        │
│ i64        ┆ i64        │
╞════════════╪════════════╡
│ 3          ┆ 5          │
└────────────┴────────────┘
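In case the subtraction is not obvious: the first term is a running count of the non-zero values seen so far, and the when/then/forward_fill term freezes that count at the most recent zero, so their difference drops to 0 at every zero and counts back up inside each non-zero run. Below is a rough single-column sketch of those intermediates; the helper column names ("running", "at_last_zero", "run_length") are made up for illustration, and columns.cumsum() only equals the non-zero count here because the data is 0/1.

import polars as pl

df = pl.DataFrame({"Location 1": [0, 1, 1, 0, 1, 1, 1]})
col = pl.col("Location 1")

df.with_columns(
    [
        # running count of non-zero values seen so far
        (col != 0).cumsum().alias("running"),
        # that count frozen at the most recent zero
        # (null until the first zero appears, hence the fill_null(0))
        pl.when(col == 0)
        .then(col.cumsum())
        .forward_fill()
        .fill_null(0)
        .alias("at_last_zero"),
    ]
).with_columns(
    # 0 at every zero, counts back up within a run; its max (3) is the
    # longest consecutive non-zero stretch
    (pl.col("running") - pl.col("at_last_zero")).alias("run_length")
)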

Your example mentions pivoted_df which suggests you may have used a .pivot to get to this point.

If that is the case, there may be a simpler way to get these counts from an earlier step in your current process.


Edit:

If you had a "flat" frame:

>>> df
shape: (14, 3)
┌─────────────────────┬────────────┬───────┐
│ Date                ┆ Location   ┆ Value │
│ ---                 ┆ ---        ┆ ---   │
│ datetime[ns]        ┆ str        ┆ i64   │
╞═════════════════════╪════════════╪═══════╡
│ 2023-01-01 00:00:00 ┆ Location 1 ┆ 0     │
│ 2023-01-01 01:00:00 ┆ Location 1 ┆ 1     │
│ 2023-01-01 02:00:00 ┆ Location 1 ┆ 1     │
│ 2023-01-01 03:00:00 ┆ Location 1 ┆ 0     │
│ …                   ┆ …          ┆ …     │
│ 2023-01-01 03:00:00 ┆ Location 2 ┆ 1     │
│ 2023-01-01 04:00:00 ┆ Location 2 ┆ 1     │
│ 2023-01-01 05:00:00 ┆ Location 2 ┆ 0     │
│ 2023-01-01 06:00:00 ┆ Location 2 ┆ 0     │
└─────────────────────┴────────────┴───────┘

One possible approach:

consecutive = (
   ((pl.col("Value") != 0) != (pl.col("Value") != 0).shift())
   .cumsum().over("Location")
)

(df.with_columns(pl.count().over(pl.struct(["Location", consecutive])))
   .groupby("Location")
   .agg(pl.max("count"))
)
shape: (2, 2)
┌────────────┬───────┐
│ Location   ┆ count │
│ ---        ┆ ---   │
│ str        ┆ u32   │
╞════════════╪═══════╡
│ Location 1 ┆ 3     │
│ Location 2 ┆ 5     │
└────────────┴───────┘
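For what it's worth, several of the names used above have since been renamed in more recent Polars releases (cumsum → cum_sum, groupby → group_by, pl.count() → pl.len()). A rough self-contained sketch of the same idea against a current version might look like the following; filling the leading null from shift() and filtering out the zero runs before taking the max are my own additions, not part of the original answer.

import polars as pl

# assumes a recent Polars release (cum_sum / group_by / pl.len)
df = pl.DataFrame(
    {
        "Location": ["Location 1"] * 7 + ["Location 2"] * 7,
        "Value": [0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    }
)

non_zero = pl.col("Value") != 0

# new run id whenever the zero/non-zero state flips; fill_null(False) gives
# the first row of each location a proper run id instead of null
run_id = (non_zero != non_zero.shift().fill_null(False)).cum_sum().over("Location")

(
    df.with_columns(pl.len().over(pl.struct(["Location", run_id])).alias("count"))
    .filter(non_zero)  # ignore runs of zeros
    .group_by("Location")
    .agg(pl.max("count"))
)
# Location 1 -> 3, Location 2 -> 5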
Answered By: jqurious