Python Polars Rolling Count

Question:

Polars provides several rolling functions, such as rolling_mean(), rolling_apply() and rolling_max(). However, if I would like to count the number of occurrences of a value in each window, how should that be done?

Let’s say we now have a LazyFrame:

import polars as pl

df = pl.LazyFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05", "2023-01-10", "2023-01-11", "2023-01-12"],
    "Pattern": [True, True, False, True, False, False, True],
})
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ bool    │
╞════════════╪═════════╡
│ 2023-01-01 ┆ true    │
│ 2023-01-02 ┆ true    │
│ 2023-01-03 ┆ false   │
│ 2023-01-05 ┆ true    │
│ 2023-01-10 ┆ false   │
│ 2023-01-11 ┆ false   │
│ 2023-01-12 ┆ true    │
└────────────┴─────────┘

And the desired outcome, for n = 3 and pattern = True, would be:

┌────────────┬─────────┐
│ Date       ┆ Count   │
│ ---        ┆ ---     │
│ str        ┆ int     │
╞════════════╪═════════╡
│ 2023-01-01 ┆ null    │
│ 2023-01-02 ┆ null    │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘

I have tried using rolling_sum() on the Pattern column, but because the column is of Boolean type, it only raises an error.

In pandas, this can be achieved by:

df["Pattern"].apply(lambda x: x == True).rolling(3, min_periods = 0).sum()

What is the proper way of doing this with polars? And how can I generalise the solution to values other than True in a Boolean column, for example False, or to categorical data?

Asked By: SpeedowaGONE

Answers:

Update: A perhaps more polars approach to the min_periods = 0 example:

window_size = 3

df.with_columns(
   pl.sum(
      pl.col("Pattern").shift(n) == True 
      for n in range(window_size))
)
shape: (7, 3)
┌────────────┬─────────┬─────┐
│ Date       ┆ Pattern ┆ sum │
│ ---        ┆ ---     ┆ --- │
│ str        ┆ bool    ┆ u32 │
╞════════════╪═════════╪═════╡
│ 2023-01-01 ┆ true    ┆ 1   │
│ 2023-01-02 ┆ true    ┆ 2   │
│ 2023-01-03 ┆ false   ┆ 2   │
│ 2023-01-05 ┆ true    ┆ 2   │
│ 2023-01-10 ┆ false   ┆ 1   │
│ 2023-01-11 ┆ false   ┆ 1   │
│ 2023-01-12 ┆ true    ┆ 1   │
└────────────┴─────────┴─────┘

You could use .groupby_rolling with an integer column as the index.

pl.arange(0, pl.count()) is one way to generate such a column.

(df
 .with_columns(idx = pl.arange(0, pl.count()))
 .groupby_rolling(index_column="idx", period="3i").agg(
    pl.col("Date").last(),
    (pl.col("Pattern") == True).sum())
)
shape: (7, 3)
┌─────┬────────────┬─────────┐
│ idx ┆ Date       ┆ Pattern │
│ --- ┆ ---        ┆ ---     │
│ i64 ┆ str        ┆ u32     │
╞═════╪════════════╪═════════╡
│ 0   ┆ 2023-01-01 ┆ 1       │
│ 1   ┆ 2023-01-02 ┆ 2       │
│ 2   ┆ 2023-01-03 ┆ 2       │
│ 3   ┆ 2023-01-05 ┆ 2       │
│ 4   ┆ 2023-01-10 ┆ 1       │
│ 5   ┆ 2023-01-11 ┆ 1       │
│ 6   ┆ 2023-01-12 ┆ 1       │
└─────┴────────────┴─────────┘

To produce the null values, you could add a check with pl.when():

(df
 .with_columns(idx = pl.arange(0, pl.count()))
 .groupby_rolling(index_column="idx", period="3i").agg(
    pl.col("Date").last(),
    pl.when(pl.count() > 2)
      .then(
         (pl.col("Pattern") == True).sum()))
)
shape: (7, 3)
┌─────┬────────────┬─────────┐
│ idx ┆ Date       ┆ Pattern │
│ --- ┆ ---        ┆ ---     │
│ i64 ┆ str        ┆ u32     │
╞═════╪════════════╪═════════╡
│ 0   ┆ 2023-01-01 ┆ null    │
│ 1   ┆ 2023-01-02 ┆ null    │
│ 2   ┆ 2023-01-03 ┆ 2       │
│ 3   ┆ 2023-01-05 ┆ 2       │
│ 4   ┆ 2023-01-10 ┆ 1       │
│ 5   ┆ 2023-01-11 ┆ 1       │
│ 6   ┆ 2023-01-12 ┆ 1       │
└─────┴────────────┴─────────┘
Answered By: jqurious

> I have tried using rolling_sum() over the column Pattern, yet since my column is of Boolean type, using such functions would only yield an error.

You can use rolling_sum; you just need to first cast the boolean expression to a numeric type using Expr.cast (False values are converted to 0 and True values to 1).

n = 3
pattern_value = True

res = df.with_columns(
    (pl.col('Pattern') == pattern_value).cast(pl.UInt8).rolling_sum(window_size=n)
)

For pattern_value = True, the comparison (pl.col('Pattern') == pattern_value) is redundant; you can just use pl.col('Pattern').cast(pl.UInt8).

Output:

>>> res

shape: (7, 2)
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ u8      │
╞════════════╪═════════╡
│ 2023-01-01 ┆ null    │
│ 2023-01-02 ┆ null    │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘

> In pandas, this can be achieved by:
>
> df["Pattern"].apply(lambda x: x == True).rolling(3, min_periods = 0).sum()

Note that min_periods = 0 does not produce the output shown in the question: it yields a count for the first two rows instead of null. The Expr.rolling_sum method also accepts a min_periods parameter, which defaults to the window size. For instance:


>>> df.with_columns(
    pl.col('Pattern').cast(pl.UInt8).rolling_sum(window_size=3, min_periods=0)
)

shape: (7, 2)
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ u8      │
╞════════════╪═════════╡
│ 2023-01-01 ┆ 1       │
│ 2023-01-02 ┆ 2       │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘
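To make the effect of min_periods concrete, here is a small pure-Python sketch of the rolling count. The function rolling_count and its parameters are hypothetical and not part of polars or pandas; it is only meant to mirror the window semantics shown in the two tables above.

```python
from typing import Any, List, Optional, Sequence


def rolling_count(
    values: Sequence[Any],
    target: Any,
    window: int,
    min_periods: Optional[int] = None,
) -> List[Optional[int]]:
    """Count occurrences of `target` in a trailing window ending at each row."""
    # Assumption: like rolling_sum above, min_periods defaults to the window size.
    if min_periods is None:
        min_periods = window
    out: List[Optional[int]] = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        current = values[start : i + 1]
        if len(current) < min_periods:
            # Not enough rows in the window yet: emit a null.
            out.append(None)
        else:
            out.append(sum(1 for v in current if v == target))
    return out


pattern = [True, True, False, True, False, False, True]
strict = rolling_count(pattern, True, window=3)
lenient = rolling_count(pattern, True, window=3, min_periods=0)
```

Here strict corresponds to the default behaviour (nulls until the window is full), while lenient corresponds to min_periods=0, where partial windows at the start are counted as they are.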
Answered By: Rodalm