Python Polars Rolling Count
Question:
There are some known rolling functions in polars, namely rolling_mean(), rolling_apply() and rolling_max(). However, if I would like to count the number of occurrences of a value in each window, how should that be done?
Let’s say we now have a LazyFrame:
import polars as pl

df = pl.LazyFrame({
    "Date": ["2023-01-01", "2023-01-02", "2023-01-03", "2023-01-05", "2023-01-10", "2023-01-11", "2023-01-12"],
    "Pattern": [True, True, False, True, False, False, True],
})
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ bool    │
╞════════════╪═════════╡
│ 2023-01-01 ┆ true    │
│ 2023-01-02 ┆ true    │
│ 2023-01-03 ┆ false   │
│ 2023-01-05 ┆ true    │
│ 2023-01-10 ┆ false   │
│ 2023-01-11 ┆ false   │
│ 2023-01-12 ┆ true    │
└────────────┴─────────┘
And the desired outcome, for n = 3 and pattern = True, would be:
┌────────────┬─────────┐
│ Date       ┆ Count   │
│ ---        ┆ ---     │
│ str        ┆ int     │
╞════════════╪═════════╡
│ 2023-01-01 ┆ null    │
│ 2023-01-02 ┆ null    │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘
I have tried using rolling_sum() over the column Pattern, yet since my column is of Boolean type, using such functions would only yield an error.
In pandas, this can be achieved by:
df["Pattern"].apply(lambda x: x == True).rolling(3, min_periods = 0).sum()
What is the proper way of doing it with polars? And how can I generalise the solution to values other than True in a Boolean column, for example False, or to categorical data?
Answers:
Update: A perhaps more polars-like approach to the min_periods = 0 example:
window_size = 3

df.with_columns(
    pl.sum(
        pl.col("Pattern").shift(n) == True
        for n in range(window_size))
)
shape: (7, 3)
┌────────────┬─────────┬─────┐
│ Date       ┆ Pattern ┆ sum │
│ ---        ┆ ---     ┆ --- │
│ str        ┆ bool    ┆ u32 │
╞════════════╪═════════╪═════╡
│ 2023-01-01 ┆ true    ┆ 1   │
│ 2023-01-02 ┆ true    ┆ 2   │
│ 2023-01-03 ┆ false   ┆ 2   │
│ 2023-01-05 ┆ true    ┆ 2   │
│ 2023-01-10 ┆ false   ┆ 1   │
│ 2023-01-11 ┆ false   ┆ 1   │
│ 2023-01-12 ┆ true    ┆ 1   │
└────────────┴─────────┴─────┘
You could use .groupby_rolling with an integer column as the index. pl.arange(0, pl.count()) is one way to generate such a column.
(df
   .with_columns(idx = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="idx", period="3i").agg(
      pl.col("Date").last(),
      (pl.col("Pattern") == True).sum())
)
shape: (7, 3)
┌─────┬────────────┬─────────┐
│ idx ┆ Date       ┆ Pattern │
│ --- ┆ ---        ┆ ---     │
│ i64 ┆ str        ┆ u32     │
╞═════╪════════════╪═════════╡
│ 0   ┆ 2023-01-01 ┆ 1       │
│ 1   ┆ 2023-01-02 ┆ 2       │
│ 2   ┆ 2023-01-03 ┆ 2       │
│ 3   ┆ 2023-01-05 ┆ 2       │
│ 4   ┆ 2023-01-10 ┆ 1       │
│ 5   ┆ 2023-01-11 ┆ 1       │
│ 6   ┆ 2023-01-12 ┆ 1       │
└─────┴────────────┴─────────┘
To produce the null values you could add a check with pl.when():
(df
   .with_columns(idx = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="idx", period="3i").agg(
      pl.col("Date").last(),
      pl.when(pl.count() > 2)
        .then(
           (pl.col("Pattern") == True).sum()))
)
shape: (7, 3)
┌─────┬────────────┬─────────┐
│ idx ┆ Date       ┆ Pattern │
│ --- ┆ ---        ┆ ---     │
│ i64 ┆ str        ┆ u32     │
╞═════╪════════════╪═════════╡
│ 0   ┆ 2023-01-01 ┆ null    │
│ 1   ┆ 2023-01-02 ┆ null    │
│ 2   ┆ 2023-01-03 ┆ 2       │
│ 3   ┆ 2023-01-05 ┆ 2       │
│ 4   ┆ 2023-01-10 ┆ 1       │
│ 5   ┆ 2023-01-11 ┆ 1       │
│ 6   ┆ 2023-01-12 ┆ 1       │
└─────┴────────────┴─────────┘
> I have tried using rolling_sum() over the column Pattern, yet since my column is of Boolean type, using such functions would only yield an error.
You can use rolling_sum; you just need to first cast the boolean expression to a numeric type using Expr.cast (False values are converted to 0 and True values to 1).
n = 3
pattern_value = True

res = df.with_columns(
    (pl.col('Pattern') == pattern_value).cast(pl.UInt8).rolling_sum(window_size=n)
)
For pattern_value = True, using (pl.col('Pattern') == pattern_value) is redundant; you can just use pl.col('Pattern').cast(pl.UInt8).
Output:
>>> res
shape: (7, 2)
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ u8      │
╞════════════╪═════════╡
│ 2023-01-01 ┆ null    │
│ 2023-01-02 ┆ null    │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘
> In pandas, this can be achieved by:
> df["Pattern"].apply(lambda x: x == True).rolling(3, min_periods = 0).sum()
Note that using min_periods = 0 does not produce the desired output shown in the question, because the first two rows are then counted instead of being null. The Expr.rolling_sum method also accepts a min_periods parameter (which, as in pandas, defaults to the window size). For instance:
>>> df.with_columns(
...     pl.col('Pattern').cast(pl.UInt8).rolling_sum(window_size=3, min_periods=0)
... )
shape: (7, 2)
┌────────────┬─────────┐
│ Date       ┆ Pattern │
│ ---        ┆ ---     │
│ str        ┆ u8      │
╞════════════╪═════════╡
│ 2023-01-01 ┆ 1       │
│ 2023-01-02 ┆ 2       │
│ 2023-01-03 ┆ 2       │
│ 2023-01-05 ┆ 2       │
│ 2023-01-10 ┆ 1       │
│ 2023-01-11 ┆ 1       │
│ 2023-01-12 ┆ 1       │
└────────────┴─────────┘
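To make the effect of min_periods concrete, here is a plain-Python sketch of the window logic behind the two outputs above. It is not how polars implements rolling_sum; the rolling_count function is a hypothetical name, and the sketch assumes the data contains no nulls.

```python
def rolling_count(values, target, window_size, min_periods=None):
    """Count occurrences of `target` in a trailing window ending at each row."""
    # Mirror the default described above: require a full window.
    if min_periods is None:
        min_periods = window_size
    counts = []
    for i in range(len(values)):
        # Trailing window: the current row plus up to window_size - 1 rows before it.
        window = values[max(0, i - window_size + 1): i + 1]
        if len(window) < min_periods:
            counts.append(None)  # too few rows in the window yet
        else:
            counts.append(sum(1 for v in window if v == target))
    return counts


pattern = [True, True, False, True, False, False, True]
strict = rolling_count(pattern, True, 3)                  # default min_periods
lenient = rolling_count(pattern, True, 3, min_periods=0)  # partial windows allowed
```

Here strict matches the output with nulls in the first two rows, while lenient matches the min_periods=0 output.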