Accumulating lists in Polars
Question:
Say I have a pl.DataFrame() with 2 columns: the first column contains Date, the second List[str].
import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'],
        ['b', 'c'],
        ['d'],
    ])
])
| Date | Ids |
|---|---|
| 2000 | ['a'] |
| 2001 | ['b', 'c'] |
| 2002 | ['d'] |
Is it possible in Polars to accumulate the List[str] column so that each row contains its own list plus all previous lists? Like so:
| Date | Ids |
|---|---|
| 2000 | ['a'] |
| 2001 | ['a', 'b', 'c'] |
| 2002 | ['a', 'b', 'c', 'd'] |
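For reference, the target result can be pinned down in plain Python, independent of any Polars API. This is only a sketch for checking a candidate solution against; the variable names are illustrative and not part of the original post.

```python
from itertools import accumulate

dates = [2000, 2001, 2002]
ids = [['a'], ['b', 'c'], ['d']]

# accumulate() with list concatenation yields every running prefix:
# each step builds a new list from the previous total plus the current row.
accumulated = list(accumulate(ids, lambda acc, row: acc + row))

for date, acc in zip(dates, accumulated):
    print(date, acc)
```

Any Polars solution should produce the same lists, row for row.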
Answers:
Here is what I have so far:
(
    df.select([
        pl.col('Date'),
        pl.col('Ids').cumulative_eval(pl.element().list()).arr.eval(pl.element().flatten())
    ])
)
If anyone has something better, I’ll give them the answer.
Looks like a rolling groupby?
(df.groupby_rolling(index_column="Date", period=f"{df.height}i")
   .agg(pl.col("Ids").flatten()))

shape: (3, 2)
┌──────┬─────────────────────┐
│ Date ┆ Ids                 │
│ ---  ┆ ---                 │
│ i64  ┆ list[str]           │
╞══════╪═════════════════════╡
│ 2000 ┆ ["a"]               │
│ 2001 ┆ ["a", "b", "c"]     │
│ 2002 ┆ ["a", "b", ... "d"] │
└──────┴─────────────────────┘
The index_column is not particularly relevant for your use case; we're just using Date here because it is an integer column.
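Why the choice of index still matters can be sketched in plain Python. The sketch assumes the default window of .groupby_rolling, which for an index value t covers the interval (t - period, t]. The helper rolling_flatten is hypothetical and only mimics that rule; it is not a Polars function, and the gapped years are made-up values.

```python
def rolling_flatten(index, lists, period):
    """For each index value t, concatenate the lists of all rows whose
    index falls in (t - period, t]. Hypothetical helper, not Polars API."""
    out = []
    for t in index:
        window = [row for i, row in zip(index, lists) if t - period < i <= t]
        out.append([x for row in window for x in row])
    return out

ids = [['a'], ['b', 'c'], ['d']]
height = len(ids)

# Consecutive years: a period of `height` reaches back to the first row.
dense = rolling_flatten([2000, 2001, 2002], ids, height)

# Years with gaps (made-up values): the same period misses earlier rows.
sparse = rolling_flatten([2000, 2005, 2010], ids, height)

# A 0-based row count never spans more than `height - 1`, so nothing is dropped.
by_row = rolling_flatten([0, 1, 2], ids, height)
```

Under that assumption, gaps in Date larger than the period would silently shrink the window, whereas a row count cannot have such gaps.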
Instead, a common approach is to add a "row count" column and use that as the index. It needs to be cast (here to pl.Int64) in order to be used with .groupby_rolling:
(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.exclude("row_nr")))

shape: (3, 3)
┌────────┬────────────────────┬────────────────────────────┐
│ row_nr ┆ Date               ┆ Ids                        │
│ ---    ┆ ---                ┆ ---                        │
│ i64    ┆ list[i64]          ┆ list[list[str]]            │
╞════════╪════════════════════╪════════════════════════════╡
│ 0      ┆ [2000]             ┆ [["a"]]                    │
│ 1      ┆ [2000, 2001]       ┆ [["a"], ["b", "c"]]        │
│ 2      ┆ [2000, 2001, 2002] ┆ [["a"], ["b", "c"], ["d"]] │
└────────┴────────────────────┴────────────────────────────┘
You can use .last() on columns where you want only the original value:
(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘
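What the two aggregations do per window can be illustrated without Polars. In this sketch, windows is written out by hand to match the nested output shown earlier; it is illustrative data, not something produced by the library.

```python
from itertools import chain

# One entry per row: the (Date, Ids) values collected in that row's window.
windows = [
    ([2000], [['a']]),
    ([2000, 2001], [['a'], ['b', 'c']]),
    ([2000, 2001, 2002], [['a'], ['b', 'c'], ['d']]),
]

result = [
    # .last() analogue: keep only the newest Date in the window.
    # .flatten() analogue: splice the window's lists into one flat list.
    (dates[-1], list(chain.from_iterable(id_lists)))
    for dates, id_lists in windows
]

for date, ids in result:
    print(date, ids)
```

Because each window ends at the current row, taking the last Date recovers the row's original value.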
Using pl.arange() is another way to get a "row count". It produces an i64 column, so the .cast can be skipped, which some prefer:
(df.with_columns(row_nr = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘