Accumulating lists in Polars

Question

Say I have a pl.DataFrame() with 2 columns: the first column, Date, contains integers, and the second, Ids, contains List[str].

import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'], 
        ['b', 'c'], 
        ['d'], 
    ])
])

Date	Ids
2000	`['a']`
2001	`['b', 'c']`
2002	`['d']`

Is it possible in Polars to accumulate the List[str] column so that each row contains its own list plus all previous lists? Like so:

Date	Ids
2000	`['a']`
2001	`['a', 'b', 'c']`
2002	`['a', 'b', 'c', 'd']`

Asked By: Neotenic Primate

Answer 1

Here is what I have so far:

(
    df.select([
        pl.col('Date'),
        # Collect the lists of all rows up to the current one, then flatten them into a single list
        pl.col('Ids').cumulative_eval(pl.element().list()).arr.eval(pl.element().flatten())
    ])
)
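For reference, the result this expression is aiming for can be written in plain Python with itertools.accumulate. This is only a sketch for checking results on small data, not a Polars solution, and the variable names are made up for illustration.

```python
from itertools import accumulate
from operator import add

# Same data as the Ids column in the question
ids = [['a'], ['b', 'c'], ['d']]

# accumulate(..., add) concatenates each list onto the running prefix
cumulative_ids = list(accumulate(ids, add))

print(cumulative_ids)  # [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]
```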

If anyone has something better, I’ll give them the answer.

Answered By: Neotenic Primate

Answer 2

Looks like a rolling groupby?

(df.groupby_rolling(index_column="Date", period=f"{df.height}i")
   .agg(pl.col("Ids").flatten()))

shape: (3, 2)
┌──────┬─────────────────────┐
│ Date ┆ Ids                 │
│ ---  ┆ ---                 │
│ i64  ┆ list[str]           │
╞══════╪═════════════════════╡
│ 2000 ┆ ["a"]               │
│ 2001 ┆ ["a", "b", "c"]     │
│ 2002 ┆ ["a", "b", ... "d"] │
└──────┴─────────────────────┘
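To see why a period of df.height index units gives a cumulative result, here is a rough plain-Python sketch of the window rule. It assumes the default window, which is closed on the right, i.e. (t - period, t]. rolling_flatten is a hypothetical helper written for this illustration, not a Polars function.

```python
def rolling_flatten(index, lists, period):
    """For each index value t, flatten every list whose index lies in (t - period, t]."""
    out = []
    for t in index:
        window = [lst for i, lst in zip(index, lists) if t - period < i <= t]
        out.append([item for lst in window for item in lst])
    return out

dates = [2000, 2001, 2002]
ids = [['a'], ['b', 'c'], ['d']]

# period equals the number of rows, as in period=f"{df.height}i"
result = rolling_flatten(dates, ids, period=len(dates))
print(result)  # [['a'], ['a', 'b', 'c'], ['a', 'b', 'c', 'd']]

# With gaps in the index, earlier rows fall outside the window
gappy = rolling_flatten([2000, 2005, 2010], ids, period=3)
print(gappy)  # [['a'], ['b', 'c'], ['d']]
```

The second call shows that the window is measured in index values rather than rows, so the accumulation depends on how the index column is spaced.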

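The index_column is incidental for your use-case. Date only works here because it is an integer column with consecutive values, so a window of df.height index units always reaches back to the first row.

A more general approach is to add a "row count" column and roll over that instead.

.with_row_count() produces an unsigned integer column, so it needs to be cast to Int64 before it can be used as the index_column of .groupby_rolling: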

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.exclude("row_nr")))

shape: (3, 3)
┌────────┬────────────────────┬────────────────────────────┐
│ row_nr ┆ Date               ┆ Ids                        │
│ ---    ┆ ---                ┆ ---                        │
│ i64    ┆ list[i64]          ┆ list[list[str]]            │
╞════════╪════════════════════╪════════════════════════════╡
│ 0      ┆ [2000]             ┆ [["a"]]                    │
│ 1      ┆ [2000, 2001]       ┆ [["a"], ["b", "c"]]        │
│ 2      ┆ [2000, 2001, 2002] ┆ [["a"], ["b", "c"], ["d"]] │
└────────┴────────────────────┴────────────────────────────┘

Every column is aggregated into a list per window. You can use .last() on columns where you want only the current row's original value:

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘

Using pl.arange() is another way to get a "row count". It produces an Int64 column, which allows skipping the .cast, and some prefer it for that reason.

(df.with_columns(row_nr = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))

shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr ┆ Date ┆ Ids                 │
│ ---    ┆ ---  ┆ ---                 │
│ i64    ┆ i64  ┆ list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      ┆ 2000 ┆ ["a"]               │
│ 1      ┆ 2001 ┆ ["a", "b", "c"]     │
│ 2      ┆ 2002 ┆ ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘

Answered By: jqurious
