Accumulating lists in Polars

Question:

Say I have a pl.DataFrame with two columns: the first, Date, holds integers, and the second, Ids, holds List[str].

import polars as pl

df = pl.DataFrame([
    pl.Series('Date', [2000, 2001, 2002]),
    pl.Series('Ids', [
        ['a'], 
        ['b', 'c'], 
        ['d'], 
    ])
])
Date Ids
2000 ['a']
2001 ['b', 'c']
2002 ['d']

Is it possible to accumulate the List[str] column so that each row contains itself and all previous lists in Polars? Like so:

Date Ids
2000 ['a']
2001 ['a', 'b', 'c']
2002 ['a', 'b', 'c', 'd']
Asked By: Neotenic Primate

Answers:

Here is what I have so far:

(
    df.select([
        pl.col('Date'),
        # gather every list seen so far, then flatten each gathered list of lists
        pl.col('Ids').cumulative_eval(pl.element().list()).arr.eval(pl.element().flatten())
    ])
)
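
As a rough mental model of the expression above, the cumulative evaluation can be read as "for each row, take the prefix of the column up to and including that row, then flatten it". The plain-Python sketch below mirrors that idea without Polars. The helper name `cumulative_flatten` is hypothetical and exists only for illustration; it describes the intended result, not how Polars computes it.

```python
from itertools import chain

# Same data as the Ids column in the question.
ids = [["a"], ["b", "c"], ["d"]]


def cumulative_flatten(rows):
    """Hypothetical helper: flatten each growing prefix of `rows`."""
    out = []
    for end in range(1, len(rows) + 1):
        prefix = rows[:end]  # rows 0 .. end-1
        out.append(list(chain.from_iterable(prefix)))
    return out


result = cumulative_flatten(ids)
print(result)
```

Every prefix is re-flattened from scratch, so the total work grows quadratically with the number of rows, which is worth keeping in mind for long columns.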

If anyone has something better, I’ll give them the answer.

Answered By: Neotenic Primate

Looks like a rolling groupby?

(df.groupby_rolling(index_column="Date", period=f"{df.height}i")
   .agg(pl.col("Ids").flatten()))
shape: (3, 2)
┌──────┬─────────────────────┐
│ Date | Ids                 │
│ ---  | ---                 │
│ i64  | list[str]           │
╞══════╪═════════════════════╡
│ 2000 | ["a"]               │
│ 2001 | ["a", "b", "c"]     │
│ 2002 | ["a", "b", ... "d"] │
└──────┴─────────────────────┘

The index_column is not particularly relevant for your use case; Date is used here only because it is an integer column.
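
To see why the choice of index still matters, here is a plain-Python sketch of the window each row gets. It assumes the default right-closed window `(t - period, t]` as described in the Polars documentation for rolling groupbys; the function `rolling_windows` and the `gappy` index are hypothetical and only illustrate the arithmetic.

```python
def rolling_windows(index, period):
    """For each index value t, list the row positions whose index lies in (t - period, t]."""
    return [
        [pos for pos, value in enumerate(index) if t - period < value <= t]
        for t in index
    ]


consecutive = [2000, 2001, 2002]  # the Date column from the question
gappy = [2000, 2005, 2010]        # hypothetical dates with gaps wider than the period
row_nr = [0, 1, 2]                # a plain row count

period = 3  # df.height for this frame

print(rolling_windows(consecutive, period))  # [[0], [0, 1], [0, 1, 2]]
print(rolling_windows(gappy, period))        # [[0], [1], [2]]
print(rolling_windows(row_nr, period))       # [[0], [0, 1], [0, 1, 2]]
```

With consecutive integers, a period equal to the frame height reaches back to the first row. With gaps in the index, the same period no longer covers the earlier rows, whereas a row count always does.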

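Instead, a common approach is to add a "row count" column and use that as the index.

.with_row_count() produces an unsigned integer column, so it needs to be cast to a signed integer before it can be used with .groupby_rolling: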

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.exclude("row_nr")))
shape: (3, 3)
┌────────┬────────────────────┬────────────────────────────┐
│ row_nr | Date               | Ids                        │
│ ---    | ---                | ---                        │
│ i64    | list[i64]          | list[list[str]]            │
╞════════╪════════════════════╪════════════════════════════╡
│ 0      | [2000]             | [["a"]]                    │
│ 1      | [2000, 2001]       | [["a"], ["b", "c"]]        │
│ 2      | [2000, 2001, 2002] | [["a"], ["b", "c"], ["d"]] │
└────────┴────────────────────┴────────────────────────────┘

You can use .last() on columns where you want only the original value:

(df.with_row_count()
   .with_columns(pl.col("row_nr").cast(pl.Int64))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))
shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr | Date | Ids                 │
│ ---    | ---  | ---                 │
│ i64    | i64  | list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      | 2000 | ["a"]               │
│ 1      | 2001 | ["a", "b", "c"]     │
│ 2      | 2002 | ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘
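
The difference between the aggregations in the last two outputs can be sketched in plain Python. This is only an illustration of what each aggregation returns per window, assuming window i covers rows 0 through i (which is what a period equal to the frame height gives with a row count index). The variable names are made up for the example.

```python
from itertools import chain

dates = [2000, 2001, 2002]
ids = [["a"], ["b", "c"], ["d"]]

# Window i covers rows 0..i.
windows = [range(i + 1) for i in range(len(dates))]

# No explicit aggregation: every value in the window is collected into a list.
dates_collected = [[dates[j] for j in w] for w in windows]

# .last()-style aggregation: keep only the final value of each window.
dates_last = [dates[w[-1]] for w in windows]

# .flatten()-style aggregation: concatenate the window's lists into one list.
ids_flat = [list(chain.from_iterable(ids[j] for j in w)) for w in windows]

print(dates_collected)
print(dates_last)
print(ids_flat)
```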

Using pl.arange() is another way to get a "row count". It produces an Int64 column directly, so the .cast can be skipped, which some prefer.

(df.with_columns(row_nr = pl.arange(0, pl.count()))
   .groupby_rolling(index_column="row_nr", period=f"{df.height}i")
   .agg(pl.col("Date").last(), pl.col("Ids").flatten()))
shape: (3, 3)
┌────────┬──────┬─────────────────────┐
│ row_nr | Date | Ids                 │
│ ---    | ---  | ---                 │
│ i64    | i64  | list[str]           │
╞════════╪══════╪═════════════════════╡
│ 0      | 2000 | ["a"]               │
│ 1      | 2001 | ["a", "b", "c"]     │
│ 2      | 2002 | ["a", "b", ... "d"] │
└────────┴──────┴─────────────────────┘
Answered By: jqurious