Python Polars Window Function With Literal Type

Question:

Say I have a DataFrame with an id column like this:

┌─────┐
│ id  │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 1   │
│ 1   │
│ 2   │
│ 2   │
│ 3   │
│ 3   │
└─────┘

I want to aggregate a running count over the id column, giving this result:

┌─────┬───────┐
│ id  ┆ count │
│ --- ┆ ---   │
│ i64 ┆ i64   │
╞═════╪═══════╡
│ 1   ┆ 1     │
│ 1   ┆ 2     │
│ 1   ┆ 3     │
│ 2   ┆ 1     │
│ 2   ┆ 2     │
│ 3   ┆ 1     │
│ 3   ┆ 2     │
└─────┴───────┘

My attempt involved creating a dummy column, which I think produced the desired result but seems a bit hacky.

(
    df.with_columns(
        pl.lit(1).alias("ones")
    )
    .with_columns(
        (pl.col("ones").cumsum().over("id")).alias("count")
    )
    .drop("ones")
)

However, when I try this:

(
    df.with_columns(
        (pl.lit(1).cumsum().over("id")).alias("count")
    )
)

I get the error "ComputeError: the length of the window expression did not match that of the group".

Is there a better way to do this? What am I missing in my attempt above?

Asked By: bkw1491


Answers:

cumcount does the job once you add 1, since it starts counting at 0 (use whatever offset you want, of course).

There is also arange:

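df.with_columns(
    # cumcount is zero-based within each id group, so shift it by 1
    count = 1 + pl.col('id').cumcount().over('id'),
    # arange builds 1..n, where n is the number of rows in the group
    count2 = pl.arange(1, 1 + pl.count()).over('id')
)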
Answered By: Wayoshi