Multiple aggregations on multiple columns in Python polars

Question:

Checking out how to implement binning with Python polars, I can easily calculate aggregates for individual columns:

import polars as pl
import numpy as np

t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
df = df.with_column((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")

df.groupby_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
┌─────────────────────┬──────┐
│ datetime            ┆ v0   │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ...                 ┆ ...  │

or calculate multiple aggregations like

df.groupby_dynamic("datetime", every="10s").agg([
    pl.col("v0").mean().alias("v0_binmean"),
    pl.col("v0").count().alias("v0_bincount")
])
┌─────────────────────┬────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v0_bincount │
│ ---                 ┆ ---        ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ u32         │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 5           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...        ┆ ...         │

or calculate one aggregation for multiple columns like

cols = [c for c in df.columns if "datetime" not in c]
df.groupby_dynamic("datetime", every="10s").agg([
     pl.col(f"{c}").mean().alias(f"{c}_binmean")
     for c in cols
])
┌─────────────────────┬────────────┬────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ f64        ┆ f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 ┆ ...        ┆ ...        │

However, combining both approaches fails!

df.groupby_dynamic("datetime", every="10s").agg([
    [
    pl.col(f"{c}").mean().alias(f"{c}_binmean"),
    pl.col(f"{c}").count().alias(f"{c}_bincount")
    ]
    for c in cols
])
Traceback (most recent call last):

  File "/tmp/ipykernel_2666/421808935.py", line 2, in <cell line: 2>
    df.groupby_dynamic("datetime", every="10s").agg([

  File ".../3.10.9/lib/python3.10/site-packages/polars/internals/dataframe/groupby.py", line 924, in agg
    .agg(aggs)

  File ".../3.10.9/lib/python3.10/site-packages/polars/internals/lazyframe/groupby.py", line 55, in agg
    raise TypeError(msg)

TypeError: expected 'Expr | Sequence[Expr]', got '<class 'list'>'

Is there a "polarustic" approach to calculate multiple statistical parameters for multiple (all) columns of the dataframe in one go?

related, pandas-specific: Python pandas groupby aggregate on multiple columns

Asked By: FObersteiner


Answers:

There are different ways of selecting multiple columns "at once" in polars:

>>> df.select(pl.all()).columns
['v0', 'v1', 'datetime']
>>> df.select(pl.col(["v0", "v1"])).columns          # by name(s)
['v0', 'v1']
>>> df.select(pl.all().exclude("datetime")).columns  # by exclusion
['v0', 'v1']
>>> df.select(pl.exclude("datetime")).columns        # we can omit `.all()`
['v0', 'v1']

.suffix() can be used to append a suffix to each output column name:

>>> df.select(pl.exclude("datetime").mean().suffix("_binmean"))
shape: (1, 2)
┌────────────┬────────────┐
│ v0_binmean | v1_binmean │
│ ---        | ---        │
│ f64        | f64        │
╞════════════╪════════════╡
│ 49.0       | 49.0       │
└────────────┴────────────┘

This means your example can be written as:

df.groupby_dynamic("datetime", every="10s").agg(
   pl.exclude("datetime").mean().suffix("_binmean")
)
shape: (10, 3)
┌─────────────────────┬────────────┬────────────┐
│ datetime            | v0_binmean | v1_binmean │
│ ---                 | ---        | ---        │
│ datetime[μs]        | f64        | f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       │
├─────────────────────┼────────────┼────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       │

Multiple aggregations:

df.groupby_dynamic("datetime", every="10s").agg([
   pl.exclude("datetime").mean().suffix("_binmean"),
   pl.exclude("datetime").count().suffix("_bincount")
])
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ ---                 | ---        | ---        | ---         | ---         │
│ datetime[μs]        | f64        | f64        | u32         | u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0       | 34.0       | 5           | 5           │

The nested comprehension in your example builds a list of lists, but .agg() expects a flat sequence of expressions (hence the TypeError). To use list comprehensions, you need to combine the two into a single flat list:

df.groupby_dynamic("datetime", every="10s").agg(
   [pl.col(c).mean().alias(f"{c}_binmean") for c in cols] +
   [pl.col(c).count().alias(f"{c}_bincount") for c in cols]
)
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            | v0_binmean | v1_binmean | v0_bincount | v1_bincount │
│ ---                 | ---        | ---        | ---         | ---         │
│ datetime[μs]        | f64        | f64        | u32         | u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 | 4.0        | 4.0        | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:10 | 14.0       | 14.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:20 | 24.0       | 24.0       | 5           | 5           │
├─────────────────────┼────────────┼────────────┼─────────────┼─────────────┤
│ 2022-10-30 00:00:30 | 34.0       | 34.0       | 5           | 5           │
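As an alternative to concatenating two comprehensions, a doubly-nested comprehension produces the flat list in one pass. A minimal sketch of the flattening pattern, using plain strings as stand-ins for the polars expressions:

```python
from itertools import chain

cols = ["v0", "v1"]

# the failing version builds a list of lists, which .agg() rejects
nested = [[f"{c}_binmean", f"{c}_bincount"] for c in cols]

# a double-for comprehension yields one flat list instead
flat = [name for c in cols for name in (f"{c}_binmean", f"{c}_bincount")]

# equivalently, flatten the nested list after the fact
flat_chain = list(chain.from_iterable(nested))

print(flat)  # ['v0_binmean', 'v0_bincount', 'v1_binmean', 'v1_bincount']
```

The same `for c in cols for expr in (...)` shape works with the actual expressions, e.g. `[expr for c in cols for expr in (pl.col(c).mean().alias(f"{c}_binmean"), pl.col(c).count().alias(f"{c}_bincount"))]`.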
Answered By: jqurious