Python Polars group by on both time and categorical values
Question:
There is a polars DataFrame which consists of the 3 fields listed below.
user_id | date | part_of_day |
---|---|---|
i32 | datetime[ns] | cat |
173367 | 2021-08-03 00:00:00 | "day" |
132702 | 2021-10-28 00:00:00 | "evening" |
100853 | 2021-07-29 00:00:00 | "night" |
305810 | 2021-08-24 00:00:00 | "day" |
305239 | 2021-08-13 00:00:00 | "day" |
My task is to calculate the number of unique users for each week and time of day. Polars provides three grouping methods as of version 0.16.8:
- groupby: simple grouping by columns or expressions.
- groupby_dynamic: accepts only a datetime column as the grouper.
- groupby_rolling: likewise accepts only a datetime column as the grouper.
groupby_dynamic seems like the best fit for this task, but it does not appear to allow additional columns as groupers. So my question is: how can I accomplish this with Polars?
For reference, here is the Pandas code I would have used to solve this problem:
(
    df
    .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
    .user_id
    .nunique()
)
Answers:
It’s helpful if you provide your example as code:
import io
import pandas as pd
import polars as pl
csv = """
user_id,date,part_of_day
173367,2021-08-03T00:00:00.000000,day
132702,2021-10-28T00:00:00.000000,evening
100853,2021-07-29T00:00:00.000000,night
305810,2021-08-24T00:00:00.000000,day
305239,2021-08-13T00:00:00.000000,day
"""
df = pl.read_csv(
    io.StringIO(csv),
    try_parse_dates=True,
    dtypes={"part_of_day": pl.Categorical},
)
Your pandas code:
(df.with_columns(pl.col("part_of_day").cast(pl.Utf8))
   .to_pandas()
   .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
   .user_id
   .nunique()
   .reset_index())
date part_of_day user_id
0 2021-08-01 night 1
1 2021-08-08 day 1
2 2021-08-15 day 1
3 2021-08-29 day 1
4 2021-10-31 evening 1
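Those `date` labels are each row's week-ending Sunday, which is how pandas bins dates for `freq="1W"`. A minimal stdlib-only sketch of that labelling (the helper name `week_ending_sunday` is hypothetical, for illustration):

```python
from datetime import date, timedelta

def week_ending_sunday(d: date) -> date:
    """Label a date by the Sunday that ends its week,
    mirroring pandas' freq="1W" bins."""
    # Monday is weekday 0, Sunday is 6; advance to the next
    # Sunday (or stay put if d is already a Sunday).
    return d + timedelta(days=(6 - d.weekday()) % 7)

# The sample rows map onto exactly the labels shown above:
print(week_ending_sunday(date(2021, 7, 29)))   # 2021-08-01
print(week_ending_sunday(date(2021, 8, 3)))    # 2021-08-08
print(week_ending_sunday(date(2021, 8, 13)))   # 2021-08-15
print(week_ending_sunday(date(2021, 8, 24)))   # 2021-08-29
print(week_ending_sunday(date(2021, 10, 28)))  # 2021-10-31
```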
It looks like it’s doing the equivalent of some date math combined with dt.truncate:
(df
 .with_columns(
     (pl.col("date") + pl.duration(weeks=1)).dt.truncate("1w")
     - pl.duration(days=1))
 .groupby(["date", "part_of_day"])
 .agg(pl.col("user_id").n_unique())
 .sort("date"))
shape: (5, 3)
┌─────────────────────┬─────────────┬─────────┐
│ date | part_of_day | user_id │
│ --- | --- | --- │
│ datetime[μs] | cat | u32 │
╞═════════════════════╪═════════════╪═════════╡
│ 2021-08-01 00:00:00 | night | 1 │
│ 2021-08-08 00:00:00 | day | 1 │
│ 2021-08-15 00:00:00 | day | 1 │
│ 2021-08-29 00:00:00 | day | 1 │
│ 2021-10-31 00:00:00 | evening | 1 │
└─────────────────────┴─────────────┴─────────┘