Python Polars group by on both time and categorical values

Question:

I have a Polars DataFrame with the three columns shown below.

user_id  date                 part_of_day
i32      datetime[ns]         cat
173367   2021-08-03 00:00:00  "day"
132702   2021-10-28 00:00:00  "evening"
100853   2021-07-29 00:00:00  "night"
305810   2021-08-24 00:00:00  "day"
305239   2021-08-13 00:00:00  "day"

My task is to count the unique users for each week and part of day. As of version 0.16.8, Polars provides three grouping methods:

  1. groupby: groups by columns or expressions.
  2. groupby_dynamic: groups by time windows and only accepts a datetime column as the grouper.
  3. groupby_rolling: groups by rolling time windows and also only accepts a datetime column as the grouper.

It seems like groupby_dynamic would be the best fit for this specific task. However, it does not allow for other columns to be used as the grouper. Therefore, my question is how can I accomplish this task using Polars?

Additionally, I have included some code that I would have used to solve this problem using Pandas.

(
    df
    .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
    .user_id
    .nunique()
)
Asked By: Alexandr Yusov

||

Answers:

It’s nice if you provide your example in code:

import io
import pandas as pd
import polars as pl

csv = """
user_id,date,part_of_day
173367,2021-08-03T00:00:00.000000,day
132702,2021-10-28T00:00:00.000000,evening
100853,2021-07-29T00:00:00.000000,night
305810,2021-08-24T00:00:00.000000,day
305239,2021-08-13T00:00:00.000000,day
"""

# Parse the date column as datetime and read part_of_day as a categorical.
df = pl.read_csv(
    io.StringIO(csv),
    try_parse_dates=True,
    dtypes={"part_of_day": pl.Categorical},
)

Your pandas code:

(df.with_columns(pl.col("part_of_day").cast(pl.Utf8))
 .to_pandas()
 .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
 .user_id
 .nunique()
 .reset_index())
        date part_of_day  user_id
0 2021-08-01       night        1
1 2021-08-08         day        1
2 2021-08-15         day        1
3 2021-08-29         day        1
4 2021-10-31     evening        1

It looks like it’s doing the equivalent of some date math combined with dt.truncate:

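(df
 .with_columns(
    # Shift forward one week, truncate to the start of that week (a Monday),
    # then step back one day to land on the Sunday that ends the row's week.
    (pl.col("date") + pl.duration(weeks=1)).dt.truncate("1w")
     - pl.duration(days=1))
 # Group on the week label together with the categorical column.
 .groupby("date", "part_of_day")
 .agg(pl.col("user_id").n_unique())
 .sort("date"))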
shape: (5, 3)
┌─────────────────────┬─────────────┬─────────┐
│ date                ┆ part_of_day ┆ user_id │
│ ---                 ┆ ---         ┆ ---     │
│ datetime[μs]        ┆ cat         ┆ u32     │
╞═════════════════════╪═════════════╪═════════╡
│ 2021-08-01 00:00:00 ┆ night       ┆ 1       │
│ 2021-08-08 00:00:00 ┆ day         ┆ 1       │
│ 2021-08-15 00:00:00 ┆ day         ┆ 1       │
│ 2021-08-29 00:00:00 ┆ day         ┆ 1       │
│ 2021-10-31 00:00:00 ┆ evening     ┆ 1       │
└─────────────────────┴─────────────┴─────────┘
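
To check the date math without Polars or pandas, here is a rough sketch of the same steps in plain Python. It assumes weeks start on Monday (which is what truncating to "1w" gives here) and that each row should be labeled with the Sunday that ends its week, as in the pandas output. The helper name week_ending_sunday is made up for this illustration and is not part of either library.

```python
from collections import defaultdict
from datetime import date, timedelta

# Sample rows copied from the question: (user_id, date, part_of_day).
rows = [
    (173367, date(2021, 8, 3), "day"),
    (132702, date(2021, 10, 28), "evening"),
    (100853, date(2021, 7, 29), "night"),
    (305810, date(2021, 8, 24), "day"),
    (305239, date(2021, 8, 13), "day"),
]


def week_ending_sunday(d):
    """Hypothetical helper: (d + 1 week) truncated to Monday, minus 1 day."""
    shifted = d + timedelta(weeks=1)
    monday = shifted - timedelta(days=shifted.weekday())  # truncate to Monday
    return monday - timedelta(days=1)  # the Sunday just before that Monday


# Collect the distinct user ids seen for each (week label, part_of_day) pair.
users = defaultdict(set)
for user_id, d, part in rows:
    users[(week_ending_sunday(d), part)].add(user_id)

# Number of unique users per group, ordered by week label.
counts = {key: len(ids) for key, ids in sorted(users.items())}
for (week, part), n in counts.items():
    print(week, part, n)
```

A date that already falls on a Sunday keeps its own label, and the following Monday moves to the next Sunday, which matches the right-closed weekly bins in the pandas result above.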
Answered By: jqurious