Python Polars group by on both time and categorical values
Question:
There is a polars DataFrame which consists of the 3 fields listed below.
user_id | date | part_of_day |
---|---|---|
i32 | datetime[ns] | cat |
173367 | 2021-08-03 00:00:00 | "day" |
132702 | 2021-10-28 00:00:00 | "evening" |
100853 | 2021-07-29 00:00:00 | "night" |
305810 | 2021-08-24 00:00:00 | "day" |
305239 | 2021-08-13 00:00:00 | "day" |
My task is to calculate the number of unique users for each week and time of day. Polars provides three grouping methods as of version 0.16.8:
- groupby: simple grouping by columns or expressions.
- groupby_dynamic: accepts only a datetime column as the grouper.
- groupby_rolling: likewise accepts only a datetime column as the grouper.
groupby_dynamic seems like the best fit for this task, but it does not appear to allow additional columns as groupers. So my question is: how can I accomplish this with Polars?
For reference, here is the Pandas code I would have used to solve this problem:
(
    df
    .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
    .user_id
    .nunique()
)
Answers:
It’s helpful if you provide your example as code:
import io
import pandas as pd
import polars as pl
csv = """
user_id,date,part_of_day
173367,2021-08-03T00:00:00.000000,day
132702,2021-10-28T00:00:00.000000,evening
100853,2021-07-29T00:00:00.000000,night
305810,2021-08-24T00:00:00.000000,day
305239,2021-08-13T00:00:00.000000,day
"""
df = pl.read_csv(
    io.StringIO(csv),
    try_parse_dates=True,
    dtypes={"part_of_day": pl.Categorical},
)
Your pandas code:
(df.with_columns(pl.col("part_of_day").cast(pl.Utf8))
   .to_pandas()
   .groupby([pd.Grouper(key="date", freq="1W"), "part_of_day"])
   .user_id
   .nunique()
   .reset_index())
date part_of_day user_id
0 2021-08-01 night 1
1 2021-08-08 day 1
2 2021-08-15 day 1
3 2021-08-29 day 1
4 2021-10-31 evening 1
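Those `date` labels are each row's week-ending Sunday, which is how pandas bins dates for `freq="1W"`. A minimal stdlib-only sketch of that labelling (the helper name `week_ending_sunday` is hypothetical, for illustration):

```python
from datetime import date, timedelta

def week_ending_sunday(d: date) -> date:
    """Label a date by the Sunday that ends its week,
    mirroring pandas' freq="1W" bins."""
    # Monday is weekday 0, Sunday is 6; advance to the next
    # Sunday (or stay put if d is already a Sunday).
    return d + timedelta(days=(6 - d.weekday()) % 7)

# The sample rows map onto exactly the labels shown above:
print(week_ending_sunday(date(2021, 7, 29)))   # 2021-08-01
print(week_ending_sunday(date(2021, 8, 3)))    # 2021-08-08
print(week_ending_sunday(date(2021, 8, 13)))   # 2021-08-15
print(week_ending_sunday(date(2021, 8, 24)))   # 2021-08-29
print(week_ending_sunday(date(2021, 10, 28)))  # 2021-10-31
```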
It looks like it’s doing the equivalent of some date math combined with dt.truncate:
(df
 .with_columns(
     (pl.col("date") + pl.duration(weeks=1)).dt.truncate("1w")
     - pl.duration(days=1))
 .groupby(["date", "part_of_day"])
 .agg(pl.col("user_id").n_unique())
 .sort("date"))
shape: (5, 3)
┌─────────────────────┬─────────────┬─────────┐
│ date | part_of_day | user_id │
│ --- | --- | --- │
│ datetime[μs] | cat | u32 │
╞═════════════════════╪═════════════╪═════════╡
│ 2021-08-01 00:00:00 | night | 1 │
│ 2021-08-08 00:00:00 | day | 1 │
│ 2021-08-15 00:00:00 | day | 1 │
│ 2021-08-29 00:00:00 | day | 1 │
│ 2021-10-31 00:00:00 | evening | 1 │
└─────────────────────┴─────────────┴─────────┘