How to use the Polars cut method and add the result to the original DataFrame
Question:
How can I use pl.cut in a select context, such as df.with_columns?
To be more specific, if I have a Polars DataFrame with a lot of columns and one of them is called x, how can I apply pl.cut to x and append the grouping result to the original DataFrame?
Below is what I tried, but it does not work:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6], "x": [1, 3, 5, 7, 9]})
df.with_columns(pl.cut(pl.col("x"), bins=[2, 4, 6]))
Thanks so much for your help.
Answers:
From the docs, as of 2023-01-25, cut takes a Series and returns a DataFrame. Unlike most methods and functions, it doesn’t take an expression, so you can’t use it in a select or with_column(s). To get your desired result, you’d have to join it to your original df.
Additionally, it appears that cut doesn’t necessarily preserve the dtype of the parent Series (this is almost certainly a bug): here x comes back as f64. As such, you have to cast it back to, in this case, Int64 before joining.
You’d have:
# pl.cut returns x as f64, so cast it back to Int64 before joining on x
df = df.join(
    pl.cut(df.get_column('x'), bins=[2, 4, 6]).with_columns(pl.col('x').cast(pl.Int64)),
    on='x'
)
shape: (5, 5)
┌─────┬─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ break_point ┆ category    │
│ --- ┆ --- ┆ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ f64         ┆ cat         │
╞═════╪═════╪═════╪═════════════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ 2.0         ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ 4.0         ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ 6.0         ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ inf         ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ inf         ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┴─────────────┘
As of 0.16.8, the top-level function pl.cut has been deprecated. You now have to use the Series method .cut instead, which returns a three-column DataFrame.
import polars as pl

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5],
     "b": [2, 3, 4, 5, 6],
     "x": [1, 3, 5, 7, 9]}
)
# get the x column as a Series and then apply the .cut method
df['x'].cut(bins=[2, 4, 6])
It returns a DataFrame like the following:
shape: (5, 3)
┌─────┬─────────────┬─────────────┐
│ x   ┆ break_point ┆ category    │
│ --- ┆ ---         ┆ ---         │
│ f64 ┆ f64         ┆ cat         │
╞═════╪═════════════╪═════════════╡
│ 1.0 ┆ 2.0         ┆ (-inf, 2.0] │
│ 3.0 ┆ 4.0         ┆ (2.0, 4.0]  │
│ 5.0 ┆ 6.0         ┆ (4.0, 6.0]  │
│ 7.0 ┆ inf         ┆ (6.0, inf]  │
│ 9.0 ┆ inf         ┆ (6.0, inf]  │
└─────┴─────────────┴─────────────┘
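To make the break_point and category columns in the table above easier to read, here is a plain-Python sketch of right-closed binning. It is only an illustration of the logic the output suggests, not how Polars implements cut; the helper name cut_right_closed is hypothetical.

```python
from bisect import bisect_left
import math


def cut_right_closed(values, bins):
    """Assign each value to a right-closed interval (lo, hi] built from sorted bins."""
    # Pad the breaks with -inf and inf so every value falls into some interval.
    edges = [-math.inf, *bins, math.inf]
    rows = []
    for v in values:
        # bisect_left gives the index of the first break >= v, so a value
        # equal to a break lands in the interval that ends at that break.
        i = bisect_left(bins, v)
        lo, hi = float(edges[i]), float(edges[i + 1])
        rows.append((v, hi, f"({lo}, {hi}]"))
    return rows


rows = cut_right_closed([1, 3, 5, 7, 9], bins=[2, 4, 6])
for row in rows:
    print(row)
```

Each tuple is (x, break_point, category); for example, 7 and 9 both exceed the last break, so they share the interval that ends at inf.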
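If you just want to add the cut categories to your main DataFrame, you can do so directly in with_columns(), which also accepts a Series. Passing maintain_order=True keeps the rows of the cut result aligned with df: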
df.with_columns(
    df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category'].alias('x_cut')
)
# or
df.with_columns(
    x_cut=df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category']
)
shape: (5, 4)
┌─────┬─────┬─────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ x_cut       │
│ --- ┆ --- ┆ --- ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ cat         │
╞═════╪═════╪═════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┘