How to use polars cut method returning result to original df

Question:

How can I use pl.cut in a select context, such as df.with_columns?

To be more specific, if I have a polars dataframe with a lot of columns and one of them is called x, how can I do pl.cut on x and append the grouping result into the original dataframe?

Below is what I tried but it does not work:

import polars as pl
df = pl.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6], "x": [1, 3, 5, 7, 9]})
df.with_columns(pl.cut(pl.col("x"), bins=[2, 4, 6]))

Thanks so much for your help.

Asked By: lebesgue


Answers:

From the docs, as of 2023-01-25, cut takes a Series and returns a DataFrame. Unlike most Polars methods and functions, it doesn’t take an expression, so you can’t use it in a select or with_column(s). To get your desired result, you’d have to join its output to your original df.

Additionally, it appears that cut doesn’t necessarily keep the dtype of the parent Series (this is most certainly a bug): here the Int64 column x comes back as Float64. As such, you have to cast it back to, in this case, int before joining.

You’d have:

df = df.join(
    # cut returns x as f64, so cast it back to Int64 to match the join key
    pl.cut(df.get_column('x'), bins=[2, 4, 6]).with_column(pl.col('x').cast(pl.Int64())),
    on='x'
)

shape: (5, 5)
┌─────┬─────┬─────┬─────────────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ break_point ┆ category    │
│ --- ┆ --- ┆ --- ┆ ---         ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ f64         ┆ cat         │
╞═════╪═════╪═════╪═════════════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ 2.0         ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ 4.0         ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ 6.0         ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ inf         ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ inf         ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┴─────────────┘
Answered By: Dean MacGregor

As of 0.16.8, the top-level function pl.cut has been deprecated. You have to use the series method .cut instead now, which returns a three-column DataFrame.

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5],
     "b": [2, 3, 4, 5, 6],
     "x": [1, 3, 5, 7, 9]}
)
# get x column as a Series and then apply .cut method
df['x'].cut(bins=[2, 4, 6])

It returns a DataFrame like the following:

shape: (5, 3)
┌─────┬─────────────┬─────────────┐
│ x   ┆ break_point ┆ category    │
│ --- ┆ ---         ┆ ---         │
│ f64 ┆ f64         ┆ cat         │
╞═════╪═════════════╪═════════════╡
│ 1.0 ┆ 2.0         ┆ (-inf, 2.0] │
│ 3.0 ┆ 4.0         ┆ (2.0, 4.0]  │
│ 5.0 ┆ 6.0         ┆ (4.0, 6.0]  │
│ 7.0 ┆ inf         ┆ (6.0, inf]  │
│ 9.0 ┆ inf         ┆ (6.0, inf]  │
└─────┴─────────────┴─────────────┘
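
To make the break_point and category columns above less opaque, here is a minimal pure-Python sketch of the right-closed binning rule that the output suggests. It is not how Polars implements cut; the helper name cut_right_closed is hypothetical and exists only for illustration.

```python
from bisect import bisect_left
import math


def cut_right_closed(values, bins):
    """Hypothetical helper: (break_point, category) per value, intervals closed on the right."""
    edges = sorted(float(b) for b in bins)
    lowers = [-math.inf] + edges   # lower bound of each interval
    uppers = edges + [math.inf]    # upper bound of each interval (the break point)
    rows = []
    for v in values:
        # Index of the first edge >= v, so a value equal to an edge
        # falls into the interval that ends at that edge.
        i = bisect_left(edges, v)
        rows.append((uppers[i], f"({lowers[i]}, {uppers[i]}]"))
    return rows


rows = cut_right_closed([1, 3, 5, 7, 9], [2, 4, 6])
for break_point, category in rows:
    print(break_point, category)
```

With the same x values and bins as above, this reproduces the five break_point / category pairs shown in the table.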

If you just want to add the cut categories to your main DataFrame, you can do so directly in with_columns():

df.with_columns(
    df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category'].alias('x_cut')
)

# or
df.with_columns(
    x_cut=df['x'].cut(bins=[2, 4, 6], maintain_order=True)['category']
)
shape: (5, 4)
┌─────┬─────┬─────┬─────────────┐
│ a   ┆ b   ┆ x   ┆ x_cut       │
│ --- ┆ --- ┆ --- ┆ ---         │
│ i64 ┆ i64 ┆ i64 ┆ cat         │
╞═════╪═════╪═════╪═════════════╡
│ 1   ┆ 2   ┆ 1   ┆ (-inf, 2.0] │
│ 2   ┆ 3   ┆ 3   ┆ (2.0, 4.0]  │
│ 3   ┆ 4   ┆ 5   ┆ (4.0, 6.0]  │
│ 4   ┆ 5   ┆ 7   ┆ (6.0, inf]  │
│ 5   ┆ 6   ┆ 9   ┆ (6.0, inf]  │
└─────┴─────┴─────┴─────────────┘
Answered By: steven