Polars Modify Many Columns Based On Value In Another Column
Question:
Say I have a DataFrame that looks like this:
import numpy as np
import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "feature_a": np.random.randint(0, 3, 5),  # random ints in [0, 3)
    "feature_b": np.random.randint(0, 3, 5),
    "label": [1, 0, 0, 1, 1],
})
┌─────┬───────────┬───────────┬───────┐
│ id ┆ feature_a ┆ feature_b ┆ label │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════╡
│ 1 ┆ 2 ┆ 0 ┆ 1 │
│ 2 ┆ 1 ┆ 1 ┆ 0 │
│ 3 ┆ 2 ┆ 2 ┆ 0 │
│ 4 ┆ 1 ┆ 0 ┆ 1 │
│ 5 ┆ 0 ┆ 0 ┆ 1 │
└─────┴───────────┴───────────┴───────┘
I want to modify all the feature columns based on the value in the label column, producing a new DataFrame:
┌─────┬───────────┬───────────┐
│ id ┆ feature_a ┆ feature_b │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╡
│ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 0 ┆ 0 │
│ 3 ┆ 0 ┆ 0 │
│ 4 ┆ 1 ┆ 1 │
│ 5 ┆ 1 ┆ 1 │
└─────┴───────────┴───────────┘
I know I can select all the feature columns by using a regex in the column selector:
pl.col(r"^feature_.*$")
And I can use a when/then expression to evaluate the label column:
pl.when(pl.col("label") == 1).then(1).otherwise(0)
But I can’t seem to put the two together to modify all the selected columns in one fell swoop. It seems so simple; what am I missing?
Answers:
Here’s one way:
Recently support was added for more ergonomic arguments in a lot of methods, including with_columns and select. Since they can now take any number of keyword arguments, each acting like an alias applied at the end (i.e. setting the new column name), we can construct a dict of the columns to overwrite and pass it in (with unpacking) like so:
df.select("id", **{col: "label" for col in df.columns if col.startswith("feature")})
In this simple case no when/then is needed, because the label column already holds the desired 0/1 values, but in general any expression evaluating to a column of the same height as id can go into this dict comprehension.