Adding a column based on condition in Polars
Question:
Let’s say I have a Polars dataframe like so:
import polars as pl
df = pl.DataFrame({
    'a': [0.3, 0.7, 0.5, 0.1, 0.9]
})
And now I need to add a new column where 1 or 0 is assigned depending on whether a value in column 'a'
is greater or less than some threshold. In Pandas I can do this:
import numpy as np
THRESHOLD = 0.5
df['new'] = np.where(df.a > THRESHOLD, 0, 1)
I can also do something very similar in Polars:
df = df.with_columns(
    pl.lit(np.where(df.select('a').to_numpy() > THRESHOLD, 0, 1).ravel())
    .alias('new')
)
This works fine but I’m sure that using NumPy here is not the best practice.
I’ve also tried something more like:
df = df.with_columns(
    pl.lit(df.filter(pl.col('a') > THRESHOLD).select([0, 1]))
    .alias('new')
)
But with this syntax I keep running into the following error:
DuplicateError Traceback (most recent call last)
Cell In[47], line 5
1 THRESHOLD = 0.5
2 DELAY_TOLERANCE = 10
4 df = df.with_columns(
----> 5 pl.lit(df.filter(pl.col('a') > THRESHOLD).select([0, 1]))
6 .alias('new')
7 )
8 df.head()
DuplicateError: column with name 'literal' has more than one occurrences
So my question is two-fold: what am I doing wrong here and what is the best practice in Polars for such conditional assignments?
I did look through the docs and previous questions but couldn’t find anything resembling my use case.
Answers:
The select([0, 1]) doesn’t really make a lot of sense Polars-wise, you’re just selecting a literal. Not quite sure why that’s throwing a DuplicateError as is.
Conditionals in Polars are best done with when:
df.with_columns(pl.when(pl.col("a") > 0.5).then(0).otherwise(1).alias("b"))