Why is Polars running my "then" function even if the "when" condition is false?
Question:
Given this DataFrame:
import polars as pl

d1 = pl.DataFrame({
    'x': ['a', None, 'b'],
    'y': [1.1, 2.2, 3.3]
})
print(d1)
shape: (3, 2)
┌──────┬─────┐
│ x    ┆ y   │
│ ---  ┆ --- │
│ str  ┆ f64 │
╞══════╪═════╡
│ a    ┆ 1.1 │
│ null ┆ 2.2 │
│ b    ┆ 3.3 │
└──────┴─────┘
I want to transform rows with non-null x into some f(x,y)=999, else null.
# Using the [print(d), 999][-1] hack just to trace calls
d2 = d1.select(
    pl.when(pl.col('x').is_not_null())
    .then(pl.struct(['x', 'y']).apply(lambda d: [print(d), 999][-1]))
    .otherwise(None)
    .alias('s')
)
print(d2)
Why do I get this print(d) output, even though d2 is correct? I expected the then branch to be evaluated only where x.is_not_null() holds.
{'x': 'a', 'y': 1.1}
{'x': None, 'y': 2.2} <<<< why?
{'x': 'b', 'y': 3.3}
shape: (3, 1)
┌──────┐
│ s    │
│ ---  │
│ i64  │
╞══════╡
│ 999  │
│ null │
│ 999  │
└──────┘
Why is the print executed even for the (null, 2.2) row?
Answers:
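As for why this happens: the .when() and .then() branches are executed in parallel, each over the whole column, and the "masking" is only done afterwards. The function inside .then() therefore sees every row, including the ones the condition later discards.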
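The evaluation order described above can be modelled in plain Python. This is only a sketch of the semantics, not how Polars is implemented; `when_then_otherwise`, `f` and `seen` are hypothetical names introduced for illustration.

```python
# Plain-Python model of "evaluate every branch, then mask".
# Hypothetical helper names; this is NOT Polars' implementation.

rows = [
    {"x": "a", "y": 1.1},
    {"x": None, "y": 2.2},
    {"x": "b", "y": 3.3},
]

seen = []  # records every row the "then" function was called on


def f(row):
    seen.append(row)
    return 999


def when_then_otherwise(mask, then_values, otherwise_values):
    # The selection step only picks between values that already exist.
    return [t if m else o for m, t, o in zip(mask, then_values, otherwise_values)]


# Each branch is computed for ALL rows before any masking happens.
mask = [row["x"] is not None for row in rows]
then_values = [f(row) for row in rows]
otherwise_values = [None] * len(rows)

result = when_then_otherwise(mask, then_values, otherwise_values)
print(result)     # [999, None, 999]
print(len(seen))  # 3: f also ran on the row where x is None
```

The result is masked correctly, yet `f` has already run on all three rows, which matches the traced output in the question.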
You can instead call .apply() on the result of .when().then() (.otherwise(None) is the default, so it can be omitted):
d1.with_columns(apply =
    pl.when(pl.col('x').is_not_null())
    .then(pl.col("x"))
    .apply(lambda self: [print(f"{self=}"), self][1])
)
self='a'
self='b'
shape: (3, 3)
┌──────┬─────┬───────┐
│ x    ┆ y   ┆ apply │
│ ---  ┆ --- ┆ ---   │
│ str  ┆ f64 ┆ str   │
╞══════╪═════╪═══════╡
│ a    ┆ 1.1 ┆ a     │
│ null ┆ 2.2 ┆ null  │
│ b    ┆ 3.3 ┆ b     │
└──────┴─────┴───────┘
This only prints twice because .apply() skips nulls by default.
It doesn’t appear to be working with a struct though:
d1.with_columns(apply =
    pl.when(pl.col('x').is_not_null())
    .then(pl.struct("x", "y"))
    .apply(lambda self: [print(f"{self=}"), self][1])
)
self={'x': 'a', 'y': 1.1}
self={'x': None, 'y': None}
self={'x': 'b', 'y': 3.3}
shape: (3, 3)
┌──────┬─────┬─────────────┐
│ x    ┆ y   ┆ apply       │
│ ---  ┆ --- ┆ ---         │
│ str  ┆ f64 ┆ struct[2]   │
╞══════╪═════╪═════════════╡
│ a    ┆ 1.1 ┆ {"a",1.1}   │
│ null ┆ 2.2 ┆ {null,null} │
│ b    ┆ 3.3 ┆ {"b",3.3}   │
└──────┴─────┴─────────────┘
Polars does consider a struct of all null values as null:
d1.with_columns(apply =
    pl.when(pl.col('x').is_not_null())
    .then(pl.struct("x", "y"))
    .is_null()
)
shape: (3, 3)
┌──────┬─────┬───────┐
│ x    ┆ y   ┆ apply │
│ ---  ┆ --- ┆ ---   │
│ str  ┆ f64 ┆ bool  │
╞══════╪═════╪═══════╡
│ a    ┆ 1.1 ┆ false │
│ null ┆ 2.2 ┆ true  │
│ b    ┆ 3.3 ┆ false │
└──────┴─────┴───────┘
So I think this could possibly be a "bug".
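If the goal is simply to keep the function from doing any work on rows where x is null, one option that does not rely on the null-skipping behaviour above is to put the check inside the function itself. The following is a minimal sketch in plain Python; `guarded`, `safe_f` and `calls` are hypothetical names, and the list comprehension only stands in for the row-by-row call that .apply() makes on a struct column.

```python
# Sketch: guard the null check inside the function instead of relying on
# when/then to skip rows. All helper names here are hypothetical.

def guarded(func, key):
    """Wrap func so it returns None, without calling func, when row[key] is None."""
    def wrapper(row):
        if row[key] is None:
            return None
        return func(row)
    return wrapper


calls = []  # records the rows that actually reached f


def f(row):
    calls.append(row)
    return 999


rows = [
    {"x": "a", "y": 1.1},
    {"x": None, "y": 2.2},
    {"x": "b", "y": 3.3},
]

safe_f = guarded(f, "x")

# Stand-in for applying safe_f to each struct row.
result = [safe_f(row) for row in rows]
print(result)      # [999, None, 999]
print(len(calls))  # 2: the row with x=None never reached f
```

With Polars, `safe_f` would presumably be passed as pl.struct(['x', 'y']).apply(safe_f) with no when/then around it; whether a None return value needs an explicit return dtype may depend on the Polars version.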