Find value of column based on another column condition (max) in polars for many columns
Question:
If I have this dataframe:
import polars as pl
df = pl.DataFrame(dict(x=[0, 1, 2, 3], y=[5, 2, 3, 3], z=[4, 7, 8, 2]))
shape: (4, 3)
┌─────┬─────┬─────┐
│ x ┆ y ┆ z │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 0 ┆ 5 ┆ 4 │
│ 1 ┆ 2 ┆ 7 │
│ 2 ┆ 3 ┆ 8 │
│ 3 ┆ 3 ┆ 2 │
└─────┴─────┴─────┘
and I want to find the value in x where y is max, then again find the value in x where z is max, and repeat for hundreds more columns so that I end up with something like:
shape: (2, 2)
┌────────┬─────────┐
│ column ┆ x_value │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪═════════╡
│ y ┆ 0 │
│ z ┆ 2 │
└────────┴─────────┘
or
shape: (1, 2)
┌─────┬─────┐
│ y ┆ z │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 2 │
└─────┴─────┘
What is the best polars way to do that?
Answers:
There is a PR to add a "by" argument to Expr.top_k(), which should allow:
y = pl.col("x").top_k(1, by="y")
z = pl.col("x").top_k(1, by="z")
Until then, you could perform a "wide to long" reshape with .melt():
>>> df.melt("x")
shape: (8, 3)
┌─────┬──────────┬───────┐
│ x ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪══════════╪═══════╡
│ 0 ┆ y ┆ 5 │
│ 1 ┆ y ┆ 2 │
│ 2 ┆ y ┆ 3 │
│ 3 ┆ y ┆ 3 │
│ 0 ┆ z ┆ 4 │
│ 1 ┆ z ┆ 7 │
│ 2 ┆ z ┆ 8 │
│ 3 ┆ z ┆ 2 │
└─────┴──────────┴───────┘
Then use .filter() to keep only the rows flagged by .peak_max() within each group:
(
    df.melt("x")
    .filter(
        pl.col("value").peak_max().over("variable")
    )
)
shape: (2, 3)
┌─────┬──────────┬───────┐
│ x ┆ variable ┆ value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪══════════╪═══════╡
│ 0 ┆ y ┆ 5 │
│ 2 ┆ z ┆ 8 │
└─────┴──────────┴───────┘