How to filter duplicates based on multiple columns in Polars?
Question:
I was earlier able to filter duplicates based on multiple columns using df.filter(pl.col(['A','C']).is_duplicated())
but after the latest version update this is not working.
import polars as pl
df = pl.DataFrame(
{
"A": [1,4,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
df.filter(pl.col(['A','C']).is_duplicated())
giving error
df.filter(df.select(
pl.col(['A','C']).is_duplicated()
)
)
giving error
Answers:
This behavior was noted as ambiguous in 0.16.10 and would return this error:
exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions:
col("A").is_duplicated(),
col("C").is_duplicated(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.
However 0.19.0 removed the deprecated behavior of all
/any
replaced by all_horizontal
and any_horizontal
. To get the same behavior as the pre-0.16.10, use df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated()))
I’ve modified the input slightly to reflect the differences between any_horizontal
and all_horizontal
import polars as pl
df = pl.DataFrame(
{
"A": [1,3,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
# print("legacy run in 0.16.9: ", df.filter(pl.col(['A','C']).is_duplicated()))
print("all_horizontal: ", df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated())))
print("any_horizontal: ", df.filter(pl.any_horizontal(pl.col(['A','C']).is_duplicated())))
legacy run in 0.16.9: shape: (4, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘
all_horizontal: shape: (4, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘
any_horizontal: shape: (6, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 5 ┆ 6 │
│ 4 ┆ 5 ┆ 6 │
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘
I was earlier able to filter duplicates based on multiple columns using df.filter(pl.col(['A','C']).is_duplicated())
but after the latest version update this is not working.
import polars as pl
df = pl.DataFrame(
{
"A": [1,4,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
df.filter(pl.col(['A','C']).is_duplicated())
giving error
df.filter(df.select(
pl.col(['A','C']).is_duplicated()
)
)
giving error
This behavior was noted as ambiguous in 0.16.10 and would return this error:
exceptions.ComputeError: The predicate passed to 'LazyFrame.filter' expanded to multiple expressions:
col("A").is_duplicated(),
col("C").is_duplicated(),
This is ambiguous. Try to combine the predicates with the 'all' or `any' expression.
However 0.19.0 removed the deprecated behavior of all
/any
replaced by all_horizontal
and any_horizontal
. To get the same behavior as the pre-0.16.10, use df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated()))
I’ve modified the input slightly to reflect the differences between any_horizontal
and all_horizontal
import polars as pl
df = pl.DataFrame(
{
"A": [1,3,4,7,7,10,10,13,16],
"B": [2,5,5,8,18,11,11,14,17],
"C": [3,6,6,9,9,12,12,15,18]
}
)
# print("legacy run in 0.16.9: ", df.filter(pl.col(['A','C']).is_duplicated()))
print("all_horizontal: ", df.filter(pl.all_horizontal(pl.col(['A','C']).is_duplicated())))
print("any_horizontal: ", df.filter(pl.any_horizontal(pl.col(['A','C']).is_duplicated())))
legacy run in 0.16.9: shape: (4, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘
all_horizontal: shape: (4, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘
any_horizontal: shape: (6, 3)
┌─────┬─────┬─────┐
│ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 5 ┆ 6 │
│ 4 ┆ 5 ┆ 6 │
│ 7 ┆ 8 ┆ 9 │
│ 7 ┆ 18 ┆ 9 │
│ 10 ┆ 11 ┆ 12 │
│ 10 ┆ 11 ┆ 12 │
└─────┴─────┴─────┘