Pandas REPLACE equivalent in Python Polars

Question

Is there an elegant way how to recode values in polars dataframe.

For example

1->0, 
2->0, 
3->1...

in Pandas it is simple like that:

df.replace([1,2,3,4,97,98,99],[0,0,1,1,2,2,2])

Asked By: zenelb

||

Source

Answer 1

Edit 2022-02-12

As of polars >=0.16.4 there is a map_dict expression.

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})

mapper = {
    1: 0,
    2: 0,
    3: 10,
    4: 10
}

df.select(
    pl.all().map_dict(mapper, default=pl.col("a"))
)

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘

Before Edit

In polars you can build columnar if else statetements called if -> then -> otherwise expressions.

So let’s say we have this DataFrame.

df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5]
})

And we’d like to replace these with the following values:

from_ = [1, 2]
to_ = [99, 12]

We could write:

df.with_column(
    pl.when(pl.col("a") == from_[0])
    .then(to_[0])
    .when(pl.col("a") == from_[1])
    .then(to_[1])
    .otherwise(pl.col("a")).alias("a")
)

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
├╌╌╌╌╌┤
│ 12  │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 5   │
└─────┘

Don’t repeat yourself

Now, this becomes very tedious to write really fast, so we could write a function that generates these expressions for use, we are programmers aren’t we!

So to replace with the values you have suggested, you could do:

from_ = [1,2,3,4,97,98,99]
to_ = [0,0,1,1,2,2,2]


def replace(column, from_, to_):
    # initiate the expression with `pl.when`
    branch =  pl.when(pl.col(column) == from_[0]).then(to_[0])

    
    # for every value add a `when.then`
    for (from_value, to_value) in zip(from_, to_):
        branch = branch.when(pl.col(column) == from_value).then(to_value)

    # finish with an `otherwise`
    return branch.otherwise(pl.col(column)).alias(column)
    


df.with_column(replace("a", from_, to_))

Which outputs:

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 5   │
└─────┘

Answered By: ritchie46

Answer 2

Just in case you like the pandas docstrings as well and want to place it as a utils function somewhere in your repo

def replace(column: str, mapping: dict) -> pl.internals.expr.Expr:
    """
    Create a polars expression that replaces a columns values.

    Parameters
    ----------
    column : str
        Column name on which values should be replaced.
    mapping : dict
        Can be used to specify different replacement values for different existing values. For example,
        ``{'a': 'b', 'y': 'z'}`` replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. Values not mentioned in ``mapping``
        will stay the same.

    Returns
    -------
    pl.internals.expr.Expr
        Expression that contains instructions to replace values in ``column`` according to ``mapping``.

    Raises
    ------
    Exception
        * If ``mapping`` is empty.
    TypeError
        * If ``column`` is not ``str``.
        * If ``mapping`` is not ``dict``.
    polars.exceptions.PanicException
        * When ``mapping`` has keys or values that are not mappable to arrows format. Only catchable via BaseException.
          See also https://pola-rs.github.io/polars-book/user-guide/datatypes.html.

    Examples
    --------
    >>> import polars as pl
    >>> df = pl.DataFrame({'fruit':['banana', 'apple', 'pie']})
    >>> df
    shape: (3, 1)
    ┌────────┐
    │ fruit  │
    │ ---    │
    │ str    │
    ╞════════╡
    │ banana │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    ├╌╌╌╌╌╌╌╌┤
    │ apple  │
    └────────┘
    >>> df.with_column(replace(column='fruit', mapping={'apple': 'pomegranate'}))
    shape: (3, 1)
    ┌─────────────┐
    │ fruit       │
    │ ---         │
    │ str         │
    ╞═════════════╡
    │ banana      │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    ├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
    │ pomegranate │
    └─────────────┘

    """
    if not mapping:
        raise Exception("Mapping can't be empty")
    elif not isinstance(mapping, dict):
        TypeError(f"mapping must be of type dict, but is type: {type(mapping)}")
    if not isinstance(column, str):
        raise TypeError(f"column must be of type str, but is type: {type(column)}")

    branch = pl.when(pl.col(column) == list(mapping.keys())[0]).then(
        list(mapping.values())[0]
    )

    for from_value, to_value in mapping.items():
        branch = branch.when(pl.col(column) == from_value).then(to_value)

    return branch.otherwise(pl.col(column)).alias(column)

Answered By: supersick

Answer 3

You can also use apply with a dict, as long as you specify an exhaustive mapping for each from_ option:

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})

from_ = [1, 2, 3, 4, 5]
to_ = [99, 12, 4, 18, 64]
my_map = dict(zip(from_, to_))

df.select(pl.col("a").apply(lambda x: my_map[x]))

which outputs:

shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 99  │
├╌╌╌╌╌┤
│ 12  │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 18  │
├╌╌╌╌╌┤
│ 64  │
└─────┘

It’ll be slower than ritchie46’s answer but it’s quite a bit simpler.

Answered By: NedDasty

Answer 4

Can’t use code snippet in comments, so I’ll post this slight generalization as an answer.

In case the value to be mapped is missing from the mapping, this accepts a default value if provided, else it will act as if the mapping is the identity mapping.

import polars as pl
def apply_map(
    column: str, mapping: dict, default = None
) -> pl.Expr:
    branch = pl
    for key, value in mapping.items():
        branch = branch.when(pl.col(column) == key).then(value)
    default = pl.lit(default) if default is not None else pl.col(column)
    return branch.otherwise(default).alias(column)

Answered By: Radu

Answer 5

This could also be done by

converting a mapping dictionary into a polars.DataFrame
LEFT JOINing the original data to the mapper DataFrame
fill any missing values (which the mapper dict does not account for) with the original values
- .map_dict() does this using default=...
drop the original column

Here is the code to show this

Data

df = pl.DataFrame({"a": [1, 2, 3, 4, 5]})
print(df)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
│ 2   │
│ 3   │
│ 4   │
│ 5   │
└─────┘

Define the mapper dict

mapper = {1: 0, 2: 0, 3: 10, 4: 10}

Create a mapper DataFrame

df_mapper = pl.DataFrame([{"a": k, "values": v} for k, v in mapper.items()])
print(df_mapper)
shape: (4, 2)
┌─────┬────────┐
│ a   ┆ values │
│ --- ┆ ---    │
│ i64 ┆ i64    │
╞═════╪════════╡
│ 1   ┆ 0      │
│ 2   ┆ 0      │
│ 3   ┆ 10     │
│ 4   ┆ 10     │
└─────┴────────┘

Use LEFT JOIN and .fill_null() to map the values, and then drop the original column

df = (
    df
    # LEFT JOIN
    .join(df_mapper, on=["a"], how="left")
    # fill missing values in mapped column with values from original column
    .with_columns([pl.col("values").fill_null(pl.col("a"))])
    # drop original column and replace with mapped column
    .drop(["a"])
    .rename({"values": "a"})
)
print(df)
shape: (5, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
│ 0   │
│ 10  │
│ 10  │
│ 5   │
└─────┘

Answered By: edesz

Pandas REPLACE equivalent in Python Polars

Question:

Answers:

Edit 2022-02-12

Before Edit

Don’t repeat yourself