How to Write Poisson CDF as Python Polars Expression

Question:

I have a collection of polars expressions being used to generate features for an ML model. I’d like to add a poission cdf feature to this collection whilst maintaining lazy execution (with benefits of speed, caching etc…). I so far have not found an easy way of achieving this.

I’ve been able to get the result I’d like outside of the desired lazy expression framework with:

import polars as pl
from scipy.stats import poisson

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_column(pl.Series(result).alias("poission_cdf"))

However, in reality I’d like this to look like:

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
df = df.select(
    [
        ... # bunch of other expressions here
        poisson_cdf()
    ]
)

where poisson_cdf is some polars expression like:

def poisson_cdf():
    # this is just for illustration, clearly wont work
    return scipy.stats.poisson.cdf(pl.col("count"), pl.col("expected_count")).alias("poisson_cdf")

I also tried using a struct made up of "count" and "expected_count" and apply like advised in the docs when applying custom functions. However, my dataset is several millions of rows in reality – leading to absurd execution time.

Any advice or guidance here would be appreciated. Ideally there exists an expression like this somewhere out there? Thanks in advance!

Asked By: TheRealBenbo

||

Answers:

It sounds like you want to use .map() instead of .apply() – which will pass whole columns at once.

df.select([
   pl.all(),
   # ...
   pl.struct(["count", "expected_count"])
     .map(lambda x: 
        poisson.cdf(x.struct.field("count"), x.struct.field("expected_count")))
     .flatten()
     .alias("poisson_cdf")
])
shape: (5, 3)
┌───────┬────────────────┬─────────────┐
│ count | expected_count | poisson_cdf │
│ ---   | ---            | ---         │
│ i64   | f64            | f64         │
╞═══════╪════════════════╪═════════════╡
│ 9     | 7.7            | 0.75308     │
│ 2     | 0.2            | 0.998852    │
│ 3     | 0.7            | 0.994247    │
│ 4     | 1.1            | 0.994565    │
│ 5     | 7.5            | 0.241436    │
└───────┴────────────────┴─────────────┘
Answered By: jqurious

If scipy.stats.poisson.cdf was implemented as a proper numpy universal function, it would be possible to use it directly on polars expressions, but it is not. Fortunately, Poisson CDF is almost the same as regularized upper incomplete gamma function for which scipy supplies gammaincc which can be used in polars expressions:

>>> import polars as pl
>>> from scipy.special import gammaincc
>>> df = pl.select(pl.arange(0, 10).alias('k'))
>>> df.with_columns(cdf=gammaincc(pl.col('k') + 1, 4.0))
shape: (10, 2)
┌─────┬──────────┐
│ k   ┆ cdf      │
│ --- ┆ ---      │
│ i64 ┆ f64      │
╞═════╪══════════╡
│ 0   ┆ 0.018316 │
│ 1   ┆ 0.091578 │
│ 2   ┆ 0.238103 │
│ 3   ┆ 0.43347  │
│ ... ┆ ...      │
│ 6   ┆ 0.889326 │
│ 7   ┆ 0.948866 │
│ 8   ┆ 0.978637 │
│ 9   ┆ 0.991868 │
└─────┴──────────┘

The result is the same as returned by poisson.cdf:

>>> _.with_columns(cdf2=pl.lit(poisson.cdf(df['k'], 4)))
shape: (10, 3)
┌─────┬──────────┬──────────┐
│ k   ┆ cdf      ┆ cdf2     │
│ --- ┆ ---      ┆ ---      │
│ i64 ┆ f64      ┆ f64      │
╞═════╪══════════╪══════════╡
│ 0   ┆ 0.018316 ┆ 0.018316 │
│ 1   ┆ 0.091578 ┆ 0.091578 │
│ 2   ┆ 0.238103 ┆ 0.238103 │
│ 3   ┆ 0.43347  ┆ 0.43347  │
│ ... ┆ ...      ┆ ...      │
│ 6   ┆ 0.889326 ┆ 0.889326 │
│ 7   ┆ 0.948866 ┆ 0.948866 │
│ 8   ┆ 0.978637 ┆ 0.978637 │
│ 9   ┆ 0.991868 ┆ 0.991868 │
└─────┴──────────┴──────────┘

You want to take advantage of the fact that scipy has a set of functions which are numpy ufuncs as those

still have fast columnar operation through the NumPy API.

Specifically you want the pdtr function.

You then want to use reduce rather than map or apply as those are for generic python functions and aren’t going to perform as well.

So if we have…

df = pl.DataFrame({"count": [9,2,3,4,5], "expected_count": [7.7, 0.2, 0.7, 1.1, 7.5]})
result = poisson.cdf(df["count"].to_numpy(), df["expected_count"].to_numpy())
df = df.with_columns(pl.Series(result).alias("poission_cdf"))

then we can add to it with

df=df.with_columns([
    pl.reduce(f=pdtr, exprs=[pl.col('count'),pl.col('expected_count')]).alias("poicdf")
])
df

shape: (5, 4)
┌───────┬────────────────┬──────────────┬──────────┐
│ count ┆ expected_count ┆ poission_cdf ┆ poicdf   │
│ ---   ┆ ---            ┆ ---          ┆ ---      │
│ i64   ┆ f64            ┆ f64          ┆ f64      │
╞═══════╪════════════════╪══════════════╪══════════╡
│ 9     ┆ 7.7            ┆ 0.75308      ┆ 0.75308  │
│ 2     ┆ 0.2            ┆ 0.998852     ┆ 0.998852 │
│ 3     ┆ 0.7            ┆ 0.994247     ┆ 0.994247 │
│ 4     ┆ 1.1            ┆ 0.994565     ┆ 0.994565 │
│ 5     ┆ 7.5            ┆ 0.241436     ┆ 0.241436 │
└───────┴────────────────┴──────────────┴──────────┘

You can see it gives the same answer.

Answered By: Dean MacGregor