how to use numpy-like vectorization properly to accelerate complex condition evaluation in pandas dataframe apply functions

Question:

numpy/pandas are famous for the acceleration they get from vectorization under the hood.

Condition evaluation is a common kind of expression that occurs everywhere in code.

However, when the pandas DataFrame apply function is used in the intuitive way, condition evaluation turns out to be very slow.

An example of my apply code looks like this:

def condition_eval(df):
    # df here is a single row (a Series), because apply is called with axis=1.
    x = df['x']
    a = df['a']
    b = df['b']
    if x <= a:
        d = round((x - a) / 0.01) - 1
        if d < -10:          # clamp at -10
            d = -10
    elif x >= b:
        d = round((x - b) / 0.01) + 1
        if d > 10:           # clamp at 10
            d = 10
    else:
        d = 0
    return d

df['eval_result'] = df.apply(condition_eval, axis=1)

The properties of this kind of problem are:

  1. the result for each row can be computed from that row's data alone, and it always involves multiple columns.
  2. every row uses the same computation algorithm.
  3. the algorithm may contain complex conditional branches.

What is the best practice in numpy/pandas for solving this kind of problem?
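
For reference, a small synthetic frame like the one below can be used to benchmark any proposed solution (the column names x, a, b are taken from the snippet above; the values themselves are made up):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000_000
df = pd.DataFrame({
    'x': rng.uniform(-1.0, 1.0, n),
    'a': rng.uniform(-1.0, 0.0, n),   # per-row lower threshold
    'b': rng.uniform(0.0, 1.0, n),    # per-row upper threshold
})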


Some further thoughts.

In my opinion, one of the reasons vectorization is so effective is that the underlying CPU provides vector instructions (e.g. SIMD, Intel AVX). These rely on the fact that the computation behaves deterministically: no matter what the input data is, the result is produced after a fixed number of CPU cycles, which makes this kind of operation easy to parallelize.

However, branch execution on a CPU is much more complicated. First of all, different branches of the same condition evaluation follow different execution paths, so they may take different numbers of CPU cycles. Modern CPUs also rely on tricks like branch prediction, which introduce even more uncertainty.
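
As far as I understand, the usual vectorized workaround is to evaluate every branch for every element and then select results with a mask, roughly like this (a sketch on plain numpy arrays, not a claim about what pandas does internally):

import numpy as np

x = np.array([0.5, 2.0, 5.0])
a = np.array([1.0, 1.0, 1.0])

# The branch value is computed for every element; np.where only selects per element.
low = np.round((x - a) / 0.01) - 1   # value of the "x <= a" branch
d = np.where(x <= a, low, 0)         # mask picks which value each element keeps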

So I wonder whether and how pandas tries to accelerate this kind of vectorized condition evaluation, and whether there is a better practice for handling such computational workloads.

Asked By: Dai Zhang


Answers:

This should be equivalent:

import pandas as pd
import numpy as np

def get_eval_result(df):
    # Both conditions are evaluated for the whole column at once.
    conditions = (
        df.x.le(df.a),   # x <= a
        df.x.ge(df.b),   # x >= b, matching the original elif
    )
    # Each choice computes its branch for every row, then clamps to -10 or 10.
    choices = (
        np.where((d := df.x.sub(df.a).div(0.01).round().sub(1)).lt(-10), -10, d),
        np.where((d := df.x.sub(df.b).div(0.01).round().add(1)).gt(10), 10, d),
    )
    # np.select picks, per row, the first choice whose condition is True, else 0.
    return np.select(conditions, choices, 0)

df = df.assign(eval_result=get_eval_result)

My answer calculates the result of every branch for every row, and then uses np.select to specify which of those results each row should keep. This could be optimized slightly, but since it relies purely on vectorized functions, it should be far faster than using .apply.
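
To sanity-check equivalence and measure the speedup on your own data, something along these lines can be run (condition_eval is the apply-based function from the question; the actual timings depend on the machine and the data):

import time

start = time.perf_counter()
apply_result = df.apply(condition_eval, axis=1)
print('apply:     ', time.perf_counter() - start)

start = time.perf_counter()
vectorized_result = get_eval_result(df)
print('vectorized:', time.perf_counter() - start)

# Both approaches should produce the same value for every row.
assert (apply_result.to_numpy() == vectorized_result).all()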

Answered By: BeRT2me

np.select is best for this:

(df
 .assign(column_to_alter=lambda x: np.select([cond1, cond2, cond3],
                                             [option1, option2, option3],
                                             default='somevalue'))
)
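
Filled in for the question's columns, the pattern might look like this (just a sketch; Series.clip is used here to apply the +/-10 caps):

import numpy as np

df = df.assign(
    eval_result=lambda d: np.select(
        [d.x <= d.a, d.x >= d.b],                              # conditions, checked in order
        [(((d.x - d.a) / 0.01).round() - 1).clip(lower=-10),   # x <= a branch, floored at -10
         (((d.x - d.b) / 0.01).round() + 1).clip(upper=10)],   # x >= b branch, capped at 10
        default=0,
    )
)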
Answered By: William Rosenbaum