A faster numpy.polynomial?

Question:

I have a very simple problem: in my Python toolbox, I have to compute the values of polynomials (usually degree 3 or 2, seldom others, always integer degree) over a large vector (size >> 10^6). Storing the result in a buffer is not an option, because I have several of these vectors and would quickly run out of memory, and in any case I usually only need to compute it once. The performance of numpy.polyval is actually quite good, but it is still my bottleneck. Can I somehow make the evaluation of the polynomial faster?
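For concreteness, here is a minimal sketch of the kind of call in question (the coefficients and array size are illustrative, not from my actual toolbox):

import numpy as np

# an illustrative cubic, highest-degree coefficient first (np.polyval convention)
p = [4.5, 9.8, -9.2, 1.2]

# a large input vector (size well beyond 10^6)
x = np.random.uniform(0.0, 1e4, size=10_000_000)

y = np.polyval(p, x)   # this call is the bottleneck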

Addendum

I think the pure-numpy solution from Joe Kington is good for me, in particular because it avoids potential installation issues with other libraries or Cython. For those who asked: the numbers in the vector are large (of order 10^4), so I don’t think the suggested approximations would work.

Asked By: user3498123


Answers:

You can actually speed it up slightly by doing the operations in-place (or by using numexpr or numba, which will automatically do what I’m doing manually below).

numpy.polyval is a very short function. Leaving out a few type checks, etc., it amounts to:

import numpy as np

def polyval(p, x):
    # Horner's method: ((p[0]*x + p[1])*x + p[2])*x + ... + p[-1]
    y = np.zeros_like(x)
    for i in range(len(p)):
        y = x * y + p[i]
    return y

The downside to this approach is that a temporary array will be created inside the loop as opposed to doing the operation in-place.

What I’m about to do is a micro-optimization and is only worthwhile for very large x inputs. Furthermore, we’ll have to assume floating-point output instead of letting the upcasting rules determine the output’s dtype. However, it will speed this up slightly and make it use less memory:

def faster_polyval(p, x):
    # In-place Horner's method: the in-place ops reuse y's buffer instead of
    # allocating a temporary array on every iteration.
    y = np.zeros(x.shape, dtype=float)
    for i, v in enumerate(p):
        y *= x
        y += v
    return y

As an example, let’s say we have the following input:

# Third order polynomial
p = [4.5, 9.8, -9.2, 1.2]

# One-million element array
x = np.linspace(-10, 10, 1_000_000)

The results are identical:

In [3]: np_result = np.polyval(p, x)

In [4]: new_result = faster_polyval(p, x)

In [5]: np.allclose(np_result, new_result)
Out[5]: True

And we get a modest 2-3x speedup (which is largely independent of array size, as the gain comes from avoiding memory allocation rather than from reducing the number of operations):

In [6]: %timeit np.polyval(p, x)
10 loops, best of 3: 20.7 ms per loop

In [7]: %timeit faster_polyval(p, x)
100 loops, best of 3: 7.46 ms per loop

For really huge inputs, the memory usage difference will matter more than the speed differences. The “bare” numpy version will use ~2x more memory at peak usage than the faster_polyval version.
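As mentioned at the top of this answer, numexpr or numba can do this kind of single-pass, allocation-free evaluation for you. The two sketches below are not part of the original answer; they assume a 1-D float x, coefficients ordered highest-degree first (the np.polyval convention), and that the respective library is installed:

import numpy as np
import numexpr as ne
from numba import njit

def numexpr_polyval3(p, x):
    # cubic only: the polynomial is written out in Horner form as a string,
    # and numexpr evaluates it in blocks without full-size temporary arrays
    p3, p2, p1, p0 = p
    return ne.evaluate("((p3 * x + p2) * x + p1) * x + p0")

@njit
def numba_polyval(coeffs, x):
    # coeffs must be a 1-D NumPy array; the explicit scalar loop is compiled,
    # so no intermediate arrays are created at all
    out = np.empty_like(x)
    for i in range(x.size):
        acc = coeffs[0]
        for j in range(1, coeffs.size):
            acc = acc * x[i] + coeffs[j]
        out[i] = acc
    return out

# usage, e.g.:
# y1 = numexpr_polyval3(p, x)
# y2 = numba_polyval(np.asarray(p, dtype=float), x)

Whether either of these beats the in-place NumPy version depends on the machine and thread settings, so it is worth timing them on your own data.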

Answered By: Joe Kington

I ended up here when I wanted to know whether np.polyval or np.polynomial.polynomial.polyval is faster.
It is interesting to see that simple implementations are faster, as @Joe Kington shows. (I had hoped for some optimisation by numpy.)

So here is my comparison with np.polynomial.polynomial.polyval and a slightly faster version.

def fastest_polyval(x, a):
    # a: coefficients in ascending order (a[0] + a[1]*x + ...), the
    # np.polynomial convention, so a[-1] is the highest-degree coefficient
    y = a[-1]
    for ai in a[-2::-1]:
        y *= x
        y += ai
    return y

It avoids allocating the initial zero array and needs one fewer loop iteration.

y_np = np.polyval(p, x)
y_faster = faster_polyval(p, x)
prev = 1 * p[::-1]   # coefficients in ascending order, as np.polynomial.polynomial.polyval expects
y_np2 = np.polynomial.polynomial.polyval(x, prev)
y_fastest = fastest_polyval(x, prev)

np.allclose(y_np, y_faster), np.allclose(y_np, y_np2), np.allclose(y_np, y_fastest)
# (True, True, True)
%timeit np.polyval(p, x)
%timeit faster_polyval(p, x)
%timeit np.polynomial.polynomial.polyval(x, prev)
%timeit fastest_polyval(x, prev)

# 6.51 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 3.69 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 6.28 ms ± 43.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 2.65 ms ± 35.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)