Why is Python recursion so expensive and what can we do about it?

Question:

Suppose we want to compute some Fibonacci numbers, modulo 997.

For n=500 in C++ we can run

#include <iostream>
#include <array>

std::array<int, 2> fib(unsigned n) {
    if (!n)
        return {1, 1};
    auto x = fib(n - 1);
    return {(x[0] + x[1]) % 997, (x[0] + 2 * x[1]) % 997};
}

int main() {
    std::cout << fib(500)[0];
}

and in Python

def fib(n):
    if n==1:
        return (1, 2)
    x=fib(n-1)
    return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)

if __name__=='__main__':
    print(fib(500)[0])

Both will find the answer 996 without issues. We are taking modulos to keep the output size reasonable and using pairs to avoid exponential branching.

For n=5000, the C++ code outputs 783, but Python will complain

RecursionError: maximum recursion depth exceeded in comparison

If we add a couple of lines

import sys

def fib(n):
    if n==1:
        return (1, 2)
    x=fib(n-1)
    return ((x[0]+x[1]) % 997, (x[0]+2*x[1]) % 997)

if __name__=='__main__':
    sys.setrecursionlimit(5000)
    print(fib(5000)[0])

then Python too will give the right answer.

For n=50000 C++ finds the answer 151 within milliseconds while Python crashes (at least on my machine).

Why are recursive calls so much cheaper in C++? Can we somehow modify the Python compiler to make it more receptive to recursion?

Of course, one solution is to replace recursion with iteration. For Fibonacci numbers, this is easy to do. However, this will swap the initial and the terminal conditions, and the latter is tricky for many problems (e.g. alpha–beta pruning). So generally, this will require a lot of hard work on the part of the programmer.

Asked By: jtallk

||

Answers:

You can increase the recursion limit using:

import sys

sys.setrecursionlimit(new_limit)

But note that this limit exists for a reason and that pure Python is not optimized for recursion (and compute-intensive tasks in general).

Answered By: qcoumes

The issue is that Python has an internal limit on number of recursive function calls.

That limit is configurable as shown in Quentin Coumes’ answer. However, too deep a function chain will result in a stack overflow. This underlying limitation¹ applies to both C++ and Python. This limitation also applies to all function calls, not just recursive ones.

In general: You should not write² algorithms that have recursion depth growth with linear complexity or worse. Logarithmically growing recursion is typically fine. Tail-recursive functions are trivial to re-write as iterations. Other recursions may be converted to iteration using external data structures (usually, a dynamic stack).

A related rule of thumb is that you shouldn’t have large objects with automatic storage. This is C++-specific since Python doesn’t have the concept of automatic storage.


¹ The underlying limitation is the execution stack size. The default size differs between systems, and different function calls consume different amounts of memory, so the limit isn’t specified as a number of calls but in bytes instead. This too is configurable on some systems. I wouldn’t typically recommend touching that limit due to portability concerns.

² Exceptions to this rule of thumb are certain functional languages that guarantee tail-recursion elimination – such as Haskell – where that rule can be relaxed in case of recursions that are guaranteed to be eliminated. Python is not such a language, and the function in question isn’t tail-recursive. While C++ compilers can perform the elimination as an optimization, it isn’t guaranteed, and is typically not optimized in debug builds. Hence, the exception doesn’t generally apply to C++ either.

Disclaimer: The following is my hypothesis; I don’t actually know their rationale: The Python limit is probably a feature that detects potentially infinite recursions, preventing likely unsafe stack overflow crashes and substituting a more controlled RecursionError.

Why are recursive calls so much cheaper in C++?

C++ is a compiled language. Python is interpreted. (Nearly) everything is cheaper in C++, except the translation from source code to an executable program.

Answered By: eerorika

Let me first answer your direct questions:

Why are recursive calls so much cheaper in C++?

Because C++ has no limitation on recursive call depth, except the size of the stack. And being a fully compiled language, loops (including recursion) are much faster in C++ than in Python (the reason why special Python modules like numpy/scipy directly use C routines). Additionally, most C++ implementations use a special feature called tail recursion elimination (see later in this post) and optimize some recursive code into iterative equivalents. This is nice here but not guaranteed by the standard, so other compilations could result in a program crashing miserably – but tail recursion is probably not involved here.

If the recursion is too deep and exhausts the available stack, you will invoke the well-known Undefined Behaviour where anything can happen, from an immediate crash to a program giving wrong results (IMHO the latter is much worse and cannot be detected…)

Can we somehow modify the Python compiler to make it more receptive to recursion?

No. Python implementation explicitly never uses tail recursion elimination. You could increase the recursion limit, but this is almost always a bad idea (see later in this post why).

Now for the true explanation of the underlying rationale.

Deep recursion is evil, full stop. You should never use it. Recursion is a handy tool when you can make sure that the depth will stay in sane limits. Python uses a soft limit to warn the programmer that something is going wrong before crashing the system. On the other hand, optimizing C and C++ compilers often internally change tail recursion into an iterative loop. But relying on it is highly dangerous because a slight change could prevent that optimization and cause an application crash.

As found in this other SO post, common Python implementations do not implement that tail recursion elimination. So you should not use recursion at a 5000 depth but instead use an iterative algorithm.

As your underlying computation will need all Fibonacci numbers up to the specified one, it is not hard to iteratively compute them. Furthermore, it will be much more efficient!

Answered By: Serge Ballesta

A solution is a trampoline: the recursive function, instead of calling another function, returns a function that makes that call with the appropriate arguments. There’s a loop one level higher that calls all those functions in a loop until we have the final result. I’m probably not explaining it very well; you can find resources online that do a better job.

The point is that this converts recursion to iteration. I don’t think this is faster, maybe it’s even slower, but the recursion depth stays low.

An implementation could look like below. I split the pair x into a and b for clarity. I then converted the recursive function to a version that keeps track of a and b as arguments, making it tail recursive.

def fib_acc(n, a, b):
    if n == 1:
        return (a, b)
    return lambda: fib_acc(n - 1, (a+b) % 997, (a+2*b) % 997)

def fib(n):
    x = fib_acc(n, 1, 2)
    while callable(x):
        x = x()
    return x

if __name__=='__main__':
    print(fib(50000)[0])
Answered By: Roel Schroeven

"Both will find the answer 996 without issues"

I do see at least one issue : the answer should be 836, not 996.

It seems that both your functions calculate Fibonacci(2*n) % p, and not Fibonacci(n) % p.

996 is the result of Fibonacci(1000) % 997.

Pick a more efficient algorithm

An inefficient algorithm stays an inefficient algorithm, regardless if it’s written in C++ or Python.

In order to compute large Fibonacci numbers, there are much faster methods than simple recursion with O(n) calls (see related article).

For large n, this recursive O(log n) Python function should run in circles around your above C++ code:

from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci(n, p):
    "Calculate Fibonacci(n) modulo p"
    if n < 3:
        return [0, 1, 1][n]
    if n % 2 == 0:
        m = n // 2
        v1 = fibonacci(m - 1, p)
        v2 = fibonacci(m, p)
        return (2*v1 + v2) * v2 % p
    else:
        m = (n + 1) // 2
        v1 = fibonacci(m, p) ** 2
        v2 = fibonacci(m - 1, p) ** 2
        return (v1 + v2) % p


print(fibonacci(500, 997))
#=> 836
print(fibonacci(1000, 997))
#=> 996

Try it online!

It will happily calculate fibonacci(10_000_000_000_000_000, 997).

It’s possible to add recursion level as parameter, in order to see how deep the recursion needs to go, and display it with indentation. Here’s an example for n=500:

# Recursion tree:
500
  249
    124
      61
        30
          14
            6
              2
              3
                1
                2
            7
              4
          15
            8
        31
          16
      62
    125
      63
        32
  250

Try it online!

Your examples would simply look like very long diagonals:

500
  499
    498
      ...
        ...
           1
Answered By: Eric Duminil

For Windows executables, the stack size is specified in the header of the executable. For the Windows version of Python 3.7 x64, that size is 0x1E8480 or exactly 2.000.000 bytes.

That version crashes with

Process finished with exit code -1073741571 (0xC00000FD)

and if we look that up we find that it’s a Stack Overflow.

What we can see on the (native) stack with a native debugger like WinDbg (enable child process debugging) is

[...]
fa 000000e9`6da1b680 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
fb 000000e9`6da1b740 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e
fc 000000e9`6da1b980 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
fd 000000e9`6da1ba40 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e
fe 000000e9`6da1bc80 00007fff`fb698a6e     python37!PyArg_UnpackStack+0x371
ff 000000e9`6da1bd40 00007fff`fb68b841     python37!PyEval_EvalFrameDefault+0x73e

2:011> ? 000000e9`6da1bd40 - 000000e9`6da1ba40 
Evaluate expression: 768 = 00000000`00000300

So Python will use 2 Stack frames per method call and there’s an enormous 768 bytes difference in the stack positions.

If you modify that value inside the EXE (make a backup copy) with a hex editor to, let’s say 256 MB

Python.exe edited in 010 Editor

you can run the following code

[...]
if __name__=='__main__':
    sys.setrecursionlimit(60000)
    print(fib(50000)[0])

and it will give 151 as the answer.


In C++, we can also force a Stack Overflow, e.g. by passing 500.000 as the parameter. While debugging, we get

0:000> .exr -1
ExceptionAddress: 00961015 (RecursionCpp!fib+0x00000015)
ExceptionCode: c00000fd (Stack overflow)
[...]
0:000> k
[...]
fc 00604f90 00961045     RecursionCpp!fib+0x45 [C:...RecursionCpp.cpp @ 7] 
fd 00604fb0 00961045     RecursionCpp!fib+0x45 [C:...RecursionCpp.cpp @ 7] 
fe 00604fd0 00961045     RecursionCpp!fib+0x45 [C:...RecursionCpp.cpp @ 7] 
ff 00604ff0 00961045     RecursionCpp!fib+0x45 [C:...RecursionCpp.cpp @ 7] 

0:000> ? 00604ff0 - 00604fd0
Evaluate expression: 32 = 00000020

which is just 1 stack frame per method call and only 32 bytes difference on the stack. Compared to Python, C++ can do 768/32 = 24x more recursions for the same stack size.

My Microsoft compiler created the executable with the default stack size of 1 MB (Release build, 32 Bit):

010 Editor for C++ executable

The 64 bit version has a stack difference of 64 bit (also Release build).


Tools used:

Answered By: Thomas Weller

An alternative to a trampoline is to use reduce.
if you can change the recursive function to be tail-recursive, you can implement it with reduce, here’s a possible implementation.

reduce is internally implemented iteratively, so you get to use your recursive function without blowing up the stack.

def inner_fib(acc, num):
    # acc is a list of two values 
    return [(acc[0]+acc[1]) % 997, (acc[0]+2*acc[1]) % 997]

def fib(n):
  return reduce(inner_fib, 
                range(2, n+1), # start from 2 since n=1 is covered in the base case
                [1,2])        # [1,2] is the base case

Answered By: abhinav omprakash