Python script on PBS fails with error =>> PBS: job killed: ncpus 37.94 exceeded limit 36 (sum)

Question:

I get the error in the title when I run a Python script (using Miniconda) under a PBS scheduler. I think NumPy is doing some multithreading or multiprocessing, but I can’t stop it from doing so. I added these lines to my PBS script:

export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
export OPENBLAS_NUM_THREADS=1
export VECLIB_MAXIMUM_THREADS=1

I also added these lines to my main.py, just for good measure:

import os
os.environ["OMP_NUM_THREADS"] = "1" 
os.environ["OPENBLAS_NUM_THREADS"] = "1" 
os.environ["MKL_NUM_THREADS"] = "1" 
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" 
os.environ["NUMEXPR_NUM_THREADS"] = "1" 
import numpy as np # Import numpy AFTER setting these variables

But to no avail — I still get the same error. I run my script as

qsub -q <QUEUE_NAME> -lnodes=1:ppn=36 path/to/script.sh

Sources:

Two answers that tell you how to stop all/most unwanted multithreading/multiprocessing:

https://stackoverflow.com/a/48665619/3670097, https://stackoverflow.com/a/51954326/3670097

Summarizes how to do it from within a script: https://stackoverflow.com/a/53224849/3670097

This also fails

I went through each computationally intensive NumPy call and wrapped it in a context manager:

import numpy as np
import threadpoolctl
with threadpoolctl.threadpool_limits(limits=1, user_api="blas"):
    D, P = np.linalg.eig(M)  # np.linalg.eig has no `right` keyword; it always returns right eigenvectors
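To check whether a limit like this is actually taking effect, it may help to list the native thread pools loaded in the process. The sketch below assumes threadpoolctl and NumPy are installed (it returns an empty list otherwise). describe_thread_pools is a hypothetical helper name, not part of either library.

```python
def describe_thread_pools():
    """Return (user_api, internal_api, num_threads) for each loaded native thread pool."""
    try:
        import numpy  # noqa: F401  (importing NumPy loads its BLAS library)
        from threadpoolctl import threadpool_info
    except ImportError:
        return []  # nothing to inspect without these packages
    return [
        (pool.get("user_api"), pool.get("internal_api"), pool.get("num_threads"))
        for pool in threadpool_info()
    ]


if __name__ == "__main__":
    for user_api, internal_api, num_threads in describe_thread_pools():
        print(user_api, internal_api, "threads =", num_threads)
```

If an entry reports more than one thread, the limit did not reach that library. Calling the helper inside a worker process, rather than only in the parent, shows what the workers themselves see.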

Solution

TL;DR – use joblib.Parallel instead of multiprocessing.Pool:

from joblib import Parallel, delayed
Parallel(n_jobs=-1, backend='loky')(delayed(f)(x) for x in iterator)
Asked By: Yair Daon

Answers:

Runtime fix from https://stackoverflow.com/a/57505958/3528321 :

try:
    import mkl  # only available when NumPy is built against MKL
    mkl.set_num_threads(1)
except ImportError:
    pass
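If the script stays on multiprocessing.Pool, the same runtime limit could be applied inside every worker through the pool's initializer argument. This is a minimal sketch, not a confirmed fix for the error above: init_worker and square are hypothetical names, and the pool size of 4 is arbitrary.

```python
import multiprocessing as mp
import os

THREAD_VARS = (
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "VECLIB_MAXIMUM_THREADS",
)


def init_worker():
    """Run once in each worker process before it receives any task."""
    # Environment variables only affect libraries that are loaded after this point.
    for var in THREAD_VARS:
        os.environ[var] = "1"
    # The mkl call changes the thread count at runtime, if that package exists.
    try:
        import mkl
        mkl.set_num_threads(1)
    except ImportError:
        pass


def square(x):
    # Hypothetical stand-in for the real per-item computation.
    return x * x


if __name__ == "__main__":
    with mp.Pool(processes=4, initializer=init_worker) as pool:
        print(pool.map(square, range(8)))
```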
Answered By: Dan Bonachea

It looks like the main issue came from using multiprocessing.Pool. When I switched to joblib.Parallel, the error went away. You can also try:

from joblib import Parallel, delayed, parallel_backend
with parallel_backend("loky", inner_max_num_threads=1):
    res = Parallel(n_jobs=-1)(delayed(f)(p) for p in it())

But this might be overkill and may fail (see my question for a minimal working example).
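One possible contributor to a reading like "ncpus 37.94" against a limit of 36 is that the parent process also uses CPU time on top of 36 busy workers. This is a guess, not something verified on the cluster in question. Under that assumption, a sketch for sizing the pool one below the allocation follows. Both function names are hypothetical, and the variable names PBS_NUM_PPN and NCPUS are an assumption about what the scheduler exports, so check the output of env inside a job first.

```python
import os


def allocated_cpus_from_env(default=1):
    """Guess the number of cores the scheduler granted on this node."""
    # Assumption: one of these variables is set by the scheduler.
    for name in ("PBS_NUM_PPN", "NCPUS"):
        value = os.environ.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default


def pick_worker_count(allocated_cpus, reserve_for_parent=1):
    """Leave some of the allocation for the parent process, but keep at least one worker."""
    return max(1, allocated_cpus - reserve_for_parent)


if __name__ == "__main__":
    n_workers = pick_worker_count(allocated_cpus_from_env())
    print("workers:", n_workers)
```

The result can be passed as n_jobs to joblib.Parallel, or as processes to multiprocessing.Pool, in place of -1 or the default.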

Answered By: Yair Daon