OpenMP / Pybind11: Accessing Python object in for loop returns interned string error
Question:
I am trying to use OpenMP on a list of Python objects via pybind11 in C++. I convert this list into a std::vector of Python objects (as explained in this post) and then try to access them in a parallelized for loop. However, when accessing the attributes of any Python object in the vector inside the for loop, I get the error:
Fatal Python error: deletion of interned string failed
Thread 0x00007fd282bc7700 (most recent call first):
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
My questions are: what is the "deletion of interned string" error, and how can I avoid it with OpenMP?
I have read here that the problem relates to copying the string, so I tried referring to the string through a pointer, but it didn't help. The problem also doesn't come from a conversion issue in pybind11, because if I remove the #pragma omp directive, the code works perfectly.
C++ Code
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <pybind11/stl.h>
#include <omp.h>
#include <chrono>
#include <thread>

namespace py = pybind11;

py::object create_seq(py::object self) {
    std::vector<py::object> dict = self.cast<std::vector<py::object>>();
    #pragma omp parallel for
    for (unsigned int i = 0; i < dict.size(); i++) {
        dict[i].attr("attribute") = 2;
    }
    return self;
}

PYBIND11_MODULE(error, m) {
    m.doc() = "pybind11 module for iterating over generations";
    m.def("create_seq", &create_seq,
          "the function which creates a sequence");
}
PYBIND11_MODULE(error, m){
m.doc() = "pybind11 module for iterating over generations";
m.def("create_seq", &create_seq,
"the function which creates a sequence");
}
Python Code
import error
class test():
    def __init__(self):
        self.attribute = None

if __name__ == '__main__':
    dict = {}
    for i in range(50):
        dict[i] = test()
    pop = error.create_seq(list(dict.values()))
Compiled with:
g++ -O3 -Wall -shared -std=c++14 -fopenmp -fPIC `python3 -m pybind11 --includes` openmp.cpp -o error.so
Answers:
You cannot reliably call any Python C-API code (which underlies pybind11) without holding the Global Interpreter Lock (GIL). Acquiring the GIL in your OpenMP loop for each access on each thread will effectively serialize the loop, but now with added locking overhead, so it will be slower than running it serially in the first place.
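If you do want to keep the OpenMP loop, one possible pattern (a sketch, not tested against your build) is to release the GIL on the calling thread before the parallel region with py::gil_scoped_release, then have each worker re-acquire it with py::gil_scoped_acquire only around the Python-object access. As noted above, this serializes the Python work, so it only pays off if each iteration also does substantial pure-C++ computation outside the lock:

```cpp
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <omp.h>
#include <vector>

namespace py = pybind11;

py::object create_seq(py::object self) {
    // The binding thread holds the GIL here, so this cast is safe.
    std::vector<py::object> dict = self.cast<std::vector<py::object>>();

    {
        // Release the GIL so the worker threads can take turns
        // acquiring it inside the loop.
        py::gil_scoped_release release;

        #pragma omp parallel for
        for (long i = 0; i < static_cast<long>(dict.size()); i++) {
            // ... heavy pure-C++ work could go here, GIL-free ...

            // Re-acquire the GIL only for the Python-object access;
            // this part runs one thread at a time.
            py::gil_scoped_acquire acquire;
            dict[i].attr("attribute") = 2;
        }
    }   // GIL is re-acquired here before any py::object is copied/destroyed.

    return self;
}
```

The inner braces matter: the py::gil_scoped_release must go out of scope (re-acquiring the GIL) before the py::object return value is copied and the vector of py::object handles is destroyed, since both touch reference counts.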
As for interned strings: the Python interpreter saves common immutable objects such as certain strings and small integers to prevent them from being created over and over again. Such common strings are said to be "interned", and this typically happens under the hood (although you can intern your own using PyString_InternFromString/PyUnicode_InternFromString). Since these are singleton objects by design (that's their purpose, after all), only one thread should create/delete them.
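Interning can be observed from pure Python via sys.intern (a minimal illustration, separate from the crash itself):

```python
import sys

# Build two equal strings at runtime so the compiler cannot
# constant-fold them into a single shared object.
a = "".join(["fit", "ness"])
b = "".join(["fit", "ness"])

# Interning maps equal strings to one canonical object, so the
# interned versions are the very same object.
a_i = sys.intern(a)
b_i = sys.intern(b)
assert a_i is b_i

# Attribute names such as "attribute" in the pybind11 code are
# interned internally by CPython, which is consistent with the
# "deletion of interned string failed" crash when several threads
# manipulate their reference counts without holding the GIL.
```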
I was able to find a solution, but I think I am just doing single-threaded work with multiple threads. I used #pragma omp ordered in the following way:
std::vector<py::object> dict = self.cast<std::vector<py::object>>();
#pragma omp parallel for ordered schedule(dynamic)
for (unsigned int i = 0; i < dict.size(); i++) {
    py::object genome = dict[i];
    std::cout << i << std::endl;
    #pragma omp ordered
    genome.attr("fitness") = 2;
}
And this works.
EDIT
I measured the execution time with and without parallelization and it is the same, which makes sense: the omp ordered block forces the Python attribute writes to execute one at a time, in iteration order.