Python multiprocessing – problem manipulating data in a multiprocessing Array shared between parent and spawned class
Question:
I want to implement a way to share a table of information between a parent function and the instances of classes it will be spawning. According to what I read, I need to use a table of ctypes.c_char_p of a given length.
I have managed to initialize that table from the parent function, which I then pass to the spawned class. From the __init__() of the class I can access its contents. Then I try to manipulate them (reverse the name in this example). I confirm, from within the class, that the shared array is updated as expected, but when I try to view the contents from the parent process, I get garbage.
My code is below:
#!/usr/bin/env python3
import multiprocessing
import ctypes
import random
import json

class employee(multiprocessing.Process):
    def __init__(self, employee_data, shared_array_fields):
        self.employee_data = employee_data
        self.shared_array_fields = shared_array_fields
        # self.lock = lock
        print("**" * 100)
        print("IN class:\n", employee_data[:])
        self.run()

    def run(self):
        for ii in range(self.shared_array_fields):
            employee_string = self.employee_data[ii].decode("utf-8")
            new_name_json = json.loads(employee_string)
            new_name = new_name_json["name"][::-1]
            self.employee_data[ii] = bytes('{ "name": "' + str(new_name) + '" }', "utf-8")
        print("**" * 100)
        print("IN class AFTER manipulation:\n", self.employee_data[:])

shared_array_fields = 5

def main():
    global shared_array_fields
    lock = multiprocessing.Lock()
    employee_data = multiprocessing.Array(ctypes.c_char_p, shared_array_fields)
    for ii in range(shared_array_fields):
        name = ''.join(random.choice(['a', 'b', 'c', 'd', 'e']) for i in range(10)) + "_" + str(ii)
        employee_data[ii] = bytes('{ "name": "' + str(name) + '" }', "utf-8")
    print("**" * 100)
    print("BEFORE class:\n", employee_data[:])
    proc1 = multiprocessing.Process(target=employee, args=(employee_data, shared_array_fields))
    proc1.start()
    proc1.join()
    # time.sleep(1)
    print("**" * 100)
    print("AFTER class:\n", employee_data[:])

if __name__ == "__main__":
    main()
Result:
[http_offline@greenhat-32 tmp]$ ./temp.py
********************************************************************************************************************************************************************************************************
BEFORE class:
[b'{ "name": "abbbabeadc_0" }', b'{ "name": "daeebeeabc_1" }', b'{ "name": "dbbceedece_2" }', b'{ "name": "caccdcbeae_3" }', b'{ "name": "ccdcbdabdb_4" }']
********************************************************************************************************************************************************************************************************
IN class:
[b'{ "name": "abbbabeadc_0" }', b'{ "name": "daeebeeabc_1" }', b'{ "name": "dbbceedece_2" }', b'{ "name": "caccdcbeae_3" }', b'{ "name": "ccdcbdabdb_4" }']
********************************************************************************************************************************************************************************************************
IN class AFTER manipulation:
[b'{ "name": "0_cdaebabbba" }', b'{ "name": "1_cbaeebeead" }', b'{ "name": "2_ecedeecbbd" }', b'{ "name": "3_eaebcdccac" }', b'{ "name": "4_bdbadbcdcc" }']
********************************************************************************************************************************************************************************************************
AFTER class:
[b'', b'', b'', b'', b'{ "name": "daeebeeabc_1" }']
[http_offline@greenhat-32 tmp]$
Answers:
Using an Array of ctypes.c_char_p is very cumbersome, since it is difficult to assign a whole new value to an element. Worse, each c_char_p element stores a pointer, and when the child process assigns a new bytes value, that pointer refers to memory in the child's address space; once the child exits, the parent dereferences pointers that mean nothing to it, which is why you see garbage. Also, why go through the trouble of converting from a dictionary to a string and back just to use an Array? Finally, any structure you wish to share between processes should be created by a "manager" instance obtained from a call to multiprocessing.Manager(), unless you want to manage the synchronization yourself.
The easiest way to accomplish your goal is to have the manager create two Queue objects: an input queue (for the input to your process) and an output queue to hold the result. In this particular case you could use the same queue object for both, but two queues is cleaner and is generally what you would use when you had multiple inputs and outputs being processed simultaneously by a pool of processes and you weren't using a standard library module such as multiprocessing.pool or concurrent.futures.
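That two-queue pattern might look something like this (a minimal sketch; the reverse_names worker function and the None sentinel convention are illustrative choices, not from the question):

```python
import multiprocessing

def reverse_names(in_q, out_q):
    # Pull records until the sentinel, reverse each name, push the result.
    while True:
        record = in_q.get()
        if record is None:  # sentinel: no more input
            break
        record["name"] = record["name"][::-1]
        out_q.put(record)

def main():
    in_q = multiprocessing.Queue()
    out_q = multiprocessing.Queue()
    for ii in range(5):
        in_q.put({"name": "employee_" + str(ii)})
    in_q.put(None)  # tell the worker there is no more input
    proc = multiprocessing.Process(target=reverse_names, args=(in_q, out_q))
    proc.start()
    # Drain the output queue before joining, so the worker is never
    # blocked trying to put results into a full queue.
    results = [out_q.get() for _ in range(5)]
    proc.join()
    print(results)

if __name__ == "__main__":
    main()
```

Because each record crosses the queue as a pickled copy, there is no shared mutable state to synchronize at all.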
Finally, your Process subclass requires a bit of tweaking: your constructor needs to call the base class's constructor, and it should not call run (start arranges for run to be called in the child process). It's also usual to name your classes with a capital letter, although I left the name unchanged. I also think it's more usual not to subclass Process at all; generally one just writes a function and passes it to the Process constructor via the target, args, and/or kwargs parameters.
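For example, the same renaming task can be written as a plain function handed to Process via target and args (a sketch; reverse_names is a made-up name):

```python
import multiprocessing

def reverse_names(employees):
    # Reverse each name in a managed list of dicts. Note that the element
    # must be reassigned by index for the change to reach the manager.
    for i, record in enumerate(employees):
        record["name"] = record["name"][::-1]
        employees[i] = record

def main():
    manager = multiprocessing.Manager()
    employees = manager.list([{"name": "name_" + str(ii)} for ii in range(5)])
    proc = multiprocessing.Process(target=reverse_names, args=(employees,))
    proc.start()
    proc.join()
    print(list(employees))

if __name__ == "__main__":
    main()
```

Manager proxies are picklable, so passing the list through args works exactly like passing it to a subclass's constructor.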
Update to Use a Managed List of Dictionaries
#!/usr/bin/env python3
import multiprocessing
import random

class employee(multiprocessing.Process):
    def __init__(self, employees):
        super().__init__()  # init the base class !!!
        self.employees = employees
        print("**" * 100)
        print("IN class:\n", self.employees[:])

    def run(self):
        employees = self.employees
        for i, employee in enumerate(employees):
            new_name = employee["name"][::-1]
            employee["name"] = new_name
            employees[i] = employee  # must be rewritten to show it has changed. Yuck!
        #self.employees = employees
        print("**" * 100)
        print("IN class AFTER manipulation:\n", employees[:])

def main():
    manager = multiprocessing.Manager()
    employees = manager.list()
    for ii in range(5):
        name = ''.join(random.choice(['a', 'b', 'c', 'd', 'e']) for i in range(10)) + "_" + str(ii)
        employees.append({"name": name})
    print("**" * 100)
    print("BEFORE class:\n", employees[:])
    proc1 = employee(employees)
    proc1.start()
    proc1.join()
    # time.sleep(1)
    print("**" * 100)
    print("AFTER class:\n", employees[:])

if __name__ == "__main__":
    main()
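One subtlety worth spelling out about the "must be rewritten" line: indexing a manager list returns a local copy of the stored dict, so mutating that copy in place is invisible to other processes until the element is reassigned. A small sketch of the pitfall (names are illustrative):

```python
import multiprocessing

def main():
    manager = multiprocessing.Manager()
    employees = manager.list([{"name": "alice"}])

    # Indexing the proxy returns a local copy of the plain dict, so
    # mutating it does NOT propagate back to the manager process.
    record = employees[0]
    record["name"] = "bob"
    print(employees[0]["name"])  # still "alice"

    # Reassigning the element sends the modified copy back to the manager.
    employees[0] = record
    print(employees[0]["name"])  # now "bob"

if __name__ == "__main__":
    main()
```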