Cannot get parallel processing to work in Python with the multiprocessing library: the class is initialized several times

Question

At first I got the error "Can't pickle local object". I found a suggestion to use multiprocess instead of the multiprocessing library, but now the class that contains the method is initialized as many times as there are worker processes. In addition, the data is either not saved or is lost, so the output is incorrect.

import pathlib
import pandas as pd
from datetime import datetime
import csv
import re
import multiprocess as mp

class InputConect:
    def __init__(self):
        self.file_name = input('Введите название файла: ')
        self.filter_param = input('Введите название профессии: ')

    @staticmethod
    def print_data(file_name, filter_param):

        salary_by_years = {year: 0 for year in unique_years}
        vacs_by_years = {year: 0 for year in unique_years}
        vac_salary_by_years = {year: 0 for year in unique_years}
        vac_counts_by_years = {year: 0 for year in unique_years}

        def make_statistic(file):
            #writes data to the dictionary like this:
            salary_by_years[year] = int(one_year_vacancies.salary.mean())
            vacs_by_years[year] = one_year_vacancies.shape[0]

        if __name__ == '__main__':
             with mp.Pool() as p:
                 mp.freeze_support()
                 p.map(make_statistic, filelist)
                 p.close()
                 p.join()
            # m = mp.map(target=make_statistic, args=filelist)
            # m.start()
            # m.join()

        print('Динамика уровня зарплат по годам:', salary_by_years)
        print('Динамика количества вакансий по годам:', vacs_by_years)
        print('Динамика уровня зарплат по годам для выбранной профессии:', vac_salary_by_years)
        print('Динамика количества вакансий по годам для выбранной профессии:', vac_counts_by_years)

parameters = InputConect()
InputConect.print_data(parameters.file_name, parameters.filter_param)

Output:
Введите название файла: vacancies_by_year.csv
Введите название профессии: Аналитик
Введите название файла: Введите название файла: Введите название файла: Введите название файла: vacancies_by_year.csv
Введите название профессии: Аналитик
Введите название профессии: Аналитик
Введите название профессии: Аналитик
Введите название профессии: Аналитик

Динамика уровня зарплат по годам: {2007: 0, 2008: 0, 2009: 0, 2010: 0, 2011: 0, 2012: 0, 2013: 0, 2014: 0, 2015: 0, 2016: 0, 2017: 0, 2018: 0, 2019: 0, 2020: 0, 2021: 0, 2022: 0}

Динамика количества вакансий по годам: {2007: 0, 2008: 0, 2009: 0, 2010: 0, 2011: 0, 2012: 0, 2013: 0, 2014: 0, 2015: 0, 2016: 0, 2017: 0, 2018: 0, 2019: 0, 2020: 0, 2021: 0, 2022: 0}

Динамика уровня зарплат по годам для выбранной профессии: {2007: 0, 2008: 0, 2009: 0, 2010: 0, 2011: 0, 2012: 0, 2013: 0, 2014: 0, 2015: 0, 2016: 0, 2017: 0, 2018: 0, 2019: 0, 2020: 0, 2021: 0, 2022: 0}

Динамика количества вакансий по годам для выбранной профессии: {2007: 0, 2008: 0, 2009: 0, 2010: 0, 2011: 0, 2012: 0, 2013: 0, 2014: 0, 2015: 0, 2016: 0, 2017: 0, 2018: 0, 2019: 0, 2020: 0, 2021: 0, 2022: 0}

Asked By: MpirtGod

Answer 1

This is much too long for a comment and so:

You have several issues with your code, but mainly you still do not have a minimal, reproducible example:

1. filelist in method print_data appears to be undefined, and print_data is passed two arguments that are never referenced. This makes very little sense. In function make_statistic, the file argument is not referenced and one_year_vacancies is undefined. After dictionaries vac_salary_by_years and vac_counts_by_years are initialized, they are never modified. Is that really correct?
2. You have if __name__ == '__main__': in the wrong place. This test only makes sense at module (global) scope in your main script.
3. Under Windows or any platform that uses the spawn method to create new processes, any worker function for your child processes, make_statistic in your case, needs to be defined at module scope. Also, since this function is running in a different address space, it cannot modify the copies of salary_by_years and vacs_by_years that are in the main process.
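
The point about separate address spaces can be checked with a small experiment. What follows is a minimal sketch, not the asker's code: it uses the standard-library multiprocessing module (multiprocess exposes the same Pool API), and the names counts, mutate and main are made up for illustration.

```python
import multiprocessing as mp  # stdlib; `multiprocess` exposes the same Pool API

counts = {2007: 0}  # hypothetical dictionary owned by the parent process


def mutate(year):
    """Runs in a child process and changes only the child's copy of `counts`."""
    counts[year] = 99
    return year, counts[year]  # returning is how data gets back to the parent


def main():
    with mp.Pool(1) as pool:
        returned = pool.map(mutate, [2007])
    # `counts` here is the parent's copy, untouched by the worker
    return returned, counts


if __name__ == '__main__':
    returned, parent_counts = main()
    print('returned by worker:', returned)
    print('parent dictionary: ', parent_counts)
```

The tuple returned by mutate carries the value 99, while the dictionary in the parent still holds 0. That is why the worker has to return its results instead of writing to a dictionary it shares only by name with the parent.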

You have also created a class InputConect, but except for the __init__ method, all the other methods are static and therefore have no access to the attributes file_name and filter_param. If print_data were an instance method rather than a static method, that would be a different situation. It would also make your class more reusable if it were not responsible for reading file_name and filter_param from the console or for printing results; instead, these values should be passed to the class instance and the results returned from it. This would allow the class to be used where the values do not come from console input and where the output needs to go somewhere other than the console. The idea is to separate business logic from input/output where you can.

This is the general idea (but I cannot fix the undefined and unreferenced variables that you have):

import pathlib
import pandas as pd
from datetime import datetime
import csv
import re
import multiprocess as mp

def make_statistic(file):
    #writes data to the dictionary like this:
    #salary_by_years[year] = int(one_year_vacancies.salary.mean())
    #vacs_by_years[year] = one_year_vacancies.shape[0]

    # Return necessary values:
    return year, int(one_year_vacancies.salary.mean()), one_year_vacancies.shape[0]
    

class InputConect:
    def __init__(self, file_name, filter_param):
        self.file_name = file_name
        self.filter_param = filter_param

    def compute(self):

        salary_by_years = {year: 0 for year in unique_years}
        vacs_by_years = {year: 0 for year in unique_years}
        vac_salary_by_years = {year: 0 for year in unique_years}
        vac_counts_by_years = {year: 0 for year in unique_years}

        with mp.Pool() as p:
            # map() blocks until every worker has returned its tuple
            results = p.map(make_statistic, filelist)
            # Process each tuple returned by `make_statistic`:
            for year, value1, value2 in results: # Unpack the tuple
                salary_by_years[year] = value1
                vacs_by_years[year] = value2

        # Return values rather than printing for greater reusability
        return salary_by_years, vacs_by_years, vac_salary_by_years, vac_counts_by_years

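if __name__ == '__main__':
    mp.freeze_support()  # must be called first; has no effect unless the program is a frozen executable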
    file_name = input('Введите название файла: ')
    filter_param = input('Введите название профессии: ')
    input_connect = InputConect(file_name, filter_param)
    salary_by_years, vacs_by_years, vac_salary_by_years, vac_counts_by_years = input_connect.compute()

    print('Динамика уровня зарплат по годам:', salary_by_years)
    print('Динамика количества вакансий по годам:', vacs_by_years)
    print('Динамика уровня зарплат по годам для выбранной профессии:', vac_salary_by_years)
    print('Динамика количества вакансий по годам для выбранной профессии:', vac_counts_by_years)
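
For readers who want something that runs end to end, here is a minimal sketch along the same lines. It rests on several assumptions that are not in the question: each file holds one year of data and is named like 2007.csv, it has a salary column, and the standard-library csv and multiprocessing modules stand in for pandas and multiprocess. The helper write_sample_files and its data are made up so the sketch can run anywhere.

```python
import csv
import multiprocessing as mp  # stdlib; `multiprocess` exposes the same Pool API
import tempfile
from pathlib import Path


def make_statistic(path):
    """Worker at module scope: read one per-year CSV and return its summary."""
    with open(path, newline='', encoding='utf-8') as f:
        salaries = [float(row['salary']) for row in csv.DictReader(f)]
    year = int(Path(path).stem)  # assumes files are named like 2007.csv
    mean_salary = int(sum(salaries) / len(salaries)) if salaries else 0
    return year, mean_salary, len(salaries)


def compute(filelist):
    """Parent side: fan out over the files, then merge the returned tuples."""
    with mp.Pool() as pool:
        results = pool.map(make_statistic, filelist)
    salary_by_years = {year: mean for year, mean, _ in results}
    vacs_by_years = {year: count for year, _, count in results}
    return salary_by_years, vacs_by_years


def write_sample_files(folder):
    """Create two tiny made-up CSV files (hypothetical data)."""
    samples = {2007: [100, 200], 2008: [300]}
    paths = []
    for year, salaries in samples.items():
        path = Path(folder) / f'{year}.csv'
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(['salary'])
            writer.writerows([s] for s in salaries)
        paths.append(str(path))
    return paths


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as folder:
        print(compute(write_sample_files(folder)))
```

The worker only reads and returns; all dictionary building happens in the parent, so nothing depends on state shared between processes.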

Answered By: Booboo

Answer 2

I realized that the classes were not needed and removed them, which solved the "Can't pickle local object" problem. The problem with multiple calls was due to the incorrectly placed if __name__ == '__main__': statement. As Booboo said: "You have if __name__ == '__main__': in the wrong place. This test only makes sense at module (global) scope in your main script."
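
Why the placement of the guard caused the repeated prompts can be seen in a small layout sketch. It is an illustration under assumptions, not the code from the question: the standard-library multiprocessing module is used, and square and main are hypothetical names. With the spawn start method every worker re-imports the main script, so anything at module scope runs again in each child.

```python
import multiprocessing as mp  # stdlib; `multiprocess` exposes the same Pool API
import os

# Module scope: with the spawn start method every worker re-imports this
# file, so this line runs in the parent and again in each child.
# An input() call placed here would prompt once per process.
print(f'module loaded in process {os.getpid()}')


def square(x):
    """Worker function, defined at module scope so children can import it."""
    return x * x


def main():
    numbers = [1, 2, 3]  # stand-in for values the question reads with input()
    with mp.Pool(2) as pool:
        return pool.map(square, numbers)


if __name__ == '__main__':
    # Only the parent gets here; a re-imported copy sees a different __name__.
    print(main())
```

With spawn, which is the default on Windows and macOS, the first print appears once per process; with fork it appears only once, which is why the problem may not show up on every platform.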

Answered By: MpirtGod
