Dictionary is empty after inserting tons of items into it – multiprocessing and sharing global variables

Question:

I am scraping a page source that was already downloaded, and I am using multiprocessing to make this faster. Each file is a page with N news articles; after the desired info is scraped off, each article is saved in a dictionary with its id as the key. In the end I would like to convert this dictionary to JSON.

    def handler(self, file):
        articles = self.open_text_file(self.path + file)
        for n in articles:
            try:
                id = self.investing.extract_id(n)
                image_url = self.investing.extract_image(n)
                url, title = self.investing.extract_title(n)
                id = url.split('-')[-1]
                text = self.investing.extract_text(n)

                self.news[id] = {'url': self.base_url + url,
                                 'image': image_url,
                                 'title': title,
                                 'text': text}
            except KeyError:
                pass

    def extract_info(self):
        task = self.create_tasks()
        with Pool(CPU_COUNT) as pool:
            pool.map(self.handler, task)

        with open('financial_news.json', 'a', encoding='utf8') as file:
            json.dump(self.news, file, indent=4)

This returns an empty JSON file, without any news. Do you have any idea what I am doing wrong?

I would like to end up with a dict something like this:

"2998451": {
    "url": 
    "image": 
    "title": 
    "text": 
},
"2998427": {
    "url": 
    "image": 
    "title": 
    "text": 
},
"2998412": {
    "url": 
    "image": 
    "title": 
    "text": 
}
Asked By: Bruno Finger


Answers:

You are using multiprocessing. Each process gets its own copy of self. Each process is modifying its own copy of self.news, and the self.news in your main process is never touched.
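You can see the effect with a tiny standalone example (the names here are made up, not taken from your class): each worker mutates its own copy of the object, and the parent's dictionary stays empty.

    from multiprocessing import Pool

    class Scraper:
        def __init__(self):
            self.news = {}                 # lives in the parent process

        def handler(self, n):
            # Runs in a worker process on a *copy* of self, so this key
            # never reaches the parent's self.news.
            self.news[n] = f'article {n}'

    if __name__ == '__main__':
        scraper = Scraper()
        with Pool(2) as pool:
            pool.map(scraper.handler, range(4))
        print(scraper.news)                # prints {}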

The simplest way to achieve what you want is to have your worker processes put their partial results on a multiprocessing queue, and then your main process can read the partial results off that queue and save them in the one and only self.news that matters.


Actually, I’ve changed my mind.

    def handler(self, file):
        news = {}
        ....
        ... same code but use news rather than self.news ...
        ...
        return news

    def extract_info(self):
        self.news = {}
        task = self.create_tasks()
        with Pool(CPU_COUNT) as pool:
            for news in pool.map(self.handler, task):
                self.news |= news
        ...

More or less the same code you’ve got, but the tasks don’t try to set a global variable. They just return their result to the main process, which uses it to update the main instance of self.news.
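For completeness, here is a filled-in sketch of that pattern. It assumes the same helpers from your question (self.open_text_file, self.create_tasks and the self.investing.extract_* methods) and otherwise just rearranges your own code:

    def handler(self, file):
        news = {}                          # local to this worker process
        articles = self.open_text_file(self.path + file)
        for n in articles:
            try:
                image_url = self.investing.extract_image(n)
                url, title = self.investing.extract_title(n)
                id = url.split('-')[-1]
                text = self.investing.extract_text(n)
                news[id] = {'url': self.base_url + url,
                            'image': image_url,
                            'title': title,
                            'text': text}
            except KeyError:
                pass
        return news                        # pickled and sent back to the parent

    def extract_info(self):
        self.news = {}
        task = self.create_tasks()
        with Pool(CPU_COUNT) as pool:
            for news in pool.map(self.handler, task):
                self.news |= news          # Python 3.9+; see the note below

        # 'w' rather than 'a', so reruns overwrite instead of appending
        # a second JSON document to the same file.
        with open('financial_news.json', 'w', encoding='utf8') as file:
            json.dump(self.news, file, indent=4)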

If you’re using a Python version older than 3.9, use self.news.update(news) instead of the |= operator.

It may also be more efficient to use imap or imap_unordered, since you don’t care what order your results come back in. See the multiprocessing documentation for more details.
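For example, keeping the merge loop from the sketch above, only the mapping call changes (imap_unordered is a standard multiprocessing.Pool method):

    def extract_info(self):
        self.news = {}
        task = self.create_tasks()
        with Pool(CPU_COUNT) as pool:
            # Merge each partial result as soon as its worker finishes,
            # in whatever order the files complete.
            for news in pool.imap_unordered(self.handler, task):
                self.news.update(news)     # also fine on Python < 3.9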


If you do decide you want to use queues:

    # needs: import multiprocessing as mp
    def handler(self, task, queue):
        ......
        for n in articles:
            ....
            queue.put([id, .... info ....])
        queue.put(None)  # marker: this handler is done

    def extract_info(self):
        manager = mp.Manager()
        queue = manager.Queue()
        task = self.create_tasks()
        count = len(task)
        self.news = {}

        with Pool(CPU_COUNT) as pool:
            pool.starmap(self.handler, [(t, queue) for t in task])
            # starmap blocks until every handler has finished, so by now
            # the queue holds all the results plus one None per task.
            while count > 0:
                item = queue.get()
                if item is None:
                    count -= 1
                else:
                    id, value = item
                    self.news[id] = value
        ....

Answered By: Frank Yellin