Python async: what is causing the memory leak?

Question:

I am downloading zip files and inspecting their contents for a few million items, but memory usage keeps growing until the process eventually goes OOM, even with small semaphore values.

Consider the block:

    async def zip_reader(self, blobFileName, blobEndPoint, semaphore):

        try:
            # access blob
            async with ClientSecretCredential(TENANT, CLIENTID, CLIENTSECRET) as credential:
                async with BlobServiceClient(account_url="https://blob1.blob.core.windows.net/", credential=credential, max_single_get_size=64 * 1024 * 1024, max_chunk_get_size=32 * 1024 * 1024) as blob_service_client:
                    async with blob_service_client.get_blob_client(container=blobEndPoint, blob=blobFileName) as blob_client:
                        async with semaphore:
                            logger.info(f"Starting: {blobFileName}, {blobEndPoint}")

                            # open bytes
                            writtenbytes = io.BytesIO()

                            # write file to it
                            stream = await blob_client.download_blob(max_concurrency=25)
                            stream = await stream.readinto(writtenbytes)

                            # zipfile
                            f = ZipFile(writtenbytes)

                            # file list
                            file_list = [s for s in f.namelist()]

                            # send to df
                            t_df = pd.DataFrame({'fileList': file_list})

                            # add fileName
                            t_df['blobFileName'] = blobFileName
                            t_df['blobEndPoint'] = blobEndPoint

                            if semaphore.locked():
                                await asyncio.sleep(1)

                            logger.info(f"Completed: {blobFileName}")

                            # clean up here; also tried del on objs here as well
                            self.cleanup()

                            return t_df


    async def cleanup(self):
        gc.collect()
        await asyncio.sleep(1)


    async def async_file_as_bytes_generator(self, blobFileName, blobEndPoint, semaphore):
        """
        main caller
        """
        semaphore = asyncio.Semaphore(value=semaphore)
        return await asyncio.gather(*[self.zip_reader(fn, ep, semaphore) for fn, ep in zip(blobFileName, blobEndPoint)])  # also tried attaching the semaphore here

Asked By: John Stud


Answers:

asyncio.gather has no strategy at all to limit the number of tasks running simultaneously. Your semaphore may limit how many blobs are being fetched and processed at once, but gather waits until every dataframe is available and then returns them all at once.
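To make that concrete (a minimal sketch of the pattern in the question, not your exact call site): every result produced by the coroutines passed to gather stays referenced by gather itself until the last one finishes, so the peak memory use is roughly the sum of all the dataframes.

    # Sketch: gather only returns after *all* coroutines have finished,
    # so every dataframe it has collected stays alive in memory until then.
    results = await asyncio.gather(
        *[self.zip_reader(fn, ep, semaphore) for fn, ep in zip(blobFileName, blobEndPoint)]
    )
    # Nothing in `results` could be consumed (and freed) before this point.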

Instead of a single await asyncio.gather, use something like asyncio.wait (optionally with a timeout) to keep control of how many tasks are running, and yield the completed dataframes as they become ready.

Also, you didn't show the rest of your program, the part leading up to the call to async_file_as_bytes_generator, but it will have to consume the dataframes as they are yielded and dispose of them, of course.

And there is no need for explicit calls to gc.collect: in practice it accomplishes nothing here. Python frees your memory on its own, provided your program is correct and keeps no references to the objects consuming it; otherwise there is nothing gc.collect could do anyway.
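A tiny sketch of why (make_big_object is a hypothetical stand-in for your per-blob dataframe): as long as something still refers to the objects, collecting reclaims nothing.

    import gc

    results = []
    for i in range(1_000):
        results.append(make_big_object(i))  # hypothetical factory for a large object
        gc.collect()  # reclaims nothing here: every object is still reachable via `results`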

Your "main caller" can be something along this – but as I denoted, you have to check the code that calls it so that it consumes each dataframe at once, and not expect a list with all dataframes as your current code do.



    async def async_file_as_bytes_generator(self, blobFileName, blobEndPoint, task_limit):
        """
        main caller: yields each dataframe as soon as it is ready
        """
        semaphore = asyncio.Semaphore(value=task_limit)

        # coroutines that still have to be started
        all_tasks = {self.zip_reader(fn, ep, semaphore) for fn, ep in zip(blobFileName, blobEndPoint)}
        current_tasks = set()
        while all_tasks or current_tasks:
            # top up the running set, never exceeding task_limit
            while all_tasks and len(current_tasks) < task_limit:
                current_tasks.add(asyncio.create_task(all_tasks.pop()))

            done, incomplete = await asyncio.wait(current_tasks, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                # optionally check for a task exception here
                yield task.result()
            current_tasks = incomplete
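
For completeness, the calling code would then consume the generator with async for and let each dataframe go out of scope before the next one arrives. This is only a sketch: reader, handle and the argument lists are hypothetical placeholders for your own objects.

    # hypothetical driver, assuming `reader` is an instance of your class
    async def run(reader, file_names, end_points):
        async for t_df in reader.async_file_as_bytes_generator(file_names, end_points, task_limit=10):
            handle(t_df)  # hypothetical sink: persist or aggregate the result
            # no reference to t_df is kept past this point, so its memory can be reused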
Answered By: jsbueno