Can I download a large file in the background using aiohttp?

Question:

I’d like to download a series of large (~200 MB) files, and use the time while they’re downloading to do some CPU-intensive processing. I’m investigating asyncio and aiohttp. My understanding is that I can use them to start a large download and then do some heavy computation on the same thread while the download continues in the background.

What I am finding, however, is that the download is paused while the heavy CPU process continues, then resumes as soon as the calculation is done. I include a minimal example below. I visually monitor the process CPU and bandwidth while the script is running. It’s clear the download pauses during the ~30s of computation. Am I doing something wrong? Or am I not understanding what aiohttp can do?

import asyncio
import time

import aiofiles
import aiohttp


async def download(session):
    url = 'https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-s390x.sh'  # 280 MB file
    async with session.get(url) as resp:
        async with aiofiles.open('./tmpfile', mode='wb') as f:
            print('Starting the download')
            data = await resp.read()
            print('Starting the file write')
            await f.write(data)
            print('Download completed')


async def heavy_cpu_load():
    await asyncio.sleep(5)  # make sure the download has started
    print('Starting the computation')
    for i in range(200000000):  # takes about 30 seconds on my laptop.
        i ** 0.5
    print('Finished the computation')


async def main():
    async with aiohttp.ClientSession() as session:
        timer = time.time()
        tasks = [download(session), heavy_cpu_load()]
        await asyncio.gather(*tasks)
        print(f'All tasks completed in {time.time() - timer}s')


if __name__ == '__main__':
    asyncio.run(main())
Asked By: rowan_uk

||

Answers:

I think what happens is that heavy_cpu_load never awaits inside its loop, so it never hands control back to the event loop. asyncio runs all of its coroutines on a single thread, so the download [1] cannot make progress until the computation finishes.

But if you put await asyncio.sleep(0) [2] after i ** 0.5, it will work.
The await gives the event loop a chance to run any other task that is ready.


[1] resp.read()
[2] A common way to yield control to the event loop on purpose.
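
Awaiting on every one of 200 million iterations would add a lot of overhead, so a variation is to yield only every so often. The sketch below is an illustration, not a measured recommendation: the `yield_every` parameter, the `events` list, and the `ticker` coroutine (a stand-in for the download) are hypothetical names introduced here. Note that this is still cooperative scheduling on one thread, so the other task only advances at the yield points.

```python
import asyncio


async def heavy_cpu_load(events, n, yield_every=100_000):
    """CPU-bound loop that yields to the event loop every `yield_every` steps."""
    total = 0.0
    for i in range(n):
        total += i ** 0.5
        if i % yield_every == 0:
            await asyncio.sleep(0)  # let other ready tasks run
    events.append('cpu done')
    return total


async def ticker(events, count):
    """Stand-in for the download: records a tick whenever it gets to run."""
    for _ in range(count):
        events.append('tick')
        await asyncio.sleep(0)


async def demo(n=1_000_000):
    events = []
    # Both coroutines share the single event-loop thread.
    total, _ = await asyncio.gather(heavy_cpu_load(events, n), ticker(events, 5))
    return total, events


if __name__ == '__main__':
    total, events = asyncio.run(demo())
    print(events)
```

If the ticks show up in the list before 'cpu done', the other task was able to run while the computation was still in progress.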

Answered By: Emanuel

If you have a calculation that doesn’t need to await anything, then you can (and should) run it in a separate thread using loop.run_in_executor, so that it runs in the background. Threading is what lets the computer run tasks concurrently (although not necessarily truly simultaneously): the interpreter switches between threads regularly, so the event loop keeps servicing the download while the calculation runs.

import asyncio
import time

import aiofiles
import aiohttp

async def download(session):
    url = 'https://repo.anaconda.com/archive/Anaconda3-2022.10-Linux-s390x.sh'  # 280 MB file
    async with session.get(url) as resp:
        async with aiofiles.open('./tmpfile', mode='wb') as f:
            print('Starting the download')
            data = await resp.read()
            print('Starting the file write')
            await f.write(data)
            print('Download completed')

# not async
def heavy_cpu_load():
    print('Starting the computation')
    for i in range(200000000):  # takes about 30 seconds on my laptop.
        i ** 0.5
    print('Finished the computation')

async def main():
    async with aiohttp.ClientSession() as session:
        timer = time.time()
        tasks = [
            download(session),
            # run the blocking function in the default thread pool executor
            asyncio.get_running_loop().run_in_executor(None, heavy_cpu_load),
        ]
        await asyncio.gather(*tasks)
        print(f'All tasks completed in {time.time() - timer}s')
asyncio.run(main())
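
On Python 3.9 or newer, asyncio.to_thread is a shorter way to push a blocking function onto the default thread pool. The sketch below assumes that Python version; `heartbeat` and `stamps` are hypothetical names standing in for the download so that the example runs without network access.

```python
import asyncio
import time


def heavy_cpu_load(n):
    """Plain (non-async) CPU-bound function."""
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total


async def heartbeat(stamps, beats, interval):
    """Stand-in for the download: records a timestamp each time it runs."""
    for _ in range(beats):
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def demo(n=2_000_000):
    stamps = []
    # to_thread runs heavy_cpu_load in a worker thread, so the event loop
    # stays free to run heartbeat in the meantime.
    total, _ = await asyncio.gather(
        asyncio.to_thread(heavy_cpu_load, n),
        heartbeat(stamps, 5, 0.01),
    )
    return total, stamps


if __name__ == '__main__':
    total, stamps = asyncio.run(demo())
    print(f'{len(stamps)} heartbeats recorded, total={total:.1f}')
```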

Using numpy (which releases the GIL during many of its operations) and aiohttp (network I/O also releases the GIL) should allow your calculations and download to run almost in parallel using only threads. An alternative is to use ProcessPoolExecutor, as outlined in the linked documentation, which lets a function that doesn’t release the GIL run in parallel (with a lot of extra limitations).
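
A minimal sketch of the ProcessPoolExecutor variant follows. The names `sum_of_roots`, `fake_download`, and `run_both` are hypothetical, and the download is simulated with asyncio.sleep so the example needs no network. The executor is passed in as a parameter, so passing a ThreadPoolExecutor instead gives the thread-based behaviour for comparison. One of the limitations mentioned above is visible here: the function and its arguments must be picklable, which in practice means a module-level function.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def sum_of_roots(n):
    """CPU-bound work; defined at module level so it can be pickled."""
    total = 0.0
    for i in range(n):
        total += i ** 0.5
    return total


async def fake_download(seconds):
    """Stand-in for the aiohttp download."""
    await asyncio.sleep(seconds)
    return 'downloaded'


async def run_both(executor, n):
    """Run the simulated download and the computation side by side."""
    loop = asyncio.get_running_loop()
    return await asyncio.gather(
        fake_download(0.1),
        loop.run_in_executor(executor, sum_of_roots, n),
    )


def main():
    # A separate process has its own interpreter and its own GIL.
    with ProcessPoolExecutor(max_workers=1) as pool:
        print(asyncio.run(run_both(pool, 1_000_000)))
```

When using a process pool, call main() from under an if __name__ == '__main__': guard, since on some platforms the worker processes re-import the main module.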

Answered By: Ahmed AEK