Python asyncio: process a potentially infinite list

Question:

I have the following scenario:

  • Python 3.6+
  • The input data is read from a file, line by line.
  • A coroutine sends the data to an API (using aiohttp) and saves the result of the call to Mongo (using motor). So there’s a lot of IO going on.

The code is written using async / await, and works just fine for individual calls executed manually.
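For context, the per-line coroutine looks roughly like this sketch (the endpoint URL, payload shape, and database/collection names are placeholders, not my real code):

import aiohttp
import motor.motor_asyncio

mongo = motor.motor_asyncio.AsyncIOMotorClient()
results = mongo.mydb.results  # placeholder database/collection

async def process_line(session: aiohttp.ClientSession, line: str):
    # send the line to the API (placeholder endpoint and payload)
    async with session.post('https://api.example.com/process', json={'data': line}) as resp:
        payload = await resp.json()
    # save the API response to Mongo
    await results.insert_one({'line': line, 'result': payload})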

What I don’t know how to do is to consume the input data en masse.

All asyncio examples I’ve seen demonstrate asyncio.wait by sending a finite list as a parameter. But I can’t simply send a list of tasks to it, because the input file may have millions of rows.

My scenario is about streaming data, as if through a conveyor belt, to a consumer.

What else can I do? I want the program to process the data in the file using all the resources it can muster, but without getting overwhelmed.

Asked By: Andrei


Answers:

My scenario is about streaming data, as if through a conveyor belt, to a consumer. What else can I do?

You can create a fixed number of tasks roughly corresponding to the capacity of your conveyor belt, and pop them off a queue. For example:

import asyncio

async def consumer(queue):
    while True:
        line = await queue.get()
        # connect to API, Mongo, etc.
        ...
        queue.task_done()

async def producer():
    N_TASKS = 10
    loop = asyncio.get_event_loop()
    queue = asyncio.Queue(N_TASKS)
    tasks = [loop.create_task(consumer(queue)) for _ in range(N_TASKS)]
    try:
        with open('input') as f:
            for line in f:
                await queue.put(line)
        await queue.join()
    finally:
        for t in tasks:
            t.cancel()

Since, unlike threads, tasks are lightweight and do not hog operating system resources, it is fine to err on the side of creating “too many” of them. asyncio can handle thousands of tasks without a hitch, although that is probably overkill for this task; tens will suffice.
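To actually run the snippet above on Python 3.6, where asyncio.run does not exist yet, a minimal entry point (not part of the original answer) could be:

import asyncio

# on Python 3.7+ this can be replaced with asyncio.run(producer())
loop = asyncio.get_event_loop()
loop.run_until_complete(producer())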

Answered By: user4815162342

I’m in a similar situation: I’ve got a huge list of URLs to scrape (NN millions).

So I came up with this solution:

import asyncio, random

urls = ['url1',....]

def get_url() -> str | None:
    global urls
    # pop the next url, or return None when the list is exhausted
    return urls.pop() if urls else None


async def producer(queue: asyncio.Queue):
    while True:
        if queue.full():
            print(f"queue full ({queue.qsize()}), sleeping...")
            await asyncio.sleep(0.3)
            continue

        # produce a token and send it to a consumer
        url = get_url()
        if not url:
            break
        print(f"PRODUCED: {url}")
        await queue.put(url)
        await asyncio.sleep(0.1)


async def consumer(queue: asyncio.Queue):
    while True:
        url = await queue.get()
        # simulate I/O operation
        await asyncio.sleep(random.randint(1, 3))
        queue.task_done()
        print(f"CONSUMED: {url}")


async def main():
    concurrency = 3
    queue: asyncio.Queue = asyncio.Queue(concurrency)

    # fire up both the producers and the consumers
    consumers = [asyncio.create_task(consumer(queue)) for _ in range(concurrency)]
    producers = [asyncio.create_task(producer(queue)) for _ in range(1)]

    # with both producers and consumers running, wait for
    # the producers to finish
    await asyncio.gather(*producers)
    print("---- done producing")

    # wait for the remaining tasks to be processed
    await queue.join()

    # cancel the consumers, which are now idle
    for c in consumers:
        c.cancel()


asyncio.run(main())

Since the list of URLs to scrape is pretty huge, the producer waits for workers to become available before pushing another task into the queue.
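If you want the consumer to do real scraping instead of the simulated sleep, it could look roughly like this (using a shared aiohttp.ClientSession here is just one possibility, not part of the answer above):

import aiohttp

async def consumer(queue: asyncio.Queue, session: aiohttp.ClientSession):
    while True:
        url = await queue.get()
        try:
            # fetch the page instead of sleeping
            async with session.get(url) as resp:
                body = await resp.text()
                print(f"CONSUMED: {url} ({resp.status}, {len(body)} bytes)")
        finally:
            queue.task_done()

main() would then open a single session (async with aiohttp.ClientSession() as session:) and pass it to each consumer task.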

Answered By: masroore