Download multiple azure blobs asynchronously

Question:

I am trying to improve the speed of downloading blobs from Azure. Using the package examples from Azure, I have created my own example, but it only works for a single file. I want to be able to pass in multiple customers in the form of a customer_id_list (commented out below) so that the files can be downloaded at the same time. However, I am unsure how to scale the async code to achieve this.

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def download_blob_to_file(blob_service_client: BlobServiceClient, container_name, transaction_date, customer_id):
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=f"{transaction_date}/{customer_id}.csv")
    with open(file=f'{customer_id}.csv', mode="wb") as sample_blob:
        download_stream = await blob_client.download_blob()
        data = await download_stream.readall()
        sample_blob.write(data)


async def main(transaction_date, customer_id):
    connect_str = "connection-string"
    blob_serv_client = BlobServiceClient.from_connection_string(connect_str)

    async with blob_serv_client as blob_service_client:
        await download_blob_to_file(blob_service_client, "sample-container", transaction_date, customer_id)

if __name__ == '__main__':
    transaction_date = '20240409'
    customer_id = '001'
    # customer_id_list = ['001', '002', '003', '004']
    asyncio.run(main(transaction_date, customer_id))

Asked By: Tim


Answers:

Download multiple Azure blobs asynchronously.

You can use the asyncio.gather function to run multiple download coroutines concurrently, so several blobs are fetched at the same time.

Here is the modified code to download multiple Azure blobs asynchronously. In my environment, I created sample CSV files as you mentioned.

Portal: (screenshot of the container showing the uploaded CSV blobs)

Code:

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def download_blob_to_file(blob_service_client: BlobServiceClient, container_name, transaction_date, customer_id):
    # Build a client for the blob at <transaction_date>/<customer_id>.csv and write it to a local file.
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=f"{transaction_date}/{customer_id}.csv")
    with open(file=f'{customer_id}.csv', mode="wb") as sample_blob:
        download_stream = await blob_client.download_blob()
        data = await download_stream.readall()
        sample_blob.write(data)

async def main(transaction_date, customer_id_list):
    connect_str = "<connectionstring>"
    blob_serv_client = BlobServiceClient.from_connection_string(connect_str)

    async with blob_serv_client as blob_service_client:
        # Schedule one download task per customer, then wait for all of them to finish.
        tasks = []
        for customer_id in customer_id_list:
            task = asyncio.create_task(download_blob_to_file(blob_service_client, "test", transaction_date, customer_id))
            tasks.append(task)
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    transaction_date = '20240409'
    customer_id_list = ['001', '002', '003', '004']
    asyncio.run(main(transaction_date, customer_id_list))

The above code takes a list of customer IDs as a parameter to the main function. For each ID in the list, a task is created that downloads a single blob. asyncio.gather then runs all of these tasks concurrently, which speeds up downloading several blobs compared to fetching them one after another.
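If the customer list grows large, you may also want to cap how many downloads run at once and keep one failed blob from cancelling the rest. Below is a minimal sketch of that idea using asyncio.Semaphore and gather(return_exceptions=True); the bounded_download helper and the max_concurrency value are illustrative names I chose, not part of the Azure SDK.

import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def download_blob_to_file(blob_service_client: BlobServiceClient, container_name, transaction_date, customer_id):
    # Same per-blob download as above: read the blob and write it to <customer_id>.csv locally.
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=f"{transaction_date}/{customer_id}.csv")
    with open(file=f"{customer_id}.csv", mode="wb") as sample_blob:
        download_stream = await blob_client.download_blob()
        data = await download_stream.readall()
        sample_blob.write(data)

async def bounded_download(semaphore, blob_service_client, container_name, transaction_date, customer_id):
    # Wait for a free slot before starting, so only a limited number of downloads run at once.
    async with semaphore:
        await download_blob_to_file(blob_service_client, container_name, transaction_date, customer_id)

async def main(transaction_date, customer_id_list, max_concurrency=8):
    connect_str = "<connectionstring>"
    semaphore = asyncio.Semaphore(max_concurrency)  # at most max_concurrency downloads in flight
    async with BlobServiceClient.from_connection_string(connect_str) as blob_service_client:
        tasks = [
            bounded_download(semaphore, blob_service_client, "test", transaction_date, customer_id)
            for customer_id in customer_id_list
        ]
        # return_exceptions=True collects failures instead of cancelling the remaining downloads.
        results = await asyncio.gather(*tasks, return_exceptions=True)
        for customer_id, result in zip(customer_id_list, results):
            if isinstance(result, Exception):
                print(f"Download failed for {customer_id}: {result}")

if __name__ == '__main__':
    asyncio.run(main('20240409', ['001', '002', '003', '004']))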

Output: (screenshot of the downloaded CSV files)

Answered By: Venkatesan