Download multiple Azure blobs asynchronously
Question:
I am trying to improve the speed of downloading blobs from Azure. Using the package examples from the Azure SDK, I created my own example, but it only works for a single file. I want to pass in multiple customers as a customer_id_list (commented out below) so that the files can be downloaded at the same time, but I am unsure how to scale the async code to achieve this.
import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def download_blob_to_file(blob_service_client: BlobServiceClient, container_name, transaction_date, customer_id):
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=f"{transaction_date}/{customer_id}.csv")
    with open(file=f'{customer_id}.csv', mode="wb") as sample_blob:
        download_stream = await blob_client.download_blob()
        data = await download_stream.readall()
        sample_blob.write(data)

async def main(transaction_date, customer_id):
    connect_str = "connection-string"
    blob_serv_client = BlobServiceClient.from_connection_string(connect_str)
    async with blob_serv_client as blob_service_client:
        await download_blob_to_file(blob_service_client, "sample-container", transaction_date, customer_id)

if __name__ == '__main__':
    transaction_date = '20240409'
    customer_id = '001'
    # customer_id_list = ['001', '002', '003', '004']
    asyncio.run(main(transaction_date, customer_id))
Answers:
Download multiple Azure blobs asynchronously.
You can use asyncio.gather to run multiple download coroutines concurrently, fetching several blobs at the same time.
Here is the modified code to download multiple Azure blobs asynchronously. In my environment, I created the CSV files in the container as you described.
Code:
import asyncio
from azure.storage.blob.aio import BlobServiceClient

async def download_blob_to_file(blob_service_client: BlobServiceClient, container_name, transaction_date, customer_id):
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=f"{transaction_date}/{customer_id}.csv")
    with open(file=f'{customer_id}.csv', mode="wb") as sample_blob:
        download_stream = await blob_client.download_blob()
        data = await download_stream.readall()
        sample_blob.write(data)

async def main(transaction_date, customer_id_list):
    connect_str = "<connectionstring>"
    blob_serv_client = BlobServiceClient.from_connection_string(connect_str)
    async with blob_serv_client as blob_service_client:
        # Create one download task per customer ID.
        tasks = []
        for customer_id in customer_id_list:
            task = asyncio.create_task(download_blob_to_file(blob_service_client, "test", transaction_date, customer_id))
            tasks.append(task)
        # Run all download tasks concurrently and wait for every one to finish.
        await asyncio.gather(*tasks)

if __name__ == '__main__':
    transaction_date = '20240409'
    customer_id_list = ['001', '002', '003', '004']
    asyncio.run(main(transaction_date, customer_id_list))
The code above passes a list of customer IDs to the main function. Each task created in the loop downloads a single blob, and asyncio.gather runs all of them concurrently, which speeds up downloading several blobs.
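If the customer list grows large, launching every download at once can open too many connections at a time. Below is a minimal sketch of capping concurrency with asyncio.Semaphore; download_one here is a stand-in coroutine for illustration, not the real blob download.

```python
import asyncio

async def download_one(customer_id: str) -> str:
    # Stand-in for download_blob_to_file; replace the sleep with the
    # actual blob_client.download_blob() / readall() calls.
    await asyncio.sleep(0.01)
    return f"{customer_id}.csv"

async def download_all(customer_ids, max_concurrent: int = 4):
    # The semaphore caps how many downloads run at the same time.
    sem = asyncio.Semaphore(max_concurrent)

    async def bounded(customer_id):
        async with sem:
            return await download_one(customer_id)

    # return_exceptions=True keeps one failed blob from cancelling the rest;
    # gather preserves the input order in its result list.
    return await asyncio.gather(*(bounded(c) for c in customer_ids),
                                return_exceptions=True)

results = asyncio.run(download_all(['001', '002', '003', '004']))
print(results)  # ['001.csv', '002.csv', '003.csv', '004.csv']
```

With return_exceptions=True, a failed download shows up as an exception object in the result list instead of aborting the whole batch, so you can retry just the failures.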
Output: 001.csv, 002.csv, 003.csv and 004.csv are downloaded to the working directory.
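One more refinement worth noting: open() and write() are blocking calls, so writing a large blob to disk stalls the event loop and the other downloads with it. A minimal sketch of offloading the write with asyncio.to_thread (Python 3.9+); save_bytes is an illustrative helper, not part of the Azure SDK.

```python
import asyncio
import pathlib
import tempfile

def save_bytes(path: str, data: bytes) -> None:
    # Plain blocking write; safe to run in a worker thread.
    with open(path, "wb") as f:
        f.write(data)

async def write_without_blocking(path: str, data: bytes) -> None:
    # asyncio.to_thread runs the blocking write in a thread pool,
    # so other download coroutines keep making progress meanwhile.
    await asyncio.to_thread(save_bytes, path, data)

# Minimal demonstration with a temporary file.
tmp = pathlib.Path(tempfile.mkdtemp()) / "001.csv"
asyncio.run(write_without_blocking(str(tmp), b"customer,amount\n001,10\n"))
print(tmp.read_bytes())  # b'customer,amount\n001,10\n'
```

Inside download_blob_to_file, the same pattern would wrap the open/write pair so only the network reads stay on the event loop.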