How to read from Azure Blob Storage with Python delta-rs

Question:

I’d like to use the Python bindings to delta-rs to read from my blob storage.

Currently I am kind of lost, since I cannot figure out how to configure the filesystem on my local machine. Where do I have to put my credentials?

Can I use adlfs for this?

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="...",
)

and then use the fs object?

Asked By: user7454972


Answers:

Unfortunately we don’t have great documentation around this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables, as in this integration test.
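
For example, here is a minimal sketch; the account name, SAS token, container, and table path are placeholders, and the exact URI format the bindings expect varies across delta-rs versions:

import os
from deltalake import DeltaTable

# delta-rs picks these credentials up from the environment.
os.environ["AZURE_STORAGE_ACCOUNT"] = "my-storage-account"
os.environ["AZURE_STORAGE_SAS"] = "<sas-token>"

dt = DeltaTable("abfss://my-container/path/to/table")
print(dt.files())  # paths of the Parquet files that make up the table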

That will ensure the Python bindings can access the table metadata, but the actual fetching of data for a query is typically done through Pandas, and I’m not sure whether Pandas will pick up these variables as well (I’m not an ADLSv2 user myself).

Answered By: rtyler

One possible workaround is to download the Delta Lake files to a temporary directory and read them using python-delta-rs, with something like this:

import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable

def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        if "." in blob.name:
            # To just get files and not directories, there might be a better way to do this
            blob_names.append(blob.name)

    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
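    # Download each blob, mirroring the blob folder structure under local_folder.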
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)

        with open(local_filename, 'wb') as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
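    # Download the table's files to a temporary directory, then open them with delta-rs.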
    blob_storage_url = "https://your-blob-storage"  # e.g. "https://<account-name>.blob.core.windows.net"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")

    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
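
A hypothetical call could look like this (the folder path and access key are placeholders):

df = read_delta_lake_file_to_df("path/to/delta-table", "<storage-account-key>")
print(df.head())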


Answered By: Simen Holmestad

I don’t know about delta-rs, but you can use the adlfs object directly with pandas:

import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(
    account_name="account_name",
    account_key="access_key",
    container_name="name_of_container",
)
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)
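
Note that this reads the Parquet files directly and does not consult the table’s _delta_log, so it can pick up files the Delta log has already removed; for a transaction-log-aware read you still need deltalake, as in the answers above.
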
Answered By: Atharva Khutale