Connecting to an Azure storage account to read a parquet file via managed identity using the polars library

Question:

I am using the Python version of the polars library to read a parquet file with a large number of rows. Here is the link to the library – https://github.com/pola-rs/polars

I am trying to read a parquet file from an Azure storage account using the read_parquet method. I can see there is a storage_options argument which can be used to specify how to connect to the data storage. Here is the definition of the read_parquet method –

def read_parquet(
    source: str | Path | BinaryIO | BytesIO | bytes,
    columns: list[int] | list[str] | None = None,
    n_rows: int | None = None,
    use_pyarrow: bool = False,
    memory_map: bool = True,
    storage_options: dict[str, object] | None = None,
    parallel: ParallelStrategy = "auto",
    row_count_name: str | None = None,
    row_count_offset: int = 0,
    low_memory: bool = False,
    pyarrow_options: dict[str, object] | None = None,
) -> DataFrame:

Can anyone let me know what values I need to provide as part of storage_options to connect to the Azure storage account if I am using a system-assigned managed identity? Unfortunately, I could not find any example for this. Most of the examples use a connection string or access keys, and for security reasons I cannot use them.

edit: I just came to know that the storage_options are passed to another library called fsspec, but I have no idea about it.

Asked By: Niladri


Answers:

This code should work:

import pandas as pd

# Authenticate with a SAS token; pandas forwards these options to fsspec/adlfs
storage_options = {'account_name': '<account>', 'sas_token': '<token>'}
df = pd.read_parquet(
    'abfs://<container>@<account>.dfs.core.windows.net/<parquet path>',
    storage_options=storage_options,
)
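Since polars forwards storage_options to fsspec in the same way, the equivalent call in polars should look like this (a sketch using the same placeholder values):

import polars as pl

# Same SAS-token-based options, passed straight to polars, which hands
# them to fsspec/adlfs just as pandas does
storage_options = {'account_name': '<account>', 'sas_token': '<token>'}
df = pl.read_parquet(
    'abfs://<container>@<account>.dfs.core.windows.net/<parquet path>',
    storage_options=storage_options,
)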
Answered By: user21081045

I finally figured out the solution. Anyone who is looking to use a managed identity to connect to an Azure Data Lake Storage Gen2 account can follow the steps below.
As someone mentioned in the comments, polars uses the fsspec and adlfs Python libraries to connect to remote files in Azure. To connect using a managed identity, we can use the code below –

import polars as pl

# 'anon': False tells adlfs not to use anonymous access and to fall back
# to DefaultAzureCredential (which picks up the managed identity)
storage_options = {'account_name': ACCOUNT_NAME, 'anon': False}
df = pl.read_parquet(
    source=<remote-file-path>,
    columns=<list of columns>,
    storage_options=storage_options,
)

This will try to use DefaultAzureCredential from the azure.identity library to connect to the storage account. If you already have a managed identity enabled for your Azure resource with the proper RBAC permissions, you should be able to connect.

Documentation: https://github.com/fsspec/adlfs#setting-credentials
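The same adlfs documentation also describes a credential option that accepts azure.identity credential objects, which can be used if you want to be explicit about which identity is used instead of relying on the DefaultAzureCredential chain. A minimal sketch, assuming the azure-identity package is installed and using the same placeholders as above:

import polars as pl
from azure.identity import ManagedIdentityCredential

# Pass an explicit credential object to adlfs via storage_options;
# ManagedIdentityCredential() with no arguments uses the system-assigned
# identity (pass client_id=... for a user-assigned one)
storage_options = {
    'account_name': ACCOUNT_NAME,
    'credential': ManagedIdentityCredential(),
}
df = pl.read_parquet(
    source=<remote-file-path>,
    storage_options=storage_options,
)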

Answered By: Niladri