azure-data-lake

PyArrow slice pushdown for Azure Data Lake

PyArrow slice pushdown for Azure Data Lake Question: I want to access Parquet files on an Azure Data Lake and only retrieve some rows. Here is a reproducible example, using a public dataset: import pyarrow.dataset as ds from adlfs import AzureBlobFileSystem abfs_public = AzureBlobFileSystem(account_name="azureopendatastorage") dataset_public = ds.dataset('az://nyctlc/yellow/puYear=2010/puMonth=1/part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-fa74bfbd47bc-426339-18.c000.snappy.parquet', filesystem=abfs_public) The processing time is the same …

Total answers: 2
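A minimal sketch of the usual answer, for context: Parquet has no true row-slice pushdown, but PyArrow's projection and predicate pushdown prune columns and row groups before they are downloaded, and head() stops the scan early. The column names below (tpepPickupDateTime, fareAmount) are assumed from the NYC taxi open dataset schema.

    import pyarrow.dataset as ds
    from adlfs import AzureBlobFileSystem

    abfs_public = AzureBlobFileSystem(account_name="azureopendatastorage")

    dataset_public = ds.dataset(
        "az://nyctlc/yellow/puYear=2010/puMonth=1/"
        "part-00000-tid-8898858832658823408-a1de80bd-eed3-4d11-b9d4-"
        "fa74bfbd47bc-426339-18.c000.snappy.parquet",
        filesystem=abfs_public,
    )

    # Only the projected columns, and only the row groups whose statistics
    # can satisfy the filter, are fetched from Azure.
    table = dataset_public.to_table(
        columns=["tpepPickupDateTime", "fareAmount"],  # assumed column names
        filter=ds.field("fareAmount") > 100.0,
    )

    # head() short-circuits the scan once n rows have been materialized.
    first_rows = dataset_public.head(5)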

Timeout error while uploading a large file to ADLS

Timeout error while uploading a large file to ADLS Question: I need to upload a 200 MB file to ADLS using Python. I'm using the code provided in the official documentation – https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-directory-file-acl-python?tabs=azure-ad While calling the following function for upload – def upload_file_to_directory_bulk(): try: file_system_client = service_client.get_file_system_client(file_system="system") directory_client = file_system_client.get_directory_client("my-directory") file_client = directory_client.get_file_client("uploaded-file.txt") local_file = …

Total answers: 2
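A hedged sketch of one common fix, assuming the azure-storage-file-datalake SDK: stream the file through upload_data with a smaller chunk size and a longer connection timeout. The account URL, credential, and numeric values are placeholders, not recommendations.

    from azure.storage.filedatalake import DataLakeServiceClient

    service_client = DataLakeServiceClient(
        account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
        credential="<account-key-or-token-credential>",                # placeholder
    )

    def upload_file_to_directory_bulk():
        file_system_client = service_client.get_file_system_client(file_system="system")
        directory_client = file_system_client.get_directory_client("my-directory")
        file_client = directory_client.get_file_client("uploaded-file.txt")

        with open("local-200mb-file.txt", "rb") as local_file:
            file_client.upload_data(
                local_file,
                overwrite=True,
                chunk_size=25 * 1024 * 1024,  # smaller chunks tolerate slow links better
                connection_timeout=600,       # seconds; the default is easy to exceed at 200 MB
            )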

Write a variable's data as-is to an ADLS file

Write a variable's data as-is to an ADLS file Question: I want to write the content of a variable that is created dynamically in the program to an ADLS file. This is how I am getting the data – @dataclass class pipeline_run: id:str group_id:str run_start:str run_end:str pipeline_name:str pipeline_status:str parameters:str message:str addl_properties:str runs = adf_client.pipeline_runs.query_by_factory(rg_name, …

Total answers: 1
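A minimal sketch, assuming the azure-storage-file-datalake SDK: serialize the dataclass instances to JSON in memory with dataclasses.asdict and upload the resulting string directly, so no temporary local file is needed. The container and file path are hypothetical.

    import json
    from dataclasses import dataclass, asdict
    from azure.storage.filedatalake import DataLakeServiceClient

    @dataclass
    class pipeline_run:
        id: str
        group_id: str
        run_start: str
        run_end: str
        pipeline_name: str
        pipeline_status: str
        parameters: str
        message: str
        addl_properties: str

    def write_runs(service_client: DataLakeServiceClient, runs: list) -> None:
        # asdict() turns each dataclass instance into a plain dict for json.dumps.
        payload = json.dumps([asdict(r) for r in runs], indent=2)
        file_client = service_client.get_file_system_client(
            "my-container"                                  # hypothetical container
        ).get_file_client("monitoring/pipeline_runs.json")  # hypothetical path
        file_client.upload_data(payload, overwrite=True)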

Problem when renaming a file in Azure Databricks from a data lake

Problem when renaming a file in Azure Databricks from a data lake Question: I am trying to rename a file with Python in Azure Databricks through the "import os" library, using the "rename()" function. It is something very simple really, but when doing it in Databricks I can't get to the path where my file …

Total answers: 2
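A hedged sketch of the usual diagnosis: os.rename() only sees the driver's local POSIX filesystem, so a mounted data-lake path has to be reached either through dbutils.fs.mv or through the /dbfs FUSE prefix. The mount point and file names are placeholders; dbutils is the global that Databricks notebooks provide.

    # Option 1: Databricks utility, addressing the mount directly.
    dbutils.fs.mv(
        "dbfs:/mnt/mydatalake/raw/old_name.csv",
        "dbfs:/mnt/mydatalake/raw/new_name.csv",
    )

    # Option 2: the stdlib, via the /dbfs FUSE mount on the driver.
    import os
    os.rename(
        "/dbfs/mnt/mydatalake/raw/old_name.csv",
        "/dbfs/mnt/mydatalake/raw/new_name.csv",
    )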

List All Files in a Folder Sitting in a Data Lake

List All Files in a Folder Sitting in a Data Lake Question: I’m trying to get an inventory of all files in a folder, which has a few sub-folders, all of which sit in a data lake. Here is the code that I’m testing. import sys, os import pandas as pd mylist = [] root …

Total answers: 3
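A minimal sketch in the direction the question starts from, assuming the lake is mounted under /dbfs: os.walk() recurses through every sub-folder and pandas collects the inventory. The mount path is a placeholder.

    import os
    import pandas as pd

    mylist = []
    root = "/dbfs/mnt/mydatalake/myfolder"  # placeholder mount path

    # os.walk descends into each sub-folder, yielding the files per directory.
    for path, subdirs, files in os.walk(root):
        for name in files:
            mylist.append(os.path.join(path, name))

    inventory = pd.DataFrame(mylist, columns=["file_path"])
    print(inventory.head())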

How to loop through Azure Datalake Store files in Azure Databricks

How to loop through Azure Datalake Store files in Azure Databricks Question: I am currently listing files in Azure Datalake Store gen1 successfully with the following command: dbutils.fs.ls('mnt/dbfolder1/projects/clients') The structure of this folder is – client_comp_automotive_1.json [File] – client_comp_automotive_2.json [File] – client_comp_automotive_3.json [File] – client_comp_automotive_4.json [File] – PROCESSED [Folder] I want to loop through those …

Total answers: 3
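A hedged sketch, assuming a Databricks notebook where dbutils and spark are predefined: iterate the listing, skip sub-folders such as PROCESSED, handle each JSON file, then move it out of the way. The transform step is a placeholder.

    folder = "mnt/dbfolder1/projects/clients"

    for entry in dbutils.fs.ls(folder):
        if entry.isDir():  # skips PROCESSED and any other sub-folder
            continue
        df = spark.read.json(entry.path)  # one client_comp_automotive_*.json at a time
        # ... transform or validate df here (placeholder) ...
        dbutils.fs.mv(entry.path, f"{folder}/PROCESSED/{entry.name}")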