List All Files in a Folder Sitting in a Data Lake

Question:

I’m trying to get an inventory of all files in a folder that has a few sub-folders, all of which sit in a data lake. Here is the code that I’m testing.

import sys, os
import pandas as pd

mylist = []
root = "/mnt/rawdata/parent/"
path = os.path.join(root, "targetdirectory") 

for path, subdirs, files in os.walk(path):
    for name in files:
        mylist.append(os.path.join(path, name))


df = pd.DataFrame(mylist)
print(df)

I also tried the sample code from this link:

Python list directory, subdirectory, and files

I’m working in Azure Databricks. I’m open to using Scala to do the job. So far, nothing has worked for me. Each time, I keep getting an empty dataframe. I believe this is pretty close, but I must be missing something small. Thoughts?

Asked By: ASH


Answers:

Databricks File System (DBFS) is a distributed file system mounted into an Azure Databricks workspace and available on Azure Databricks clusters. If you are using the local file APIs, you have to reference the Databricks filesystem through its FUSE mount. Azure Databricks configures each cluster node with a FUSE mount at /dbfs that allows processes running on cluster nodes to read and write to the underlying distributed storage layer with local file APIs (see also the documentation).

So the /dbfs prefix has to be included in the path:

root = "/dbfs/mnt/rawdata/parent/"

That is different from working with the Databricks Filesystem Utility (DBUtils). The file system utilities access the Databricks File System, making it easier to use Azure Databricks as a file system:

dbutils.fs.ls("/mnt/rawdata/parent/")
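
The FileInfo objects returned by dbutils.fs.ls carry path, name, and size fields, so a flat (non-recursive) inventory can be turned into a dataframe roughly like this (a sketch that only runs inside a Databricks notebook, where dbutils is available):

import pandas as pd

# dbutils.fs.ls lists only the immediate children of the folder.
entries = dbutils.fs.ls("/mnt/rawdata/parent/")
df = pd.DataFrame(
    [(e.path, e.name, e.size) for e in entries],
    columns=["path", "name", "size"],
)
print(df)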

For larger data lakes, I can recommend a Scala example in the Knowledge Base. The advantage is that it runs the listing for all child leaves distributed across the cluster, so it also works for bigger directories.
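
If Scala is not an option, a rough Python approximation of that idea (not the Knowledge Base example itself) is to fan the per-directory walk out over the cluster as an RDD. This sketch assumes the /dbfs FUSE mount mentioned above is reachable from every worker node; walk_one and top_dirs are just illustrative names:

import os

root = "/dbfs/mnt/rawdata/parent/"

# List the immediate sub-folders on the driver ...
top_dirs = [os.path.join(root, d) for d in os.listdir(root)
            if os.path.isdir(os.path.join(root, d))]

def walk_one(directory):
    # ... and let each worker walk one sub-folder via the FUSE mount.
    return [os.path.join(dirpath, name)
            for dirpath, _, files in os.walk(directory)
            for name in files]

all_files = (sc.parallelize(top_dirs, max(len(top_dirs), 1))
               .flatMap(walk_one)
               .collect())
print(len(all_files))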

Answered By: Hauke Mallow

I got this to work.

from azure.storage.blob import BlockBlobService

# Legacy azure-storage SDK client (replace the placeholders with real credentials).
blob_service = BlockBlobService(account_name='your_account_name', account_key='your_account_key')

# Page through every blob in the 'rawdata' container, following the
# continuation marker until the listing is exhausted.
blobs = []
marker = None
while True:
    batch = blob_service.list_blobs('rawdata', marker=marker)
    blobs.extend(batch)
    if not batch.next_marker:
        break
    marker = batch.next_marker

for blob in blobs:
    print(blob.name)

The only prerequisite is that you need to install azure.storage on the cluster. So, in the Clusters window, click ‘Install New’ -> PyPI > package = ‘azure.storage’. Finally, click ‘Install’.
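
Since the original goal was a dataframe, the blob names collected above can be dropped straight into pandas (a small illustrative addition; "blob_name" is an arbitrary column label):

import pandas as pd

# Build the inventory dataframe from the blob listing above.
df = pd.DataFrame([b.name for b in blobs], columns=["blob_name"])
print(df)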

Answered By: ASH

This worked for me for finding all the parquet files, starting from a DBFS path:

#------
# Find parquet files in subdirectories recursively.
def find_parquets(dbfs_ls_list):
    parquet_list = []
    if isinstance(dbfs_ls_list, str):
        # Allows the user to start the recursion with just a path.
        dbfs_ls_list = dbutils.fs.ls(dbfs_ls_list)
        parquet_list += find_parquets(dbfs_ls_list)
    else:
        for file_data in dbfs_ls_list:
            if file_data.size == 0 and file_data.name[-1] == '/':
                # Found a sub-directory: recurse into it.
                new_dbfs_ls_list = dbutils.fs.ls(file_data.path)
                parquet_list += find_parquets(new_dbfs_ls_list)
            elif '.parquet' in file_data.name:
                parquet_list.append(file_data.path)
    return parquet_list

#------
root_dir = 'dbfs:/FileStore/my/parent/folder/'
file_list = find_parquets(root_dir)

Answered By: Shaun Bowman