How to calculate a Directory size in ADLS using PySpark?
Question:
I want to calculate the size of a directory (e.g. XYZ) that contains subfolders and files, i.e. the total size of all the files and everything inside XYZ.
I can find all the folders inside a particular path, but I want the size of everything put together.
I also see that
display(dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet"))
gives me the size of the abc file, but I want the complete size of XYZ.
Answers:
dbutils.fs.ls doesn't have a recursive option the way cp, mv, or rm do, so you need to iterate yourself. Here is a snippet that will do the task for you. Run the code from a Databricks notebook.
from dbutils import FileInfo
from typing import List

root_path = "/mnt/datalake/.../XYZ"

def discover_size(path: str, verbose: bool = True):
    def loop_path(paths: List[FileInfo], accum_size: float):
        if not paths:
            return accum_size
        else:
            head, tail = paths[0], paths[1:]
            if head.size > 0:
                if verbose:
                    print(f"{head.path}: {head.size / 1e6} MB")
                accum_size += head.size / 1e6
                return loop_path(tail, accum_size)
            else:
                extended_tail = dbutils.fs.ls(head.path) + tail
                return loop_path(extended_tail, accum_size)

    return loop_path(dbutils.fs.ls(path), 0.0)

discover_size(root_path, verbose=True)  # Total size in megabytes at the end
If the location is mounted in DBFS, then you could use the du -h approach (I have not tested it). If you are in the notebook, create a new cell with:
%sh
du -h /mnt/datalake/.../XYZ
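A small variation on the same idea, assuming the mount is also reachable under the local /dbfs FUSE path on the driver (the -s flag prints only the grand total instead of every subdirectory):

%sh
du -sh /dbfs/mnt/datalake/.../XYZ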
Try using the dbutils fs ls command, get the list of files into a DataFrame, and query it using the aggregate function SUM() on the size column:
val fsds = dbutils.fs.ls("/mnt/datalake/.../XYZ/.../abc.parquet").toDF
fsds.createOrReplaceTempView("filesList")
display(spark.sql("select COUNT(name) as NoOfRows, SUM(size) as sizeInBytes from filesList"))
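Since the question asks about PySpark, here is a rough Python sketch of the same approach; it assumes the listing fits comfortably on the driver and, like the Scala version, it only aggregates the single level returned by ls (the column names are illustrative):

# Turn the FileInfo listing into a small DataFrame so SQL aggregates can be used
files = dbutils.fs.ls("/mnt/datalake/.../XYZ")
files_df = spark.createDataFrame(
    [(f.path, f.name, f.size) for f in files],
    ["path", "name", "size"],
)
files_df.createOrReplaceTempView("filesList")
display(spark.sql("select COUNT(name) as NoOfRows, SUM(size) as sizeInBytes from filesList"))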
The answer by @Emer is good, but it can hit RecursionError: maximum recursion depth exceeded very quickly, because it recurses once per file (with X files you get X nested recursive calls).
Here is the same thing with recursion only for folders:
%python
from dbutils import FileInfo
from typing import List

def discover_size2(path: str, verbose: bool = True):
    def loop_path(path: str):
        accum_size = 0.0
        path_list = dbutils.fs.ls(path)
        if path_list:
            for path_object in path_list:
                if path_object.size > 0:
                    if verbose:
                        print(f"{path_object.path}: {path_object.size / 1e6} MB")
                    accum_size += path_object.size / 1e6
                else:
                    # Folder: recursive discovery
                    accum_size += loop_path(path_object.path)
        return accum_size

    return loop_path(path)
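A minimal call, reusing the illustrative root path from the first answer:

total_mb = discover_size2("/mnt/datalake/.../XYZ", verbose=False)
print(f"Total size of XYZ: {total_mb} MB")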
Love the answer by Emer!
Small addition: if you hit "ModuleNotFoundError: No module named 'dbutils'", try importing from dbruntime.dbutils instead of from dbutils. It works for me!
— Shizheng
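Concretely, the import line at the top of the snippets above would become:

# Per the note above, when 'dbutils' is not importable as a module
from dbruntime.dbutils import FileInfo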
For anyone still hitting recursion limits with @robin loche’s approach, here is a purely iterative answer:
# from dbutils import FileInfo  # Not required in Databricks
# from dbruntime.dbutils import FileInfo  # may work for some people

def get_size_of_path(path):
    return sum(file.size for file in get_all_files_in_path(path))

def get_all_files_in_path(path, verbose=False):
    nodes_new = dbutils.fs.ls(path)
    files = []
    while len(nodes_new) > 0:
        current_nodes = nodes_new
        nodes_new = []
        for node in current_nodes:
            if verbose:
                print(f"Processing {node.path}")
            if node.size > 0:
                # Plain file listed directly under the starting path
                files.append(node)
                continue
            children = dbutils.fs.ls(node.path)
            for child in children:
                if child.size == 0 and child.path != node.path:
                    # Sub-directory: queue it for the next pass
                    nodes_new.append(child)
                elif child.path != node.path:
                    # File: record it
                    files.append(child)
    return files

path = "s3://some/path/"
print(f"Size of {path} in GB: {get_size_of_path(path) / 1024 / 1024 / 1024}")