How to list objects at one depth level without listing sub-objects with the GCP Cloud Storage Python API?

Question

The Cloud Storage Python API allows listing objects using a prefix, which limits the listing to certain sub-branches of objects in the bucket.

from google.cloud import storage

storage_client = storage.Client()
bucket_name = "my-bucket"
folders = "logs/app"
storage_client.list_blobs(bucket_name, prefix=folders)

This operation will return all objects whose names start with "logs/app". But it will return absolutely all objects, including those that lie on deeper levels of the hierarchy. For example, I’ve got many applications app=1, app=2, etc., so the output will look like this:

logs/app=1
logs/app=1/module=1
logs/app=1/module=1/log_1.txt
logs/app=1/module=1/log_2.txt
logs/app=2
logs/app=2/module=1
logs/app=2/module=1/log_1.txt
logs/app=2/module=1/log_2.txt

and so on.
Listing objects this way scans everything, and because of that it’s slow. For example, if I’ve got 80K or 1M files stored in those folders, all of them will be scanned and returned.

I would like to get results for only one depth level. For example, I would like to get only this:

logs/app=1
logs/app=2

And I don’t want the SDK to scan everything. Is there a way to achieve this? Maybe not with this API, maybe there is another Python SDK which could be used for this?

Asked By: Alexander Goida


Answer 1

Unfortunately, the Cloud Storage Python API does not have a dedicated method for listing objects at a specific depth level. But as @Dharmaraj pointed out in the comments, the accepted answer in the corresponding thread achieves this by filtering the returned results after listing all objects.

I have only compacted the for-loop from that answer into a one-liner here.

Names at the top level contain no further / after the prefix, so we can filter on that. Note that this only filters the results on the client side, keeping the objects that are at the desired depth level. So try the following:

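from google.cloud import storage

storage_client = storage.Client()
bucket_name = "my-bucket"
folders = "logs/app"
results = storage_client.list_blobs(bucket_name, prefix=folders)  # Note: [1]
# Keep only blobs whose name has no further '/' after the prefix.
required_objects = [blob for blob in results if '/' not in blob.name[len(folders):]]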

[1] At this point it will return absolutely all objects, including those that lie on deeper levels of the hierarchy. We then just filter them afterwards.
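
To see what the filter keeps, here is a small sketch that applies the same condition to plain strings instead of blob objects. The helper name direct_children is only illustrative, and the sample names are copied from the listing in the question.

```python
def direct_children(names, prefix):
    """Keep names that have no '/' after the prefix (same test as the filter above)."""
    return [name for name in names if "/" not in name[len(prefix):]]


# Sample object names copied from the question's listing.
sample_names = [
    "logs/app=1",
    "logs/app=1/module=1",
    "logs/app=1/module=1/log_1.txt",
    "logs/app=2",
    "logs/app=2/module=1",
    "logs/app=2/module=1/log_1.txt",
]

print(direct_children(sample_names, "logs/app"))
```

Keep in mind that every name still has to be listed first, so this only trims the result and does not make the listing itself faster.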

For more information, see the documentation for list_blobs, which returns an iterator used to find blobs in the bucket.
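
One thing worth checking in that documentation is the delimiter parameter of list_blobs. As far as I know, passing delimiter="/" together with a prefix makes the service collapse everything below the next "/" into a single prefix entry, and those entries are exposed on the iterator's prefixes attribute once the iterator has been consumed. A minimal sketch under that assumption (the function name list_one_level is hypothetical, and the client is passed in so the helper can be tried without credentials):

```python
def list_one_level(client, bucket_name, prefix):
    """Return (object names, sub-prefixes) found directly under `prefix`."""
    # delimiter="/" asks for deeper objects to be collapsed into prefixes
    # instead of being returned one by one.
    iterator = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    # iterator.prefixes is filled in while pages are fetched,
    # so exhaust the iterator before reading it.
    names = [blob.name for blob in iterator]
    return names, sorted(iterator.prefixes)


# Usage (requires the google-cloud-storage package and valid credentials):
#   from google.cloud import storage
#   names, prefixes = list_one_level(storage.Client(), "my-bucket", "logs/app")
```

With the layout from the question, I would expect the prefixes to come back with a trailing slash, such as logs/app=1/ and logs/app=2/, while names would hold only objects that sit directly at that level. Please verify this against your own bucket before relying on it.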

Answered By: Rohit Kharche
