How to get list_blobs to behave like gsutil

Question:

I would like to only get the first level of a fake folder structure on GCS.

If I run e.g.:


gsutil ls 'gs://gcp-public-data-sentinel-2/tiles/'

I get a list like this:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
gs://gcp-public-data-sentinel-2/tiles/11/
gs://gcp-public-data-sentinel-2/tiles/12/
gs://gcp-public-data-sentinel-2/tiles/13/
gs://gcp-public-data-sentinel-2/tiles/14/
gs://gcp-public-data-sentinel-2/tiles/15/
.
.
.

Running code like the following in the Python API gives me an empty result:

from google.cloud import storage
bucket_name = 'gcp-public-data-sentinel-2'
prefix = 'tiles/'
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
for blob in bucket.list_blobs(max_results=10, prefix=prefix,
                              delimiter='/'):
    print(blob.name)

If I don’t use the delimiter option, I get all the results in the bucket, which is not very useful.

Asked By: cpaulik

||

Answers:

Maybe not the best way, but, inspired by this comment on the official repository:

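# Note: _get_next_page_response() is a private method and may change between versions.
iterator = bucket.list_blobs(delimiter='/', prefix=prefix)
# This fetches a single page of results, so only the first page is read here.
response = iterator._get_next_page_response()
# Use a separate loop variable so the original prefix is not overwritten.
for sub_prefix in response['prefixes']:
    print('gs://' + bucket_name + '/' + sub_prefix)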

Gives:

gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
...
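
A possible variant that avoids the private method: as far as I know, the iterator returned by list_blobs exposes a public prefixes set that is filled in as pages are fetched, so consuming the iterator and then reading that attribute should give the same prefixes. A minimal sketch, assuming that behaviour; list_subfolders is a made-up helper name, not part of the library:

```python
def list_subfolders(bucket, prefix, delimiter="/"):
    """Return the sorted "folder" prefixes directly below `prefix`.

    `bucket` is expected to behave like a google.cloud.storage Bucket,
    i.e. to offer list_blobs(prefix=..., delimiter=...).
    """
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    # The prefixes set is only populated while pages are being fetched,
    # so the iterator is consumed first; blobs at this level are ignored.
    for _blob in iterator:
        pass
    return sorted(iterator.prefixes)
```

With the bucket object from the question, list_subfolders(bucket, 'tiles/') should then return plain strings such as 'tiles/01/', which can be prefixed with 'gs://' + bucket_name + '/' as above.
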
Answered By: Mangu

If, like me, you find this thread long after it was posted: currently (google-cloud-storage 2.1.0) you can list the bucket contents by using '//' as the delimiter instead of '/'. However, this lists "recursively" down to the actual blobs, since GCS is not a real file system.
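
To illustrate why '/' yields no blobs directly under tiles/ while '//' returns everything, here is a rough pure-Python sketch of how delimiter grouping is usually described: names that contain the delimiter after the prefix are rolled up into a prefix, and the rest are returned as items. This is not the library's implementation; simulate_listing and the object names are made up for the example.

```python
def simulate_listing(names, prefix="", delimiter=None):
    """Group object names into (items, prefixes) like a delimiter listing."""
    items, prefixes = [], set()
    for name in sorted(names):
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        # Look for the delimiter in the part of the name after the prefix.
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            # No delimiter left: the object sits directly at this level.
            items.append(name)
        else:
            # Roll everything below the first delimiter up into one prefix.
            prefixes.add(prefix + rest[:cut + len(delimiter)])
    return items, sorted(prefixes)


# Hypothetical object names, loosely modelled on the bucket in the question.
names = [
    "tiles/01/C/granule.jp2",
    "tiles/01/D/granule.jp2",
    "tiles/02/K/granule.jp2",
    "index.csv",
]

# '/' rolls every object up into a prefix, so no items are returned.
print(simulate_listing(names, "tiles/", "/"))
# '//' never occurs in the names, so nothing is grouped and all items come back.
print(simulate_listing(names, "tiles/", "//"))
```

In this sketch the first call gives an empty item list plus the prefixes tiles/01/ and tiles/02/, which matches the empty loop in the question; the second call gives all three objects under tiles/ and no prefixes.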

Answered By: Mischa Lisovyi

Here is a faster way (found in a GitHub thread, posted by @evanj: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920):

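def list_gcs_directories(bucket, prefix):
    # Return the set of "folder" prefixes directly below the given prefix.
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    # Each page carries the prefixes found in that page of results.
    for page in iterator.pages:
        print(page, page.prefixes)  # debug output; can be removed
        prefixes.update(page.prefixes)
    return prefixes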

You want to call this function as follows:

client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj, prefix='my/prefix/path/within/bucket/')

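# Getting rid of the prefix: keep only the last path component.
# The returned prefixes end with '/', so strip it first; otherwise
# split('/')[-1] would be an empty string.
list_folders = [indiv_folder.rstrip('/').split('/')[-1]
                for indiv_folder in list_folders]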

Answered By: Antoine Neidecker