How to get list_blobs to behave like gsutil
Question:
I would like to only get the first level of a fake folder structure on GCS.
If I run e.g.:
gsutil ls 'gs://gcp-public-data-sentinel-2/tiles/'
I get a list like this:
gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
gs://gcp-public-data-sentinel-2/tiles/11/
gs://gcp-public-data-sentinel-2/tiles/12/
gs://gcp-public-data-sentinel-2/tiles/13/
gs://gcp-public-data-sentinel-2/tiles/14/
gs://gcp-public-data-sentinel-2/tiles/15/
...
Running code like the following with the Python API gives me an empty result:
from google.cloud import storage

bucket_name = 'gcp-public-data-sentinel-2'
prefix = 'tiles/'
storage_client = storage.Client()
bucket = storage_client.get_bucket(bucket_name)
for blob in bucket.list_blobs(max_results=10, prefix=prefix,
                              delimiter='/'):
    print(blob.name)
If I don’t use the delimiter option, I get all the results in the bucket, which is not very useful.
Answers:
Maybe not the best way, but, inspired by this comment on the official repository:
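# Note: _get_next_page_response() is a private method (leading underscore),
# and this reads only the first page of results.
iterator = bucket.list_blobs(delimiter='/', prefix=prefix)
response = iterator._get_next_page_response()
for sub_prefix in response['prefixes']:
    print('gs://' + bucket_name + '/' + sub_prefix)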
Gives:
gs://gcp-public-data-sentinel-2/tiles/01/
gs://gcp-public-data-sentinel-2/tiles/02/
gs://gcp-public-data-sentinel-2/tiles/03/
gs://gcp-public-data-sentinel-2/tiles/04/
gs://gcp-public-data-sentinel-2/tiles/05/
gs://gcp-public-data-sentinel-2/tiles/06/
gs://gcp-public-data-sentinel-2/tiles/07/
gs://gcp-public-data-sentinel-2/tiles/08/
gs://gcp-public-data-sentinel-2/tiles/09/
gs://gcp-public-data-sentinel-2/tiles/10/
...
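A variant of the same idea that avoids the private method is sketched below. It assumes that the iterator returned by list_blobs exposes a prefixes attribute which is filled in as the results are consumed, which is how I understand the google-cloud-storage client to behave. list_first_level is a hypothetical helper name, not part of the library.

```python
def list_first_level(bucket, prefix, delimiter='/'):
    """Return the sorted first-level "folder" prefixes under `prefix`.

    `bucket` is expected to be a google.cloud.storage Bucket. Assumption:
    the iterator from list_blobs() fills `prefixes` while it is consumed.
    """
    iterator = bucket.list_blobs(prefix=prefix, delimiter=delimiter)
    # The prefixes are only populated once the results have been iterated.
    for _ in iterator:
        pass
    return sorted(iterator.prefixes)


# Usage against a real bucket (requires credentials and network access):
#   from google.cloud import storage
#   bucket = storage.Client().bucket('gcp-public-data-sentinel-2')
#   for p in list_first_level(bucket, 'tiles/'):
#       print('gs://' + bucket.name + '/' + p)
```

Looping over the iterator itself, rather than requesting a single page, should also pick up prefixes that arrive on later pages.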
If one finds this ticket like me after a long time: currently (google-cloud-storage 2.1.0) one can list the bucket contents using '//' instead of '/'. However, it lists "recursively" down to the actual blob (as it is not a real FS).
Here is a faster way (found in a GitHub thread, posted by @evanj: https://github.com/GoogleCloudPlatform/google-cloud-python/issues/920):
def list_gcs_directories(bucket, prefix):
    # With a delimiter set, each page reports the "sub-directory" prefixes it saw.
    iterator = bucket.list_blobs(prefix=prefix, delimiter='/')
    prefixes = set()
    for page in iterator.pages:
        print(page, page.prefixes)
        prefixes.update(page.prefixes)
    return prefixes
You want to call this function as follows:
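from google.cloud import storage

client = storage.Client()
bucket_name = 'my_bucket_name'
bucket_obj = client.bucket(bucket_name)
list_folders = list_gcs_directories(bucket_obj, prefix='my/prefix/path/within/bucket/')
# Getting rid of the prefix. Each returned prefix ends with '/', so strip it
# before splitting; otherwise the last path component is an empty string.
list_folders = [indiv_folder.rstrip('/').split('/')[-1]
                for indiv_folder in list_folders]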
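To check the prefix-stripping step without touching GCS, here is a small standalone sketch. folder_names is a hypothetical helper, and the sample prefixes are taken from the gsutil listing in the question; it assumes the prefixes end with the delimiter '/', as in that listing.

```python
def folder_names(prefixes, parent):
    """Return each prefix relative to `parent`, without the trailing '/'."""
    names = []
    for p in prefixes:
        # Drop the parent prefix if present, then the trailing delimiter.
        relative = p[len(parent):] if p.startswith(parent) else p
        names.append(relative.rstrip('/'))
    return sorted(names)


sample = {'tiles/01/', 'tiles/02/', 'tiles/03/'}
print(folder_names(sample, 'tiles/'))  # ['01', '02', '03']
```

Stripping the trailing '/' before taking the name matters: splitting 'tiles/01/' on '/' and taking the last element gives an empty string.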