Boto3 S3: Get files without getting folders
Question:
Using boto3, how can I retrieve all files in my S3 bucket without retrieving the folders?
Consider the following file structure:
file_1.txt
folder_1/
    file_2.txt
    file_3.txt
    folder_2/
        folder_3/
            file_4.txt
In this example I'm only interested in the 4 files.
EDIT:
A manual solution is:
def count_files_in_folder(prefix):
    total = 0
    keys = s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)
    for key in keys['Contents']:
        if key['Key'][-1:] != '/':
            total += 1
    return total
In this case total would be 4.
If I just counted everything returned:
count = len(s3_client.list_objects(Bucket=bucket_name, Prefix=prefix)['Contents'])
the result would be 7 objects (4 files and 3 folders):
file_1.txt
folder_1/
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/
folder_1/folder_2/folder_3/
folder_1/folder_2/folder_3/file_4.txt
I JUST want:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
Answers:
There are no folders in S3. What you have is four files named:
file_1.txt
folder_1/file_2.txt
folder_1/file_3.txt
folder_1/folder_2/folder_3/file_4.txt
Those are the actual names of the objects in S3. If what you want is to end up with:
file_1.txt
file_2.txt
file_3.txt
file_4.txt
all sitting in the same directory on a local file system, you would need to manipulate the object name to strip it down to just the file name. Something like this would work:
import os.path
full_name = 'folder_1/folder_2/folder_3/file_4.txt'
file_name = os.path.basename(full_name)
The variable file_name would then contain 'file_4.txt'.
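The same idea applied to a whole listing: a minimal sketch, using the key names from the question, that skips the "folder" placeholder keys and strips each remaining key down to its bare file name. posixpath is used deliberately, because S3 keys always use '/' regardless of the local operating system.

```python
import posixpath

# Hypothetical object keys, as they would appear in 'Contents' of a listing
keys = [
    'file_1.txt',
    'folder_1/',
    'folder_1/file_2.txt',
    'folder_1/file_3.txt',
    'folder_1/folder_2/folder_3/file_4.txt',
]

# Skip "folder" placeholder keys (they end in '/'), then keep only
# the final path component of each remaining key.
file_names = [posixpath.basename(k) for k in keys if not k.endswith('/')]
print(file_names)  # ['file_1.txt', 'file_2.txt', 'file_3.txt', 'file_4.txt']
```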
S3 is an OBJECT STORE. It does not store files/objects in a directory tree.
Newcomers are often confused by the "folder" option in the console, which is in fact just an arbitrary prefix on the object key.
An object PREFIX is a way to retrieve objects organized by a predefined, fixed key-name prefix structure.
You can imagine a file system that does not let you create directories, but does let you create file names containing a slash "/" or backslash "\" as a delimiter, so that you denote the "level" of a file by a common prefix.
Thus in S3, you can use any of the following to "simulate a directory" that is not actually a directory:
folder1-folder2-folder3-myobject
folder1/folder2/folder3/myobject
folder1\folder2\folder3\myobject
As you can see, an object name can be stored in S3 regardless of which arbitrary folder separator (delimiter) you use.
However, to help users make bulk file transfers to S3, tools such as the AWS CLI and the s3transfer API simplify the steps and create object names that follow your local folder structure.
So if you are sure that all your S3 objects use "/" or "\" as the separator, you can use tools like s3transfer or the AWS CLI to make a simple download just by using the key names.
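To make the prefix/delimiter idea concrete, here is a minimal local sketch of what S3 itself does server-side when you pass Delimiter='/' to a listing call: flat key names are grouped into "folders" purely by their common prefix (S3 returns these under 'CommonPrefixes'). The key list is the hypothetical one from the question.

```python
# Flat object keys, as stored in S3 -- there is no real directory tree
keys = [
    'file_1.txt',
    'folder_1/file_2.txt',
    'folder_1/file_3.txt',
    'folder_1/folder_2/folder_3/file_4.txt',
]
delimiter = '/'

files, common_prefixes = [], set()
for key in keys:
    if delimiter in key:
        # Everything up to and including the first delimiter becomes one "folder"
        common_prefixes.add(key.split(delimiter, 1)[0] + delimiter)
    else:
        files.append(key)

print(files)                     # ['file_1.txt']
print(sorted(common_prefixes))   # ['folder_1/']
```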
Here is quick-and-dirty code using the resource iterator. bucket.objects.filter() returns an iterator that does not have the same 1000-key limit as a single list_objects()/list_objects_v2() call.
import os

import boto3

s3 = boto3.resource('s3')
mybucket = s3.Bucket("mybucket")
# If a blank prefix is given, everything in the bucket is returned
bucket_prefix = "some/prefix/here"
objs = mybucket.objects.filter(Prefix=bucket_prefix)
for obj in objs:
    path, filename = os.path.split(obj.key)
    # Skip "folder" placeholder keys (they end in '/' and have no file name)
    if not filename:
        continue
    # download_file raises an exception if the local folder does not exist
    if path:
        os.makedirs(path, exist_ok=True)
    mybucket.download_file(obj.key, obj.key)
One way to filter out folders is to check the last character of each object's key, provided you are certain that no real file key ends in a forward slash:
for object_summary in objects.all():
    if object_summary.key[-1] == "/":
        continue
As stated in the other answers, S3 does not actually have directory trees. But there is a convenient workaround that exploits the fact that S3 "folders" have zero size, using paginators. This snippet prints the desired output as long as all real files in the bucket have size > 0 (adapt the region to your bucket):
bucket_name = "bucketname"
s3 = boto3.client('s3', region_name='eu-central-1')
paginator = s3.get_paginator('list_objects')
for obj in paginator.paginate(Bucket=bucket_name).search("Contents[?Size > `0`][]"):
    print(obj['Key'])
The filtering is done using JMESPath.
Note: Of course this would also exclude files with size 0, but usually you don’t need storage for empty files.
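For reference, the JMESPath expression above is equivalent to this plain-Python filter, run here against a hypothetical list_objects response body (the keys and sizes are made up for illustration):

```python
# Plain-Python equivalent of the JMESPath filter "Contents[?Size > `0`]"
response = {
    'Contents': [
        {'Key': 'file_1.txt', 'Size': 11},
        {'Key': 'folder_1/', 'Size': 0},          # zero-size "folder" placeholder
        {'Key': 'folder_1/file_2.txt', 'Size': 7},
    ]
}
non_empty = [obj['Key'] for obj in response['Contents'] if obj['Size'] > 0]
print(non_empty)  # ['file_1.txt', 'folder_1/file_2.txt']
```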
Using list_objects_v2 you also get the size of each object, so you can filter the keys (this example uses the AWS SDK for Ruby):
s3_client
  .list_objects_v2(bucket: bucket_name, prefix: prefix)
  .contents
  .select { |e| e.size > 0 }
  .map(&:key)
Following up on @airborne's answer, you can use JMESPath to filter out all keys that end with a "/".
This will still return empty files, but it filters out all non-file keys (unless you have a real file whose name ends with "/", in which case you would have to fetch its content to confirm it is a file).
import boto3

s3 = boto3.client('s3')

def count_files_in_folder(bucket_name: str, prefix: str) -> int:
    paginator = s3.get_paginator('list_objects_v2')
    # Note the capital 'Key': that is the field name in the S3 response
    result = paginator.paginate(Bucket=bucket_name, Prefix=prefix).search("Contents[? !ends_with(Key, '/')]")
    # search() returns a generator, so count items as they stream in
    return sum(1 for _ in result)
This counts all matching keys across every page of results, without any manual pagination handling.
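The ends_with filter above can likewise be expressed in plain Python, shown here against a hypothetical page of 'Contents' entries:

```python
# Local equivalent of the JMESPath expression "Contents[? !ends_with(Key, '/')]"
contents = [
    {'Key': 'file_1.txt', 'Size': 11},
    {'Key': 'folder_1/', 'Size': 0},           # "folder" placeholder, excluded
    {'Key': 'folder_1/file_2.txt', 'Size': 7},
]
file_keys = [obj['Key'] for obj in contents if not obj['Key'].endswith('/')]
print(file_keys)  # ['file_1.txt', 'folder_1/file_2.txt']
```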