Python boto, list contents of specific dir in bucket
Question:
I have S3 access only to a specific directory in an S3 bucket.
For example, with the s3cmd
command, if I try to list the whole bucket:
$ s3cmd ls s3://bucket-name
I get an error: Access to bucket 'my-bucket-url' was denied
But if I try to access a specific directory in the bucket, I can see the contents:
$ s3cmd ls s3://bucket-name/dir-in-bucket
Now I want to connect to the S3 bucket with Python boto. Similarly, with:
bucket = conn.get_bucket('bucket-name')
I get an error: boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
But if I try:
bucket = conn.get_bucket('bucket-name/dir-in-bucket')
The script stalls for about 10 seconds and then prints an error. Below is the full trace. Any idea how to proceed with this?
Note: this question is about the boto version 2 module, not boto3.
Traceback (most recent call last):
  File "test_s3.py", line 7, in <module>
    bucket = conn.get_bucket('bucket-name/dir-name')
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 471, in get_bucket
    return self.head_bucket(bucket_name, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 490, in head_bucket
    response = self.make_request('HEAD', bucket_name, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 633, in make_request
    retry_handler=retry_handler
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1046, in make_request
    retry_handler=retry_handler)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 922, in _mexe
    request.body, request.headers)
  File "/usr/lib/python2.7/httplib.py", line 958, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 992, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 954, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 814, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 776, in send
    self.connect()
  File "/usr/lib/python2.7/httplib.py", line 1157, in connect
    self.timeout, self.source_address)
  File "/usr/lib/python2.7/socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
socket.gaierror: [Errno -2] Name or service not known
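For reference, the gaierror at the bottom of the trace is a DNS failure: boto builds a virtual-hosted-style hostname from whatever bucket name you pass, and a '/' in that string produces a host that can never resolve. A minimal sketch of the idea (the hostname format is assumed from S3's virtual-hosted addressing, not taken from boto's source):

```python
# Sketch: why get_bucket('bucket-name/dir-in-bucket') ends in gaierror.
# The bucket name is embedded in the hostname, so a '/' in it yields an
# unresolvable DNS name (format assumed from S3 virtual-hosted style).
def virtual_host(bucket_name):
    """Build the S3 virtual-hosted hostname a client would try to resolve."""
    return '%s.s3.amazonaws.com' % bucket_name

print(virtual_host('bucket-name'))             # a plausible, resolvable host
print(virtual_host('bucket-name/dir-in-bucket'))  # '/' is illegal in DNS names
```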
Answers:
By default, a get_bucket
call in boto validates that you actually have access to that bucket by performing a HEAD
request on the bucket URL. In this case you don’t want boto to do that, since you don’t have access to the bucket itself. So, do this:
bucket = conn.get_bucket('my-bucket-url', validate=False)
and then you should be able to do something like this to list objects:
for key in bucket.list(prefix='dir-in-bucket'):
    <do something>
If you still get a 403 error, try adding a slash at the end of the prefix:
for key in bucket.list(prefix='dir-in-bucket/'):
    <do something>
Note: this answer was written about the boto version 2 module, which is obsolete by now. At the moment (2020), boto3 is the standard module for working with AWS. See this question for more info: What is the difference between the AWS boto and boto3
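Since the trailing slash can be the difference between a 403 and a successful listing, a tiny helper (hypothetical name, not part of boto) keeps callers honest:

```python
def ensure_dir_prefix(prefix):
    """Return the prefix with exactly one trailing slash.

    Hypothetical helper: normalizes 'dir-in-bucket' -> 'dir-in-bucket/'
    so prefix-restricted IAM policies match as intended.
    """
    return prefix.rstrip('/') + '/'
```

Then `bucket.list(prefix=ensure_dir_prefix('dir-in-bucket'))` works the same whether or not the caller remembered the slash.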
If you want to list all the objects of a folder in your bucket, you can specify it while listing.
import boto
conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(AWS_BUCKET_NAME)
for file in bucket.list("FOLDER_NAME/", "/"):
    <do something with required file>
For boto3:
import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('my_bucket_name')
for object_summary in my_bucket.objects.filter(Prefix="dir_name/"):
    print(object_summary.key)
Boto3 client:
import boto3
_BUCKET_NAME = 'mybucket'
_PREFIX = 'subfolder/'
client = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)

def ListFiles(client):
    """List files in specific S3 URL"""
    response = client.list_objects(Bucket=_BUCKET_NAME, Prefix=_PREFIX)
    for content in response.get('Contents', []):
        yield content.get('Key')

file_list = ListFiles(client)
for file in file_list:
    print('File found: %s' % file)
Using a session:
from boto3.session import Session
_BUCKET_NAME = 'mybucket'
_PREFIX = 'subfolder/'
session = Session(aws_access_key_id=ACCESS_KEY,
                  aws_secret_access_key=SECRET_KEY)
client = session.client('s3')

def ListFilesV1(client, bucket, prefix=''):
    """List files in specific S3 URL"""
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                     Delimiter='/'):
        for content in result.get('Contents', []):
            yield content.get('Key')

file_list = ListFilesV1(client, _BUCKET_NAME, prefix=_PREFIX)
for file in file_list:
    print('File found: %s' % file)
The following code lists all the files in a specific dir of the S3 bucket:
import boto3
s3 = boto3.client('s3')
def get_all_s3_keys(s3_path):
    """
    Get a list of all keys under an S3 path.

    :param s3_path: Path of S3 dir.
    """
    keys = []
    if not s3_path.startswith('s3://'):
        s3_path = 's3://' + s3_path
    bucket = s3_path.split('//')[1].split('/')[0]
    prefix = '/'.join(s3_path.split('//')[1].split('/')[1:])
    kwargs = {'Bucket': bucket, 'Prefix': prefix}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        for obj in resp.get('Contents', []):  # 'Contents' is absent on empty pages
            keys.append(obj['Key'])
        try:
            kwargs['ContinuationToken'] = resp['NextContinuationToken']
        except KeyError:
            break
    return keys
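The ContinuationToken loop in get_all_s3_keys can be exercised without touching AWS by substituting a stub client (hypothetical FakeClient below) that pages the same way list_objects_v2 does, which makes it easy to convince yourself the loop terminates:

```python
class FakeClient:
    """Hypothetical stand-in for a boto3 S3 client: serves keys two per page."""
    def __init__(self, keys, page_size=2):
        self._keys = keys
        self._page = page_size

    def list_objects_v2(self, Bucket, Prefix, ContinuationToken=0):
        # Mimic the real response shape: 'Contents' holds this page's keys,
        # 'NextContinuationToken' appears only while more pages remain.
        start = ContinuationToken
        chunk = self._keys[start:start + self._page]
        resp = {'Contents': [{'Key': k} for k in chunk]}
        if start + self._page < len(self._keys):
            resp['NextContinuationToken'] = start + self._page
        return resp

def list_all_keys(client, bucket, prefix):
    """Same ContinuationToken loop as get_all_s3_keys above."""
    keys, kwargs = [], {'Bucket': bucket, 'Prefix': prefix}
    while True:
        resp = client.list_objects_v2(**kwargs)
        keys.extend(obj['Key'] for obj in resp.get('Contents', []))
        if 'NextContinuationToken' not in resp:
            break
        kwargs['ContinuationToken'] = resp['NextContinuationToken']
    return keys

print(list_all_keys(FakeClient(['a', 'b', 'c']), 'bucket', 'dir/'))
# -> ['a', 'b', 'c']
```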
This can be done using:
s3_client = boto3.client('s3')
objects = s3_client.list_objects_v2(Bucket='bucket_name', Prefix='dir-in-bucket/')
for obj in objects.get('Contents', []):
    print(obj['Key'])
Note that list_objects_v2 returns at most 1,000 keys per call; for larger listings use a paginator or the ContinuationToken loop shown above.
I just had this same problem, and this code does the trick.
import boto3
s3 = boto3.resource("s3")
s3_bucket = s3.Bucket("bucket-name")
prefix = "dir-in-bucket"  # renamed from "dir" to avoid shadowing the built-in
files_in_s3 = [f.key.split(prefix + "/")[1] for f in
               s3_bucket.objects.filter(Prefix=prefix).all()]
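The split trick above drops the leading folder name from each key; plain slicing does the same job without surprises when the folder name happens to appear again later inside a key (a sketch, with a hypothetical helper name):

```python
def strip_prefix(key, prefix):
    """Remove a leading 'dir/' folder prefix from an S3 key, if present."""
    folder = prefix.rstrip('/') + '/'
    return key[len(folder):] if key.startswith(folder) else key

print(strip_prefix('dir-in-bucket/file.txt', 'dir-in-bucket'))  # file.txt
print(strip_prefix('dir-in-bucket/sub/dir-in-bucket', 'dir-in-bucket'))  # sub/dir-in-bucket
```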