Read file content from S3 bucket with boto3
Question:
I read the filenames in my S3 bucket by doing
objs = boto3.client('s3').list_objects(Bucket='my_bucket')
while 'Contents' in objs.keys():
    objs_contents = objs['Contents']
    for i in range(len(objs_contents)):
        filename = objs_contents[i]['Key']
Now, I need to get the actual content of the file, similarly to open(filename).readlines(). What is the best way?
Answers:
boto3 offers a resource model that makes tasks like iterating through objects easier. Unfortunately, StreamingBody doesn't provide readline or readlines.
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Iterates through all the objects, doing the pagination for you. Each obj
# is an ObjectSummary, so it doesn't contain the body. You'll need to call
# get to get the whole body.
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
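If you need the readlines()-style behaviour the question asks for, you can decode the raw bytes yourself and split on newlines. A minimal sketch, using a bytes literal in place of a real Body so it runs without S3 access:

```python
# `raw` stands in for what obj.get()['Body'].read() would return.
raw = b"first line\nsecond line\nthird line\n"

# Decode to str, then split into lines (trailing newlines are stripped).
lines = raw.decode('utf-8').splitlines()
print(lines)  # ['first line', 'second line', 'third line']
```

If you want to preserve the newline characters, as open(filename).readlines() does, use `splitlines(keepends=True)` instead.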
When you want to read a file with a different configuration than the default one, feel free to use either mpu.aws.s3_read(s3path) directly or the copy-pasted code:
def s3_read(source, profile_name=None):
    """
    Read a file from an S3 source.

    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    profile_name : str, optional
        AWS profile

    Returns
    -------
    content : bytes

    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    session = boto3.Session(profile_name=profile_name)
    s3 = session.client('s3')
    bucket_name, key = mpu.aws._s3_path_split(source)
    s3_object = s3.get_object(Bucket=bucket_name, Key=key)
    body = s3_object['Body']
    return body.read()
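The `mpu.aws._s3_path_split` helper above is a private function of the mpu package. A minimal stand-in, purely illustrative (not mpu's actual code), might look like this:

```python
def s3_path_split(s3_path):
    """Split 's3://bucket-name/key/foo.bar' into ('bucket-name', 'key/foo.bar').

    Illustrative stand-in for mpu.aws._s3_path_split; not mpu's actual code.
    """
    if not s3_path.startswith('s3://'):
        raise ValueError(f'Expected an s3:// path, got {s3_path!r}')
    # Drop the scheme, then split on the first '/' into bucket and key.
    bucket, _, key = s3_path[len('s3://'):].partition('/')
    return bucket, key

print(s3_path_split('s3://bucket-name/key/foo.bar'))  # ('bucket-name', 'key/foo.bar')
```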
You might consider the smart_open module, which supports iterators:
from smart_open import smart_open
# stream lines from an S3 object
for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
    print(line.decode('utf8'))
and context managers:
with smart_open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
        print(line.decode('utf8'))
    s3_source.seek(0)  # seek to the beginning
    b1000 = s3_source.read(1000)  # read 1000 bytes
Find smart_open at https://pypi.org/project/smart_open/
If you already know the filename, you can use the boto3 builtin download_fileobj:
import boto3
from io import BytesIO
session = boto3.Session()
s3_client = session.client("s3")
f = BytesIO()
s3_client.download_fileobj("bucket_name", "filename", f)
print(f.getvalue())
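download_fileobj writes into any file-like object opened in binary mode; the BytesIO mechanics can be checked locally without S3. A sketch, where the write call stands in for what download_fileobj does internally:

```python
from io import BytesIO

f = BytesIO()
f.write(b"object contents")  # download_fileobj streams chunks into f like this
print(f.getvalue())          # the full buffer, regardless of the current position
```

Note that getvalue() returns the whole buffer without needing f.seek(0) first, which is why the answer above can print it immediately after the download.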
Using the client instead of resource:
s3 = boto3.client('s3')
bucket = 'bucket_name'
result = s3.list_objects(Bucket=bucket, Prefix='/something/')
for o in result.get('Contents'):
    data = s3.get_object(Bucket=bucket, Key=o.get('Key'))
    contents = data['Body'].read()
    print(contents.decode("utf-8"))
The best way for me is this:
result = s3.list_objects(Bucket=s3_bucket, Prefix=s3_key)
for file in result.get('Contents'):
    data = s3.get_object(Bucket=s3_bucket, Key=file.get('Key'))
    contents = data['Body'].read()
    # Float types are not supported by DynamoDB; use Decimal types instead
    j = json.loads(contents, parse_float=Decimal)
    for item in j:
        timestamp = item['timestamp']
        table.put_item(
            Item={
                'timestamp': timestamp
            }
        )
Once you have the content, you can run it through another loop to write it to a DynamoDB table, for instance.
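The parse_float=Decimal trick used above can be verified locally: DynamoDB rejects Python floats, so json.loads is told to produce Decimal values at parse time. A sketch with a sample JSON payload standing in for the S3 file contents:

```python
import json
from decimal import Decimal

# Sample payload; in the answer above this comes from data['Body'].read().
payload = '[{"timestamp": 1700000000, "value": 3.14}]'

items = json.loads(payload, parse_float=Decimal)
print(type(items[0]['value']))  # Decimal, not float
```

parse_float only affects JSON numbers with a fractional part or exponent; plain integers like the timestamp stay Python ints.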
import boto3
print("started")
s3 = boto3.resource('s3', region_name='region_name',
                    aws_access_key_id='your_access_id',
                    aws_secret_access_key='your_access_key')
obj = s3.Object('bucket_name', 'file_name')
data = obj.get()['Body'].read()
print(data)
This is tested code for accessing file contents in an S3 bucket using boto3. It was working for me as of the date of posting.
from botocore.handlers import disable_signing

def get_file_contents(bucket, prefix):
    s3 = boto3.resource('s3')
    s3.meta.client.meta.events.register('choose-signer.s3.*', disable_signing)
    bucket = s3.Bucket(bucket)
    for obj in bucket.objects.filter(Prefix=prefix):
        key = obj.key
        body = obj.get()['Body'].read()
        print(body)
    return body

get_file_contents('coderbytechallengesandbox', '__cb__')
An alternative to boto3 in this particular case is s3fs.
from s3fs import S3FileSystem

s3 = S3FileSystem()
bucket = 'your-bucket'

def read_file(key):
    with s3.open(f's3://{bucket}/{key}', 'r') as file:  # e.g. s3://your-bucket/file.txt
        return file.readlines()

for path in s3.ls(bucket):        # paths look like 'your-bucket/key'
    key = path.split('/', 1)[1]   # strip the bucket name
    lines = read_file(key)
    ...
Please note that Boto3 has stopped updating Resources, and the recommended approach now is to go back to using the Client. So I believe the answer from @Climbs_lika_Spyder should now be the accepted one.
Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html
Warning: The AWS Python SDK team is no longer planning to support the resources interface in boto3. Requests for new changes involving resource models will no longer be considered, and the resources interface won’t be supported in the next major version of the AWS SDK for Python. The AWS SDK teams are striving to achieve more consistent functionality among SDKs, and implementing customized abstractions in individual SDKs is not a sustainable solution going forward. Future feature requests will need to be considered at the cross-SDK level.