Read file content from S3 bucket with boto3

Question:

I read the filenames in my S3 bucket by doing

import boto3

s3 = boto3.client('s3')
objs = s3.list_objects(Bucket='my_bucket')
if 'Contents' in objs:
    for obj in objs['Contents']:
        filename = obj['Key']

Now I need to get the actual content of the file, similar to open(filename).readlines(). What is the best way?

Asked By: mar tin

Answers:

boto3 offers a resource model that makes tasks like iterating through objects easier. Unfortunately, StreamingBody doesn’t provide readline or readlines.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Iterates through all the objects, doing the pagination for you. Each obj
# is an ObjectSummary, so it doesn't contain the body. You'll need to call
# get to get the whole body.
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
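
If you need readlines-style output, a minimal sketch (assuming the object holds UTF-8 text) is to decode and split the bytes yourself:

lines = body.decode('utf-8').splitlines()  # list of str, like readlines() without the newlines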
Answered By: Jordon Phillips

If you want to read a file with a configuration other than the default one, either use mpu.aws.s3_read(s3path) directly or copy the code below:

import boto3
import mpu.aws


def s3_read(source, profile_name=None):
    """
    Read a file from an S3 source.

    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    profile_name : str, optional
        AWS profile

    Returns
    -------
    content : bytes

    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    session = boto3.Session(profile_name=profile_name)
    s3 = session.client('s3')
    bucket_name, key = mpu.aws._s3_path_split(source)
    s3_object = s3.get_object(Bucket=bucket_name, Key=key)
    body = s3_object['Body']
    return body.read()
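
A hypothetical usage sketch (the profile name is a placeholder):

content = s3_read('s3://bucket-name/key/foo.bar', profile_name='myprofile')
print(content.decode('utf-8'))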
Answered By: Martin Thoma

You might consider the smart_open module, which supports iterators:

from smart_open import smart_open

# stream lines from an S3 object
for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
    print(line.decode('utf8'))

and context managers:

with smart_open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
        print(line.decode('utf8'))

    s3_source.seek(0)  # seek to the beginning
    b1000 = s3_source.read(1000)  # read 1000 bytes

Find smart_open at https://pypi.org/project/smart_open/
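
In newer releases, smart_open exposes open instead of smart_open; a sketch assuming a recent version:

from smart_open import open

with open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
        print(line.decode('utf8'))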

Answered By: caffreyd

If you already know the filename, you can use boto3's built-in download_fileobj:

import boto3
from io import BytesIO

session = boto3.Session()
s3_client = session.client("s3")

f = BytesIO()
s3_client.download_fileobj("bucket_name", "filename", f)
print(f.getvalue())
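
To get back to the readlines() behaviour from the question, rewind the buffer first; BytesIO supports the file API directly:

f.seek(0)              # rewind after the download
lines = f.readlines()  # list of bytes lines, like open(...).readlines()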
Answered By: reubano

Using the client instead of resource:

import boto3

s3 = boto3.client('s3')
bucket = 'bucket_name'
result = s3.list_objects(Bucket=bucket, Prefix='something/')  # prefixes normally have no leading slash
for o in result.get('Contents', []):
    data = s3.get_object(Bucket=bucket, Key=o.get('Key'))
    contents = data['Body'].read()
    print(contents.decode("utf-8"))
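
Note that list_objects returns at most 1000 keys per call; a sketch with a paginator (same names as above) handles larger listings:

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='something/'):
    for o in page.get('Contents', []):
        data = s3.get_object(Bucket=bucket, Key=o['Key'])
        print(data['Body'].read().decode('utf-8'))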
Answered By: Climbs_lika_Spyder

The approach that works best for me is this:

import json
from decimal import Decimal

import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('table_name')  # target DynamoDB table (placeholder name)

result = s3.list_objects(Bucket=s3_bucket, Prefix=s3_key)
for file in result.get('Contents', []):
    data = s3.get_object(Bucket=s3_bucket, Key=file.get('Key'))
    contents = data['Body'].read()
    # DynamoDB does not support float types; parse JSON floats as Decimal instead
    j = json.loads(contents, parse_float=Decimal)
    for item in j:
        timestamp = item['timestamp']
        table.put_item(
            Item={
                'timestamp': timestamp
            }
        )

Once you have the content, you can run it through another loop to write it to a DynamoDB table, for instance.
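
For many items, table.batch_writer batches the writes for you; a minimal sketch using the names above:

with table.batch_writer() as batch:
    for item in j:
        batch.put_item(Item={'timestamp': item['timestamp']})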

Answered By: aerioeus
import boto3

print("started")

s3 = boto3.resource(
    's3',
    region_name='region_name',
    aws_access_key_id='your_access_id',
    aws_secret_access_key='your_access_key',
)

obj = s3.Object('bucket_name', 'file_name')

data = obj.get()['Body'].read()

print(data)
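
Note that hard-coding keys in source files risks leaking them; the same read works with the default credential chain (environment variables, the shared config file, or an IAM role):

import boto3

# credentials are resolved from the default chain rather than the source code
s3 = boto3.resource('s3', region_name='region_name')
obj = s3.Object('bucket_name', 'file_name')
data = obj.get()['Body'].read()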
Answered By: Jagadeesh P

This is correct, tested code for accessing file contents in an S3 bucket using boto3. It works for me as of the date of posting.

import boto3
from botocore.handlers import disable_signing


def get_file_contents(bucket, prefix):
    s3 = boto3.resource('s3')
    s3.meta.client.meta.events.register('choose-signer.s3.*', disable_signing)
    bucket = s3.Bucket(bucket)
    for obj in bucket.objects.filter(Prefix=prefix):
        key = obj.key
        body = obj.get()['Body'].read()
        print(body)
        return body

get_file_contents('coderbytechallengesandbox', '__cb__')
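
The disable_signing handler makes the requests anonymous, which is what a public bucket needs when you have no credentials. An equivalent sketch using botocore's UNSIGNED client config (the key name is a placeholder):

import boto3
from botocore import UNSIGNED
from botocore.config import Config

# anonymous (unsigned) requests via client configuration
s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
body = s3.get_object(Bucket='coderbytechallengesandbox', Key='your-key')['Body'].read()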
Answered By: bilalmohib

An alternative to boto3 in this particular case is s3fs.

from s3fs import S3FileSystem

s3 = S3FileSystem()
bucket = 'your-bucket'

def read_file(key):
    with s3.open(f'{bucket}/{key}', 'r') as file:  # s3://your-bucket/file.txt
        return file.readlines()

# s3.ls lists full paths of the form 'your-bucket/key'
for path in s3.ls(bucket):
    key = path.split('/', 1)[1]
    lines = read_file(key)
    ...
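
s3fs also has one-shot helpers; a sketch reusing the filesystem object above (the key is a placeholder):

data = s3.cat(f'{bucket}/file.txt')  # whole object returned as bytes in one call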
Answered By: Manuel Montoya

Please note that Boto3 has now stopped updating Resources, and the recommended approach is to go back to using the Client.

So I believe the answer from @Climbs_lika_Spyder should now be the accepted answer.

Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html

Warning: The AWS Python SDK team is no longer planning to support the resources interface in boto3. Requests for new changes involving resource models will no longer be considered, and the resources interface won’t be supported in the next major version of the AWS SDK for Python. The AWS SDK teams are striving to achieve more consistent functionality among SDKs, and implementing customized abstractions in individual SDKs is not a sustainable solution going forward. Future feature requests will need to be considered at the cross-SDK level.

Answered By: Code-ReCode