Read a file line by line from S3 using boto?

Question:

I have a CSV file in S3 and I’m trying to read the header line to get the size (these files are created by our users so they could be almost any size). Is there a way to do this using boto? I thought maybe I could use a Python BufferedReader, but I can’t figure out how to open a stream from an S3 key. Any suggestions would be great. Thanks!

Asked By: gignosko

Answers:

It appears that boto has a read() function that can do this. Here’s some code that works for me:

>>> import boto
>>> from boto.s3.key import Key
>>> conn = boto.s3.connect_to_region('ap-southeast-2')
>>> bucket = conn.get_bucket('bucket-name')
>>> k = Key(bucket)
>>> k.key = 'filename.txt'
>>> k.open()
>>> k.read(10)
'This text '

The call to read(n) returns the next n bytes from the object.

Of course, this won’t automatically return “the header line”, but you could call it with a large enough number to return the header line at a minimum.
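For example, a minimal sketch that extracts just the header this way (assuming a freshly opened key, and that the header fits within the first 4096 bytes):

>>> k.open()
>>> chunk = k.read(4096)
>>> header = chunk.split('\n', 1)[0]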

Answered By: John Rotenstein

You may find https://pypi.python.org/pypi/smart_open useful for your task.

From documentation:

import smart_open

for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print(line)
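Since the goal is just the header line, here is a minimal sketch along the same lines (same placeholder URI; smart_open streams the object lazily, so the rest of the file is never downloaded):

import smart_open

# Read only the first line, then stop
with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
    header = next(fin)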
Answered By: Michael Korbakov

Here’s a solution which actually streams the data line by line:

import boto3
from io import TextIOWrapper
from gzip import GzipFile

s3 = boto3.client('s3')

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if the object is gzipped, decompress the stream on the fly
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)

for line in data:
    print(line)  # process line
Answered By: kooshywoosh

With boto3 you can access a raw stream and read line by line.
Just note that the raw stream is a private attribute, for some reason.

import boto3

s3 = boto3.resource('s3', aws_access_key_id='xxx', aws_secret_access_key='xxx')
obj = s3.Object('bucket name', 'file key')

# fetch the body once; every obj.get() call starts a fresh download from the beginning
body = obj.get()['Body']

body._raw_stream.readline() # line 1
body._raw_stream.readline() # line 2
body._raw_stream.readline() # line 3...
Answered By: robertzp

If you want to read multiple files (line by line) with a specific bucket prefix (i.e., in a “subfolder”) you can do this:

import boto3

s3 = boto3.resource('s3', aws_access_key_id='<key_id>', aws_secret_access_key='<access_key>')

bucket = s3.Bucket('<bucket_name>')
for obj in bucket.objects.filter(Prefix='<your prefix>'):
    for line in obj.get()['Body'].read().splitlines():
        print(line.decode('utf-8'))

Note that read() downloads each object in full before splitting it, and the resulting lines are bytes, which is why they are decoded before printing.

Answered By: oneschilling

The most flexible and lowest-cost way to read the file is to read it byte by byte until you find the number of lines you need.

import logging

logger = logging.getLogger(__name__)

line_count = 0
line_data_bytes = b''

# correlate_file_obj is a get_object() response from the author's context
while line_count < 2:

    incoming = correlate_file_obj['Body'].read(1)
    if not incoming:  # EOF: stop if the file has fewer lines than expected
        break
    if incoming == b'\n':
        line_count = line_count + 1

    line_data_bytes = line_data_bytes + incoming

logger.debug("read bytes:")
logger.debug(line_data_bytes)

line_data = line_data_bytes.split(b'\n')

You won’t need to guess about the header size if it can change, you won’t end up downloading the whole file, and you don’t need third-party tools. Granted, you need to make sure the line delimiter in your file is correct and that you keep reading bytes until you find it.
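As a minimal sketch, the loop above can be wrapped in a hypothetical helper (the name read_n_lines and the b'\n' default delimiter are assumptions, not part of the original answer):

def read_n_lines(body, n, delimiter=b'\n'):
    """Read a StreamingBody one byte at a time until n lines have been seen."""
    count = 0
    data = b''
    while count < n:
        byte = body.read(1)
        if not byte:  # EOF before n lines were found
            break
        if byte == delimiter:
            count += 1
        data += byte
    return data.split(delimiter)[:n]

# e.g. header = read_n_lines(s3.get_object(Bucket=bucket, Key=key)['Body'], 1)[0]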

Answered By: KiteCoder

Using boto3:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object(BUCKET, key)
for line in obj.get()['Body']._raw_stream:
    print(line)  # do something with line
Answered By: hansaplast

I know it’s a very old question.

But as of now, we can just use s3_conn.get_object(Bucket=bucket, Key=key)['Body'].iter_lines()
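A minimal usage sketch (the bucket and key names are placeholders; iter_lines() yields raw bytes, so decode as needed):

import boto3

s3_conn = boto3.client('s3')
response = s3_conn.get_object(Bucket='my-bucket', Key='my-key.csv')
for line in response['Body'].iter_lines():
    print(line.decode('utf-8'))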

Answered By: peon

Expanding on kooshywoosh’s answer: using TextIOWrapper (which is very useful) on a StreamingBody from a plain binary file directly isn’t possible, as you’ll get the following error:

"builtins.AttributeError: 'StreamingBody' object has no attribute 'readable'"

However, you can use the following hack mentioned in this long-standing issue on botocore’s GitHub page, and define a very simple wrapper class around StreamingBody:

from io import RawIOBase
...

class StreamingBodyIO(RawIOBase):
    """Wrap a boto StreamingBody in the IOBase API."""
    def __init__(self, body):
        self.body = body

    def readable(self):
        return True

    def read(self, n=-1):
        n = None if n < 0 else n
        return self.body.read(n)

Then, you can simply use the following code:

from io import TextIOWrapper
...

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
data = TextIOWrapper(StreamingBodyIO(response['Body']))
for line in data:
    print(line)  # process line
Answered By: Dean Gurvitz

The codecs module in the stdlib provides a simple way to encode a stream of bytes into a stream of text and provides a generator to retrieve this text line-by-line. It can be used with S3 without much hassle:

import codecs

import boto3


s3 = boto3.resource("s3")
s3_object = s3.Object('my-bucket', 'a/b/c.txt')
line_stream = codecs.getreader("utf-8")

for line in line_stream(s3_object.get()['Body']):
    print(line)
Answered By: alukach