Read a file line by line from S3 using boto?

Question:

I have a CSV file in S3 and I’m trying to read the header line to get the size (these files are created by our users so they could be almost any size). Is there a way to do this using boto? I thought maybe I could use a Python BufferedReader, but I can’t figure out how to open a stream from an S3 key. Any suggestions would be great. Thanks!

Asked By: gignosko

Answers:

It appears that boto has a read() function that can do this. Here’s some code that works for me:

>>> import boto
>>> from boto.s3.key import Key
>>> conn = boto.s3.connect_to_region('ap-southeast-2')
>>> bucket = conn.get_bucket('bucket-name')
>>> k = Key(bucket)
>>> k.key = 'filename.txt'
>>> k.open()
>>> k.read(10)
'This text '

The call to read(n) returns the next n bytes from the object.

Of course, this won’t automatically return “the header line”, but you could call it with a large enough number to return the header line at a minimum.
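For example, a minimal sketch that extracts just the header this way (assuming a freshly opened key, and that the header fits within the first 4096 bytes):

>>> k.open()
>>> chunk = k.read(4096)
>>> header = chunk.split('\n', 1)[0]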

Answered By: John Rotenstein

You may find https://pypi.python.org/pypi/smart_open useful for your task.

From documentation:

import smart_open

for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print(line)
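Since the goal is just the header line, here is a minimal sketch along the same lines (same placeholder URI; smart_open streams the object lazily, so the rest of the file is never downloaded):

import smart_open

# Read only the first line, then stop
with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
    header = next(fin)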
Answered By: Michael Korbakov

Here’s a solution which actually streams the data line by line:

import boto3
from io import TextIOWrapper
from gzip import GzipFile

s3 = boto3.client('s3')

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if the object is gzipped, decompress the stream on the fly
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)

for line in data:
    print(line)  # process line
Answered By: kooshywoosh

With boto3 you can access a raw stream and read line by line.
Just note that the raw stream is a private attribute, for some reason.

import boto3

s3 = boto3.resource('s3', aws_access_key_id='xxx', aws_secret_access_key='xxx')
obj = s3.Object('bucket name', 'file key')

# fetch the body once; every obj.get() call starts a fresh download from the beginning
body = obj.get()['Body']

body._raw_stream.readline() # line 1
body._raw_stream.readline() # line 2
body._raw_stream.readline() # line 3...
Answered By: robertzp

If you want to read multiple files (line by line) with a specific bucket prefix (i.e., in a “subfolder”) you can do this:

import boto3

s3 = boto3.resource('s3', aws_access_key_id='<key_id>', aws_secret_access_key='<access_key>')

bucket = s3.Bucket('<bucket_name>')
for obj in bucket.objects.filter(Prefix='<your prefix>'):
    for line in obj.get()['Body'].read().splitlines():
        print(line.decode('utf-8'))

Note that read() downloads each object in full before splitting it, and the resulting lines are bytes, which is why they are decoded before printing.

Answered By: oneschilling

The most flexible and lowest-cost way to read the file is to read it byte by byte until you find the number of lines you need.

import logging

logger = logging.getLogger(__name__)

line_count = 0
line_data_bytes = b''

# correlate_file_obj is a get_object() response from the author's context
while line_count < 2:

    incoming = correlate_file_obj['Body'].read(1)
    if not incoming:  # EOF: stop if the file has fewer lines than expected
        break
    if incoming == b'\n':
        line_count = line_count + 1

    line_data_bytes = line_data_bytes + incoming

logger.debug("read bytes:")
logger.debug(line_data_bytes)

line_data = line_data_bytes.split(b'\n')

You won’t need to guess about the header size if it can change, you won’t end up downloading the whole file, and you don’t need third-party tools. Granted, you need to make sure the line delimiter in your file is correct and that you keep reading bytes until you find it.
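As a minimal sketch, the loop above can be wrapped in a hypothetical helper (the name read_n_lines and the b'\n' default delimiter are assumptions, not part of the original answer):

def read_n_lines(body, n, delimiter=b'\n'):
    """Read a StreamingBody one byte at a time until n lines have been seen."""
    count = 0
    data = b''
    while count < n:
        byte = body.read(1)
        if not byte:  # EOF before n lines were found
            break
        if byte == delimiter:
            count += 1
        data += byte
    return data.split(delimiter)[:n]

# e.g. header = read_n_lines(s3.get_object(Bucket=bucket, Key=key)['Body'], 1)[0]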

Answered By: KiteCoder

Using boto3:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object(BUCKET, key)
for line in obj.get()['Body']._raw_stream:
    print(line)  # do something with line
Answered By: hansaplast

I know it’s a very old question.

But as of now, we can just use s3_conn.get_object(Bucket=bucket, Key=key)['Body'].iter_lines()
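A minimal usage sketch (the bucket and key names are placeholders; iter_lines() yields raw bytes, so decode as needed):

import boto3

s3_conn = boto3.client('s3')
response = s3_conn.get_object(Bucket='my-bucket', Key='my-key.csv')
for line in response['Body'].iter_lines():
    print(line.decode('utf-8'))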

Answered By: peon

Expanding on kooshywoosh’s answer: using TextIOWrapper (which is very useful) on a StreamingBody from a plain binary file directly isn’t possible, as you’ll get the following error:

"builtins.AttributeError: 'StreamingBody' object has no attribute 'readable'"

However, you can use the following hack mentioned in this long-standing issue on botocore’s GitHub page, and define a very simple wrapper class around StreamingBody:

from io import RawIOBase
...

class StreamingBodyIO(RawIOBase):
    """Wrap a boto StreamingBody in the IOBase API."""
    def __init__(self, body):
        self.body = body

    def readable(self):
        return True

    def read(self, n=-1):
        n = None if n < 0 else n
        return self.body.read(n)

Then, you can simply use the following code:

from io import TextIOWrapper
...

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
data = TextIOWrapper(StreamingBodyIO(response['Body']))
for line in data:
    print(line)  # process line
Answered By: Dean Gurvitz

The codecs module in the stdlib provides a simple way to encode a stream of bytes into a stream of text and provides a generator to retrieve this text line-by-line. It can be used with S3 without much hassle:

import codecs

import boto3


s3 = boto3.resource("s3")
s3_object = s3.Object('my-bucket', 'a/b/c.txt')
line_stream = codecs.getreader("utf-8")

for line in line_stream(s3_object.get()['Body']):
    print(line)
Answered By: alukach