Concatenating Multiple Objects into a Single Pandas DataFrame with AWS S3 Bucket

Question:

I am trying to use a function I found in a previous question, "Reading multiple csv files from S3 bucket with boto3", but I keep getting ValueError: DataFrame constructor not properly called!

This is the code below:

s3 = boto3.resource('s3',aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
bucket = s3.Bucket('test_bucket')
prefix_objs = bucket.objects.filter(Prefix=prefix)
prefix_df = []
for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    df = pd.DataFrame(body)

When I print body, all I get is one long bytes string starting with b'.

Asked By: TH14


Answers:

I use this, and it works well if all your files are under one prefix path. Basically, you create the S3 client, iterate over each object under the prefix, read each file into a DataFrame, append it to a list, and then concatenate the list with pandas.

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

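# List the objects under the prefix (a single call returns at most 1,000 keys)
response = s3.list_objects(Bucket="my-bucket",
                           Prefix="datasets/")

df_list = []

# Read each object as CSV and collect the resulting DataFrames
for file in response["Contents"]:
    obj = s3.get_object(Bucket="my-bucket", Key=file["Key"])
    obj_df = pd.read_csv(obj["Body"])
    df_list.append(obj_df)

# Combine everything into a single DataFrame
df = pd.concat(df_list)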
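
As for the ValueError in the question: it comes from passing the raw bytes returned by obj.get()['Body'].read() straight to pd.DataFrame, which does not parse CSV text. A minimal sketch of a fix, assuming every object is a CSV file with the same columns, wraps the bytes in io.BytesIO and lets pd.read_csv do the parsing. The helper names csv_bytes_to_df and concat_csv_bodies are made up for illustration:

```python
import io

import pandas as pd


def csv_bytes_to_df(body: bytes) -> pd.DataFrame:
    """Parse raw CSV bytes, e.g. the result of obj.get()['Body'].read()."""
    # pd.read_csv expects a path or a file-like object, so wrap the bytes.
    return pd.read_csv(io.BytesIO(body))


def concat_csv_bodies(bodies) -> pd.DataFrame:
    """Concatenate several CSV payloads into one DataFrame."""
    frames = [csv_bytes_to_df(body) for body in bodies]
    # ignore_index=True renumbers the rows instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)
```

In the loop from the question, this would mean replacing df = pd.DataFrame(body) with prefix_df.append(csv_bytes_to_df(body)) and calling pd.concat(prefix_df, ignore_index=True) after the loop.
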
Answered By: thePurplePython

If you install s3fs and fsspec, you can pass the S3 location directly to pd.read_csv, which in my experience is much faster than using s3.get_object:

import boto3
import pandas as pd

s3 = boto3.client("s3",
                  region_name=region_name,
                  aws_access_key_id=aws_access_key_id,
                  aws_secret_access_key=aws_secret_access_key)

response = s3.list_objects(Bucket="my-bucket", Prefix="datasets/")

df = pd.concat([pd.read_csv(f"s3://my-bucket/{file['Key']}") for file in response['Contents']])
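
One caveat with the listing call: a single list_objects request returns at most 1,000 keys, and the result can include keys that are not CSV files, such as "folder" placeholders. A hedged sketch that uses boto3's list_objects_v2 paginator instead is shown here. iter_csv_urls is a hypothetical helper name, and the storage_options keys in the usage comment are the s3fs credential parameters as I understand them, so treat that part as an assumption:

```python
def iter_csv_urls(s3_client, bucket: str, prefix: str):
    """Yield s3:// URLs for every .csv key under the prefix, across all pages."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for item in page.get("Contents", []):
            key = item["Key"]
            if key.lower().endswith(".csv"):
                yield f"s3://{bucket}/{key}"


# Usage sketch (needs boto3, pandas and s3fs installed, plus valid credentials):
#
#   import boto3
#   import pandas as pd
#
#   s3 = boto3.client("s3")
#   opts = {"key": aws_access_key_id, "secret": aws_secret_access_key}
#   urls = iter_csv_urls(s3, "my-bucket", "datasets/")
#   df = pd.concat(
#       [pd.read_csv(url, storage_options=opts) for url in urls],
#       ignore_index=True,
#   )
```

Passing the client in as an argument keeps the helper independent of how the credentials are configured.
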
Answered By: ronkov