Fastest way to move objects within an S3 bucket using boto3

Question:

I need to copy all files from one prefix in S3 to another prefix within the same bucket. My solution is something like:

file_list = [...]  # keys under the first prefix
for file in file_list:
    copy_source = {'Bucket': my_bucket, 'Key': file}
    # copy each object to the same file name under the new prefix
    s3_client.copy(copy_source, my_bucket, new_prefix + file.split('/')[-1])

However, I am only copying 200 tiny files (1 kB each) and the whole procedure takes up to 30 seconds. Surely it must be possible to do this faster?

Asked By: smallbirds

Answers:

So you have a function you need to call on a bunch of things, all of which are independent of each other. You could try multiprocessing.

from multiprocessing import Process

import boto3

s3_client = boto3.client('s3')

def copy_file(file_name, my_bucket):
    copy_source = {'Bucket': my_bucket, 'Key': file_name}
    # copy to the same file name under the new prefix
    s3_client.copy(copy_source, my_bucket, new_prefix + file_name.split('/')[-1])

def main():
    file_list = [...]  # keys under the source prefix

    processes = []
    for file_name in file_list:
        p = Process(target=copy_file, args=[file_name, my_bucket])
        p.start()
        processes.append(p)

    # wait for all the copies to finish
    for p in processes:
        p.join()

They can then all start at (approximately) the same time, instead of each copy having to wait for the previous one to complete.
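
Since each copy mostly waits on the network, a thread pool gets the same overlap with far less overhead than one process per file. A minimal sketch (the bucket and prefix names here are placeholders, and it reuses a single boto3 client, which is generally safe to share across threads):

from concurrent.futures import ThreadPoolExecutor

import boto3

s3_client = boto3.client('s3')
my_bucket = 'my-bucket'      # placeholder bucket name
new_prefix = 'new-prefix/'   # placeholder destination prefix

def copy_file(file_name):
    copy_source = {'Bucket': my_bucket, 'Key': file_name}
    # copy to the same file name under the new prefix
    s3_client.copy(copy_source, my_bucket, new_prefix + file_name.split('/')[-1])

file_list = [...]  # keys under the source prefix

# run up to 20 copies concurrently; each call is I/O-bound
with ThreadPoolExecutor(max_workers=20) as executor:
    list(executor.map(copy_file, file_list))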

Answered By: Jtcruthers

I would do it in parallel. For example:

from multiprocessing import Pool

import boto3

s3_client = boto3.client('s3')

file_list = [...]  # keys under the first prefix

def s3_copier(s3_file):
    copy_source = {'Bucket': my_bucket, 'Key': s3_file}
    # copy to the same file name under the new prefix
    s3_client.copy(copy_source, my_bucket, new_prefix + s3_file.split('/')[-1])

# copy 5 objects at the same time
with Pool(5) as p:
    p.map(s3_copier, file_list)

Answered By: Marcin

So I did a small experiment on moving 500 small 1 kB files between two prefixes in the same S3 bucket, running from a Lambda (1024 MB RAM) in AWS. I made three attempts with each method.

Attempt 1 – Using s3_client.copy:
31 – 32 seconds

Attempt 2 – Using s3_client.copy_object (see the sketch below):
22 – 23 seconds

Attempt 3 – Using multiprocessing.Pool (the answer above):
19 – 20 seconds
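
For reference, the copy_object call from Attempt 2 issues a single request per object, without the managed-transfer overhead of s3_client.copy. A minimal sketch (bucket, key, and prefix names are placeholders):

import boto3

s3_client = boto3.client('s3')

s3_client.copy_object(
    Bucket='my-bucket',           # destination bucket (placeholder)
    Key='new-prefix/file.txt',    # destination key (placeholder)
    CopySource={'Bucket': 'my-bucket', 'Key': 'old-prefix/file.txt'},
)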

Is it possible to do it even faster?

Answered By: smallbirds

I know it's an old post, but maybe someone will get here like I did and wonder what's the most elegant way (IMO) to do it.

awswrangler copy method documentation

If we use the awswrangler PyPI package, we can get good performance and do it in parallel with zero effort.

It will use as many threads as it can, based on what os.cpu_count() returns.

import os

import awswrangler as wr
import boto3

S3 = boto3.resource("s3")
bucket_name = os.environ["BUCKET_NAME"]
BUCKET = S3.Bucket(bucket_name)

def copy_from_old_path():
    source_prefix = "some_prefix"
    new_prefix = "some_new_prefix"
    objects = BUCKET.objects.filter(Prefix=source_prefix)
    keys_list = [obj.key for obj in objects]
    bucket_uri = f"s3://{bucket_name}"
    full_paths_list = [f"{bucket_uri}/{key}" for key in keys_list]  # key includes the source_prefix also
    source_path = f"{bucket_uri}/{source_prefix}/"
    target_path = f"{bucket_uri}/{new_prefix}/"
    wr.s3.copy_objects(full_paths_list, source_path, target_path)

if __name__ == "__main__":
    copy_from_old_path()

Running locally on a MacBook M1 Pro (32 GB RAM), it took me around 20 minutes to copy 24.5 MB spread across 4,475 Parquet files (each around 7 kB).
Don't forget to export your AWS credentials in the CLI before running this, and to also export the environment variable that holds the bucket name.
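
If you want to cap or raise the concurrency instead of relying on the default, wr.s3.copy_objects also takes a use_threads argument; whether it accepts an integer thread count (rather than just True/False) depends on the awswrangler version installed, so treat this as a sketch:

# limit the copy to 8 concurrent threads; True/False always work,
# an integer count requires a reasonably recent awswrangler release
wr.s3.copy_objects(full_paths_list, source_path, target_path, use_threads=8)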

Answered By: Asaf Buchnick