Fastest way to move objects within an S3 bucket using boto3
Question:
I need to copy all files from one prefix in S3 to another prefix within the same bucket. My solution is something like:

file_list = [List of files in first prefix]
for file in file_list:
    copy_source = {'Bucket': my_bucket, 'Key': file}
    # destination key: new prefix plus the file's basename
    s3_client.copy(copy_source, my_bucket, new_prefix + file.split('/')[-1])

However I am only moving 200 tiny files (1 KB each) and this procedure takes up to 30 seconds. It must be possible to do it faster?
Answers:
So you have a function you need to call on a bunch of things, all of which are independent of each other. You could try multiprocessing.

from multiprocessing import Process

import boto3

def copy_file(file_name, my_bucket):
    # each child process creates its own client
    s3_client = boto3.client('s3')
    copy_source = {'Bucket': my_bucket, 'Key': file_name}
    s3_client.copy(copy_source, my_bucket, new_prefix + file_name.split('/')[-1])

def main():
    file_list = [...]
    processes = []
    for file_name in file_list:
        p = Process(target=copy_file, args=[file_name, my_bucket])
        p.start()
        processes.append(p)
    for p in processes:
        p.join()

Then they can all start at (approximately) the same time, instead of each copy waiting for the previous one to finish.
I would do it in parallel. For example:

from multiprocessing import Pool

file_list = [List of files in first prefix]

def s3_copier(s3_file):
    copy_source = {'Bucket': my_bucket, 'Key': s3_file}
    s3_client.copy(copy_source, my_bucket, new_prefix + s3_file.split('/')[-1])

# copy 5 objects at the same time
if __name__ == '__main__':
    with Pool(5) as p:
        p.map(s3_copier, file_list)
So I did a small experiment on moving 500 small 1 KB files between prefixes in the same S3 bucket, running from a Lambda (1,024 MB RAM) in AWS. I did three runs of each method.

Attempt 1 – Using s3_client.copy:
31 – 32 seconds

Attempt 2 – Using s3_client.copy_object:
22 – 23 seconds

Attempt 3 – Using a multiprocessing Pool (the answer above):
19 – 20 seconds

Is it possible to do it even faster?
I know it's an old post, but maybe someone will get here like I did and wonder what's the most elegant way (IMO) to do it.

awswrangler copy method documentation

If we use the awswrangler PyPI package, we can get good performance and do it in parallel, with zero effort. It uses as many threads as it can, according to what os.cpu_count() returns.
import os

import awswrangler as wr
import boto3

S3 = boto3.resource("s3")
bucket_name = os.environ["BUCKET_NAME"]
BUCKET = S3.Bucket(bucket_name)

def copy_from_old_path():
    source_prefix = "some_prefix"
    new_prefix = "some_new_prefix"
    objects = BUCKET.objects.filter(Prefix=source_prefix)
    keys_list = [obj.key for obj in objects]
    bucket_uri = f"s3://{bucket_name}"
    full_paths_list = [f"{bucket_uri}/{key}" for key in keys_list]  # key includes the source_prefix also
    source_path = f"{bucket_uri}/{source_prefix}/"
    target_path = f"{bucket_uri}/{new_prefix}/"
    wr.s3.copy_objects(full_paths_list, source_path, target_path)

if __name__ == "__main__":
    copy_from_old_path()
When running locally from a MacBook M1 Pro (32 GB RAM), it took me around 20 minutes to copy 24.5 MB spread across 4,475 parquet files (each around 7 KB).
Don't forget to export AWS credentials in the CLI before running this, and to export the environment variable that holds the bucket name.