How can I copy files bigger than 5 GB in Amazon S3?
Question:
The Amazon S3 REST API documentation says there’s a size limit of 5 GB for an upload in a single PUT operation. Files bigger than that have to be uploaded using multipart. Fine.
However, what I need in essence is to rename files that might be bigger than that. As far as I know there’s no rename or move operation, so I have to copy the file to the new location and delete the old one. How exactly is that done with files bigger than 5 GB? Do I have to do a multipart upload from the bucket to itself? In that case, how does splitting the file into parts work?
From reading boto’s source, it doesn’t seem to do anything like this automatically for files bigger than 5 GB. Is there any built-in support that I missed?
Answers:
As far as I know there’s no rename or move operation, therefore I have
to copy the file to the new location and delete the old one.
That’s correct, it’s pretty easy to do for objects/files smaller than 5 GB by means of a PUT Object – Copy operation, followed by a DELETE Object operation (both of which are supported in boto of course, see copy_key() and delete_key()):
This implementation of the PUT operation creates a copy of an object
that is already stored in Amazon S3. A PUT copy operation is the same
as performing a GET and then a PUT. Adding the request header,
x-amz-copy-source, makes the PUT operation copy the source object into
the destination bucket.
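For the small-object case, the copy-then-delete sequence can be sketched with the two boto methods named above. This is only a minimal sketch: it assumes bucket is a boto (version 2) Bucket object, the helper name rename_small_key is hypothetical, and error handling is left out.

```python
def rename_small_key(bucket, old_name, new_name):
    """Rename a key of at most 5 GB within one bucket: copy, then delete.

    `bucket` is assumed to be a boto (version 2) Bucket object.
    `rename_small_key` is a hypothetical helper name, not part of boto.
    """
    # Server-side copy (PUT Object - Copy, using x-amz-copy-source).
    bucket.copy_key(new_name, bucket.name, old_name)
    # Reached only if copy_key() returned without raising.
    bucket.delete_key(old_name)
```

Deleting only after the copy call has returned keeps the original in place if the copy fails.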
However, that’s indeed not possible for objects/files greater than 5 GB:
Note
[…] You create a copy of your object up to 5 GB in size in a single atomic
operation using this API. However, for copying an object greater than
5 GB, you must use the multipart upload API. For conceptual
information […], go to Uploading Objects Using Multipart Upload […] [emphasis mine]
Boto meanwhile supports this as well by means of the copy_part_from_key() method; unfortunately the required approach isn’t documented outside of the respective pull request #425 (allow for multi-part copy commands) (I haven’t tried this myself yet though):
import boto

# Connect with placeholder credentials and open the destination bucket.
s3 = boto.connect_s3('access', 'secret')
b = s3.get_bucket('destination_bucket')

# Start a multipart upload for the destination key.
mp = b.initiate_multipart_upload('tmp/large-copy-test.mp4')

# Copy the source object server-side, one inclusive byte range per part:
# copy_part_from_key(src_bucket_name, src_key_name, part_num, start, end)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 1, 0, 999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 2, 1000000000, 1999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 3, 2000000000, 2999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 4, 3000000000, 3999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 5, 4000000000, 4999999999)
mp.copy_part_from_key('source_bucket', 'path/to/source/key', 6, 5000000000, 5500345712)

# Assemble the copied parts into the final object.
mp.complete_upload()
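The byte ranges in the example above are written out by hand. As a rough sketch of how they could be computed instead, the helper below yields one inclusive range per part; the name part_ranges and its parameters are hypothetical and not part of boto.

```python
def part_ranges(total_bytes, part_size):
    """Yield (part_number, first_byte, last_byte) with inclusive byte offsets.

    Hypothetical helper for illustration; not part of boto.
    """
    if total_bytes <= 0 or part_size <= 0:
        raise ValueError("total_bytes and part_size must be positive")
    part_number = 1
    for start in range(0, total_bytes, part_size):
        # The last part may be shorter than part_size.
        end = min(start + part_size, total_bytes) - 1
        yield part_number, start, end
        part_number += 1


# A 5,500,345,713-byte object split into 1,000,000,000-byte parts gives
# the same six ranges as the hand-written calls above.
for part in part_ranges(5500345713, 1000000000):
    print(part)
```

Each tuple could then be passed on as mp.copy_part_from_key(source_bucket_name, source_key_name, part_number, first_byte, last_byte).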
You might want to study the respective samples on how to achieve this in Java or .NET eventually, which might provide more insight into the general approach, see Copying Objects Using the Multipart Upload API.
Good luck!
Appendix
Please be aware of the following peculiarity regarding copying in general, which is easily overlooked:
When copying an object, you can preserve most of the metadata
(default) or specify new metadata. However, the ACL is not preserved
and is set to private for the user making the request. To override the
default ACL setting, use the x-amz-acl header to specify a new ACL
when generating a copy request. For more information, see Amazon S3
ACLs. [emphasis mine]
The above was very close to working; unfortunately, it should have ended with mp.complete_upload() instead of the typo upload_complete()!
I’ve added a working boto S3 multipart copy script here, based on the AWS Java example and tested with files over 5 GiB:
I found this method for uploading files bigger than 5 GB and modified it to work with a boto copy procedure.
Here’s the original: http://boto.cloudhackers.com/en/latest/s3_tut.html
import math
from boto.s3.connection import S3Connection

# Replace the placeholder strings with your own host and credentials.
conn = S3Connection(host='your_host', aws_access_key_id='your_access_key',
                    aws_secret_access_key='your_secret_access_key')

from_bucket = conn.get_bucket('your_from_bucket_name')
key = from_bucket.lookup('my_key_name')
dest_bucket = conn.get_bucket('your_to_bucket_name')

total_bytes = key.size
bytes_per_chunk = 500000000
chunks_count = int(math.ceil(total_bytes / float(bytes_per_chunk)))

file_upload = dest_bucket.initiate_multipart_upload(key.name)
for i in range(chunks_count):
    offset = i * bytes_per_chunk
    remaining_bytes = total_bytes - offset
    print(str(remaining_bytes))
    next_byte_chunk = min([bytes_per_chunk, remaining_bytes])
    part_number = i + 1
    # The first argument is the source bucket; the byte range is inclusive.
    file_upload.copy_part_from_key(from_bucket.name, key.name, part_number,
                                   offset, offset + next_byte_chunk - 1)
file_upload.complete_upload()
The now-standard .copy method in boto3 performs a multipart copy automatically for files larger than 5 GB. See the official docs.
import boto3
s3 = boto3.resource('s3')
copy_source = {
    'Bucket': 'mybucket',
    'Key': 'mykey'
}
s3.meta.client.copy(copy_source, 'otherbucket', 'otherkey')
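Since the original question was about renaming, the copy above can be combined with a delete. The following is a minimal sketch, assuming client is a boto3 S3 client (for example boto3.client('s3')); the helper name rename_object is hypothetical, and retries and error handling are omitted.

```python
def rename_object(client, bucket, old_key, new_key):
    """Copy old_key to new_key in the same bucket, then delete old_key.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client('s3').
    `rename_object` is a hypothetical helper name, not part of boto3.
    """
    copy_source = {'Bucket': bucket, 'Key': old_key}
    # Managed copy; boto3 uses a multipart copy for large objects.
    client.copy(copy_source, bucket, new_key)
    # Reached only if the copy did not raise.
    client.delete_object(Bucket=bucket, Key=old_key)
```

Usage would look like rename_object(boto3.client('s3'), 'mybucket', 'mykey', 'mynewkey').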