Sync two buckets through boto3

Question:

Is there any way to use boto3 to loop through the contents of two different buckets (source and target) and, if it finds any key in the source that does not exist in the target, upload it to the target bucket? Please note I do not want to use aws s3 sync. I am currently using the following code for this job:

import boto3

s3 = boto3.resource('s3')
src = s3.Bucket('sourcenabcap')
dst = s3.Bucket('destinationnabcap')
objs = list(dst.objects.all())
for k in src.objects.all():
    if k.key != objs[0].key:
        # copy the k.key to target
Asked By: milad ahmadi


Answers:

If you only wish to compare by Key (ignoring differences within objects), you could use something like:

import boto3

s3 = boto3.resource('s3')
source_bucket = s3.Bucket('source')
destination_bucket = s3.Bucket('destination')
destination_keys = [obj.key for obj in destination_bucket.objects.all()]
for obj in source_bucket.objects.all():
    if obj.key not in destination_keys:
        # copy obj.key to destination
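
The commented line is where the actual copy goes. A minimal sketch of that part using the resource API's Bucket.copy, continuing the snippet above (the bucket names are the same placeholders):

for obj in source_bucket.objects.all():
    if obj.key not in destination_keys:
        # Server-side copy of the missing object into the destination bucket.
        copy_source = {'Bucket': source_bucket.name, 'Key': obj.key}
        destination_bucket.copy(copy_source, obj.key)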
Answered By: John Rotenstein

In case you decide not to use boto3: the sync command is still not available in boto3, so you could call the AWS CLI directly.

# python 3

import os

sync_command = f"aws s3 sync s3://source-bucket/ s3://destination-bucket/"
os.system(sync_command)
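
If you go this route, subprocess.run with check=True is a slightly safer way to shell out than os.system, since it raises an exception when the command exits non-zero; a minimal sketch using the same placeholder bucket names:

import subprocess

# Run the AWS CLI sync and raise CalledProcessError if it fails.
subprocess.run(
    ["aws", "s3", "sync", "s3://source-bucket/", "s3://destination-bucket/"],
    check=True,
)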
Answered By: anyavacy

I’ve just implemented a simple class for this (syncing a local folder to a bucket). I’m posting it here hoping it helps anyone with the same issue.

You could modify S3Sync.sync in order to take file size into account; a sketch of that is included after the class code below.

from pathlib import Path
from bisect import bisect_left
from typing import List

import boto3


class S3Sync:
    """
    Class that holds the operations needed to synchronize local dirs to a given bucket.
    """

    def __init__(self):
        self._s3 = boto3.client('s3')

    def sync(self, source: str, dest: str) -> None:
        """
        Sync source to dest; this means that all elements that exist in
        source but not in dest will be copied to dest.

        No element will be deleted.

        :param source: Source folder.
        :param dest: Destination folder.

        :return: None
        """

        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Get the keys and sort them so we can perform a binary search
        # each time we want to check whether a path is already there.
        object_keys = [obj['Key'] for obj in objects]
        object_keys.sort()
        object_keys_length = len(object_keys)
        
        for path in paths:
            # Binary search: find where `path` would land in the sorted key
            # list and check whether the key is actually present there.
            index = bisect_left(object_keys, path)
            if index == object_keys_length or object_keys[index] != path:
                # If path is not found in object_keys, it has to be sync-ed.
                self._s3.upload_file(str(Path(source).joinpath(path)), Bucket=dest, Key=path)

    def list_bucket_objects(self, bucket: str) -> List[dict]:
        """
        List all objects for the given bucket.

        :param bucket: Bucket name.
        :return: A list of dicts describing the elements in the bucket.

        Example of a single object.

        {
            'Key': 'example/example.txt',
            'LastModified': datetime.datetime(2019, 7, 4, 13, 50, 34, 893000, tzinfo=tzutc()),
            'ETag': '"b11564415be7f58435013b414a59ae5c"',
            'Size': 115280,
            'StorageClass': 'STANDARD',
            'Owner': {
                'DisplayName': 'webfile',
                'ID': '75aa57f09aa0c8caeab4f8c24e99d10f8e7faeebf76c078efc7c6caea54ba06a'
            }
        }

        """
        try:
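            # Note: list_objects returns at most 1000 objects per call;
            # for larger buckets you would need to paginate.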
            contents = self._s3.list_objects(Bucket=bucket)['Contents']
        except KeyError:
            # No Contents Key, empty bucket.
            return []
        else:
            return contents

    @staticmethod
    def list_source_objects(source_folder: str) -> List[str]:
        """
        :param source_folder:  Root folder for resources you want to list.
        :return: A list of str containing the relative names of the files.

        Example:

            /tmp
                - example
                    - file_1.txt
                    - some_folder
                        - file_2.txt

            >>> sync.list_source_objects("/tmp/example")
            ['file_1.txt', 'some_folder/file_2.txt']

        """

        path = Path(source_folder)

        paths = []

        for file_path in path.rglob("*"):
            if file_path.is_dir():
                continue
            str_file_path = str(file_path)
            str_file_path = str_file_path.replace(f'{str(path)}/', "")
            paths.append(str_file_path)

        return paths


if __name__ == '__main__':
    sync = S3Sync()
    sync.sync("/temp/some_folder", "some_bucket_name")

Also, replacing if file_path.is_dir(): with if not file_path.is_file(): lets it skip symlinks that don’t resolve and other non-regular files; thanks @keithpjolley for pointing this out.
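
As mentioned above, here is a minimal sketch of how the size check could look, as a method you could add to S3Sync; the name sync_with_size and the dict-based lookup are my own additions, not part of the original class. It re-uploads a file whenever its key is missing from the bucket or the stored size differs from the local size:

    def sync_with_size(self, source: str, dest: str) -> None:
        """Like sync(), but also re-uploads files whose size differs."""
        paths = self.list_source_objects(source_folder=source)
        objects = self.list_bucket_objects(dest)

        # Map each existing key to its size so both the membership check
        # and the size comparison are simple dict lookups.
        sizes_by_key = {obj['Key']: obj['Size'] for obj in objects}

        for path in paths:
            local_file = Path(source).joinpath(path)
            local_size = local_file.stat().st_size
            if sizes_by_key.get(path) != local_size:
                # Key missing or size mismatch: upload (or overwrite) the object.
                self._s3.upload_file(str(local_file), Bucket=dest, Key=path)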

Answered By: Raydel Miranda

  1. Get the destination account ID DEST_ACCOUNT_ID

  2. Create the source bucket and add this policy

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DelegateS3Access",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::DEST_ACCOUNT_ID:root"
                },
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::s3-copy-test/*",
                    "arn:aws:s3:::s3-copy-test"
                ]
            }
        ]
    }

  3. Create the files to be copied

  4. Create a user on the destination account and configure the AWS CLI with this user

  5. Create the destination bucket on the destination account

  6. Attach this policy to the IAM user on the destination account

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::s3-copy-test",
                    "arn:aws:s3:::s3-copy-test/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:PutObjectAcl"
                ],
                "Resource": [
                    "arn:aws:s3:::s3-copy-test-dest",
                    "arn:aws:s3:::s3-copy-test-dest/*"
                ]
            }
        ]
    }

  7. Execute the file sync

aws s3 sync s3://s3-copy-test s3://s3-copy-test-dest --source-region eu-west-1 --region eu-west-1
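
If you would rather stay in boto3 for this last step (the question rules out aws s3 sync), a minimal sketch of the same copy run with the destination account's credentials and the example bucket names above; note it copies every listed object rather than only the changed ones, and the paginator handles buckets with more than 1000 keys:

import boto3

s3 = boto3.client('s3')  # uses the IAM user configured on the destination account

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='s3-copy-test'):
    for obj in page.get('Contents', []):
        # Server-side copy from the source bucket into the destination bucket.
        s3.copy(
            CopySource={'Bucket': 's3-copy-test', 'Key': obj['Key']},
            Bucket='s3-copy-test-dest',
            Key=obj['Key'],
        )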
Answered By: Dagm Fekadu

I implemented a class as well, with a similar idea to the boto3 S3 client approach, except it uses the boto3 DataSync client. Note that DataSync has separate costs.

We had the same problem, but an additional requirement of ours was that we needed to process 10 GB-1 TB per day and match the S3 files of the two buckets exactly: if a file was updated, the destination bucket had to be updated; if it was deleted, the destination file had to be deleted; if it was created, it had to be created.

DataSync's default option 'TransferMode': 'CHANGED' transfers only changed files, where a change includes file names and sizes. The code below defaults to 'PreserveDeletedFiles': 'REMOVE', but based on your question I think you will want 'PreserveDeletedFiles': 'PRESERVE'.

Cost: You only pay for the files moved by a DataSync task. So if a file exists in both buckets and is unchanged, there is no cost.

Perf: Regarding performance, I have no tests to show here, but I tested some buckets a couple of months ago and copied 720 GB in 20 minutes; I do not remember the file count.

Use case: We use DataSync to perform an S3 bucket blue/green update where we do not want S3 replication because it would interfere with the hot S3 bucket while data is loading. Another place we use it is to migrate data during major application changes, when buckets change or when we want to migrate data intra-bucket. I have found the DataSync client to be much faster at moving data than the S3 client; for large files and large numbers of files it runs much quicker than boto3 S3.

Cons:
You will need to create an IAM role that has access to both buckets and their encryption, and that DataSync can assume via STS. DataSync is not really useful if you are moving < 5 GB and < 1000 files, since that is easy enough to accomplish with the boto3 S3 client; the reason is the start-up time for DataSync tasks. There is a small additional cost, I think it was about 10 USD to move 720 GB, but since it only moves changed files this cost is not incurred unless you are modifying that much data in the S3 bucket. The other downside is that there is no way to update the KMS key, so if you are moving data between CMK-encrypted buckets you would have to update the objects to the new KMS key yourself, and that specific action is as slow as the boto3 S3 client.

"""AWS DataSync an aws service to move/copy large amounts of data."""
import logging
import os

import boto3
import tenacity
from botocore import waiter
from botocore.exceptions import WaiterError


logger = logging.getLogger(__name__)


class SourceDirEmptyException(Exception):
    """
    Exception for when DataSync runs on an empty source directory.

    This only occurs when 'PreserveDeletedFiles' = 'REMOVE' and the
    source directory prefix is empty. The DataSync task will fail and continue
    to retry; this exception prevents retries, as they continue to fail.
    """

class DataSyncWaiter(object):
    """A AWS Data sync waiter class."""
    def __init__(self, client):
        """Init."""
        self._client = client
        self._waiter = waiter

    def wait_for_finished(self, task_execution_arn):
        """Wait for data sync to finish."""
        model = self._waiter.WaiterModel({
            "version": 2,
            "waiters": {
                "JobFinished": {
                    "delay": 1,
                    "operation": "DescribeTaskExecution",
                    "description": "Wait until the AWS DataSync task execution finishes",
                    "maxAttempts": 1000000,
                    "acceptors": [
                        {
                            "argument": "Status",
                            "expected": "SUCCESS",
                            "matcher": "path",
                            "state": "success",
                        },
                        {
                            "argument": "Status",
                            "expected": "ERROR",
                            "matcher": "path",
                            "state": "failure",
                        },
                    ],
                }
            },
        })
        self._waiter.create_waiter_with_client("JobFinished", model,
                                               self._client).wait(TaskExecutionArn=task_execution_arn)


class DataSyncClient:
    """A AWS DataSync client."""
    def __init__(self, client, role_arn, waiter: DataSyncWaiter = None) -> None:
        """Init."""
        self._client: boto3.client = client
        if waiter is None:
            waiter = DataSyncWaiter(client=client)
        self._waiter: DataSyncWaiter = waiter
        self._role_arn = role_arn

    def _delete_task(self, task_arn):
        """Delete a AWS DataSync task."""
        response = self._client.delete_task(TaskArn=task_arn)
        return response

    def _list_s3_locations(self):
        """List AWS DataSync locations."""
        locations = self._client.list_locations(MaxResults=100)
        if "Locations" in locations:
            return [x for x in locations["Locations"] if x["LocationUri"].startswith("s3://")]
        return []

    def _create_datasync_s3_location(self, bucket_name: str, subdirectory: str = ""):
        """Create AWS DataSync location."""
        return self._client.create_location_s3(
            Subdirectory=subdirectory,
            S3BucketArn=f"arn:aws:s3:::{bucket_name}",
            S3StorageClass="STANDARD",
            S3Config={"BucketAccessRoleArn": self._role_arn},
        )

    def _find_location_arn(self, bucket_name, subdirectory: str, locations_s3):
        """Find AWS DataSync LocationArn based on bucketname."""
        for x in locations_s3:
            # match the s3 location
            if bucket_name in x["LocationUri"] and subdirectory in x["LocationUri"]:
                # match the roles, these do not update frequently
                location_metadata = self._client.describe_location_s3(LocationArn=x["LocationArn"])
                if location_metadata['S3Config']['BucketAccessRoleArn'] == self._role_arn:
                    return x["LocationArn"]
        return self._create_datasync_s3_location(bucket_name=bucket_name, subdirectory=subdirectory)["LocationArn"]


    def move_data(self,
              task_name: str,
              source_bucket_name: str,
              dest_bucket_name: str,
              subdirectory: str,
              preserve_deleted_files: Literal['PRESERVE', 'REMOVE'] = "REMOVE") -> bool:
        """Move data using AWS DataSync tasks."""
        current_locations = self._list_s3_locations()
        source_s3_location_response = self._find_location_arn(bucket_name=source_bucket_name,
                                                              locations_s3=current_locations,
                                                              subdirectory=subdirectory)
        dest_s3_location_response = self._find_location_arn(bucket_name=dest_bucket_name,
                                                            locations_s3=current_locations,
                                                            subdirectory=subdirectory)
        logger.info("Moving data from SRC:{source} DEST:{dest}".format(
            source=os.path.join(source_bucket_name, subdirectory), dest=os.path.join(dest_bucket_name, subdirectory)))
        task = self._client.create_task(
            SourceLocationArn=source_s3_location_response,
            DestinationLocationArn=dest_s3_location_response,
            Name=f"{task_name}-sync",
            Options={
                "VerifyMode": "POINT_IN_TIME_CONSISTENT",
                "OverwriteMode": "ALWAYS",
                "PreserveDeletedFiles": preserve_deleted_files,
                # 'TransferMode': # 'CHANGED'|'ALL'
            },
        )
        self.start_task_waiting_for_complete(task_arn=task["TaskArn"])
        self._delete_task(task_arn=task["TaskArn"])
        return True

    @tenacity.retry(
        retry=tenacity.retry_if_exception_type(exception_types=WaiterError),
        wait=tenacity.wait_random_exponential(multiplier=0.5),
        stop=tenacity.stop_after_attempt(max_attempt_number=60),
        reraise=True,
        after=tenacity.after_log(logger, logging.INFO),
    )
    def start_task_waiting_for_complete(self, task_arn: str):
        """Start data move task, with retry because sometimes not all files get
        moved.

        It is not clear if this is because of eventual consistency in S3
        or the AWS service just does not handle consistency well.
        """

        try:
            task_started = self._client.start_task_execution(TaskArn=task_arn)
            self._waiter.wait_for_finished(task_execution_arn=task_started["TaskExecutionArn"])
        except Exception as ex:
            # last_response.Result.ErrorDetail: 'DataSync could not detect any objects in the source S3 bucket'
            if isinstance(ex, WaiterError) and ex.last_response['Result']['ErrorCode'] == 'SourceDirEmpty':
                # We do not want DataSync continuing to fail;
                # this only occurs when 'PreserveDeletedFiles' = 'REMOVE'.
                raise SourceDirEmptyException(ex.last_response['Result']['ErrorDetail'])
            raise ex



def data_sync_move_data(task_name: str,
                        data_sync_role_arn: str,
                        source_bucket: str,
                        destination_bucket: str,
                        subdirectory: str,
                        datasync_client: boto3.client = None,
                        preserve_deleted_files: Literal['PRESERVE', 'REMOVE'] = "REMOVE"):
    """Move data from source bucket to destition bucket."""
    logger.info(f"DataSync: Moving all the data from {source_bucket} -> {destination_bucket}")
    if datasync_client is None:
        datasync_client = _utils.get_boto_client("datasync")
    datasync_client = DataSyncClient(client=datasync_client, role_arn=data_sync_role_arn)
    datasync_client.move_data(task_name=task_name,
                              source_bucket_name=source_bucket,
                              dest_bucket_name=destination_bucket,
                              subdirectory=subdirectory,
                              preserve_deleted_files=preserve_deleted_files)

Implementation then is:

DATA_SYNC_ROLE_ARN = {
    "sand": "arn:aws:iam::123456789:role/Bucket-and-DataSync-Access-sand",
    "dev": "arn:aws:iam::123456789:role/Bucket-and-DataSync-Access-dev",
    "stg": "arn:aws:iam::123456789:role/Bucket-and-DataSync-Access-stg",
    "prod": "arn:aws:iam::123456789:role/Bucket-and-DataSync-Access-prod",
}
data_sync_move_data(task_name="migrate_data",
                    data_sync_role_arn=DATA_SYNC_ROLE_ARN[env],
                    source_bucket="old-bucket-name",
                    destination_bucket="new-bucket-name",
                    subdirectory="",  # empty string means the whole bucket
                    datasync_client=boto3.client('datasync'),
                    preserve_deleted_files='REMOVE')  # 'PRESERVE' or 'REMOVE'

IAM Role example:

  Role:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Sub "Bucket-and-DataSync-Access-${Environment}"
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "datasync.amazonaws.com"
            Action:
              - "sts:AssumeRole"
    ...<s3 bucket access and encryption>
Answered By: vfrank66