botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden while using local mode in AWS SageMaker
Question:
from sagemaker.pytorch import PyTorch

trainer = PyTorch(
    entry_point="train.py",
    source_dir="source_dir",  # directory of your training script
    role=role,
    framework_version="1.5.0",
    py_version="py3",
    instance_type="local",  # local mode: the training container runs on this machine
    instance_count=1,
    output_path=output_path,
    hyperparameters=hyperparameters,
)
This code is running on a SageMaker Notebook instance.
Error
Creating dsd3faq5lq-algo-1-ouews ...
Creating dsd3faq5lq-algo-1-ouews ... done
Attaching to dsd3faq5lq-algo-1-ouews
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,444 sagemaker-training-toolkit INFO Imported framework sagemaker_pytorch_container.training
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,475 sagemaker-training-toolkit INFO No GPUs detected (normal if no gpus installed)
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,494 sagemaker_pytorch_container.training INFO Block until all host DNS lookups succeed.
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,507 sagemaker_pytorch_container.training INFO Invoking user training script.
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,673 sagemaker-training-toolkit ERROR Reporting training FAILURE
dsd3faq5lq-algo-1-ouews | 2022-11-10 05:39:26,674 sagemaker-training-toolkit ERROR framework error:
dsd3faq5lq-algo-1-ouews | Traceback (most recent call last):
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/trainer.py", line 85, in train
dsd3faq5lq-algo-1-ouews | entrypoint()
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 121, in main
dsd3faq5lq-algo-1-ouews | train(environment.Environment())
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_pytorch_container/training.py", line 73, in train
dsd3faq5lq-algo-1-ouews | runner_type=runner_type)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/entry_point.py", line 92, in run
dsd3faq5lq-algo-1-ouews | files.download_and_extract(uri=uri, path=environment.code_dir)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 131, in download_and_extract
dsd3faq5lq-algo-1-ouews | s3_download(uri, dst)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/sagemaker_training/files.py", line 167, in s3_download
dsd3faq5lq-algo-1-ouews | s3.Bucket(bucket).download_file(key, dst)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 246, in bucket_download_file
dsd3faq5lq-algo-1-ouews | ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/boto3/s3/inject.py", line 172, in download_file
dsd3faq5lq-algo-1-ouews | extra_args=ExtraArgs, callback=Callback)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/boto3/s3/transfer.py", line 307, in download_file
dsd3faq5lq-algo-1-ouews | future.result()
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 106, in result
dsd3faq5lq-algo-1-ouews | return self._coordinator.result()
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/s3transfer/futures.py", line 265, in result
dsd3faq5lq-algo-1-ouews | raise self._exception
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/s3transfer/tasks.py", line 255, in _main
dsd3faq5lq-algo-1-ouews | self._submit(transfer_future=transfer_future, **kwargs)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/s3transfer/download.py", line 343, in _submit
dsd3faq5lq-algo-1-ouews | **transfer_future.meta.call_args.extra_args
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 357, in _api_call
dsd3faq5lq-algo-1-ouews | return self._make_api_call(operation_name, kwargs)
dsd3faq5lq-algo-1-ouews | File "/opt/conda/lib/python3.6/site-packages/botocore/client.py", line 676, in _make_api_call
dsd3faq5lq-algo-1-ouews | raise error_class(parsed_response, operation_name)
dsd3faq5lq-algo-1-ouews | botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
dsd3faq5lq-algo-1-ouews |
dsd3faq5lq-algo-1-ouews | An error occurred (403) when calling the HeadObject operation: Forbidden
dsd3faq5lq-algo-1-ouews exited with code 1
1
Aborting on container exit...
The same code worked a couple of days ago, but it has been failing since yesterday.
SageMaker SDK version: 2.115.0
I tried changing the SageMaker SDK version to 2.0, but to no avail.
I also uploaded the training code to S3 and passed the S3 URI as "source_dir", but still got the same error.
The problem only occurs when using instance_type = "local".
It works perfectly fine if I use a remote instance such as instance_type = "ml.c4.xlarge".
I also tried uploading and downloading objects directly from S3 using boto3, and that worked perfectly fine.
For example:
import boto3

# Upload an object with the notebook instance's credentials
session = boto3.Session()
s3 = session.resource("s3")
s3_obj = s3.Object(bucket, data_key)
s3_obj.put(Body=data)

# Download the same object again
s3client = boto3.client("s3")
response = s3client.get_object(Bucket=bucket, Key=data_key)
body = response["Body"].read()
The above code works perfectly fine on the same instance.
I am not sure, but if this were just a permissions issue, wouldn't the above code fail as well?
Answers:
Since the code worked until a few days ago and then stopped without any change on your side, others cannot reproduce the problem.
At this point, it depends strictly on the settings of your AWS account.
Looking at the error log, it appears that the credentials used inside the container do not have access to the object in the S3 bucket.
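Look at this question, which discusses the very same error:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
Quoting its answer: "Check the IAM policies associated with the IAM role that the Lambda function is using." In your case, the equivalent is the IAM role attached to the notebook instance and passed as role.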
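One way to narrow this down: the put/get test in the question does not exercise the call that fails. The traceback shows download_file issuing a HeadObject request first, and it does so for the packaged training code (it is downloaded into environment.code_dir), not for data_key. Below is a minimal sketch of a probe for that exact call. The names parse_s3_uri and probe_head_object are made up for illustration, and the client is passed in as an argument so you can run the same check with whichever credentials you want to test, for example boto3.client("s3") on the notebook instance.

```python
def parse_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {uri!r}")
    return bucket, key


def probe_head_object(s3_client, uri):
    """Issue HeadObject for `uri`; return 'ok' or the error code as a string.

    `s3_client` is any object exposing head_object(Bucket=..., Key=...),
    such as the client returned by boto3.client("s3").
    """
    bucket, key = parse_s3_uri(uri)
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
    except Exception as exc:
        # botocore's ClientError keeps the parsed error on `.response`.
        # Anything without that attribute is not an API error, so re-raise it.
        response = getattr(exc, "response", None)
        if not isinstance(response, dict):
            raise
        # Note: S3 answers 403 instead of 404 for a missing key when the
        # caller lacks s3:ListBucket, so "403" can also mean a wrong key.
        return str(response.get("Error", {}).get("Code", "unknown"))
    return "ok"
```

Call it as probe_head_object(boto3.client("s3"), uri), where uri is the S3 location of the uploaded source archive. If it returns "ok" on the notebook instance while the local container still gets 403, the container is probably not making the request with the same credentials or configuration as the notebook. That is a hypothesis to check, not a confirmed cause.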
I changed the code to the version below, replacing sagemaker.pytorch.PyTorch with sagemaker.estimator.Estimator and passing the URI of the relevant Docker image explicitly.
from sagemaker.estimator import Estimator

TRAIN_IMAGE_URI = "763104351884.dkr.ecr.ap-south-1.amazonaws.com/pytorch-training:1.12.1-cpu-py38"

trainer = Estimator(
    image_uri=TRAIN_IMAGE_URI,
    entry_point="train.py",
    source_dir="source_dir",  # directory of your training script
    role=role,
    base_job_name=base_job_name,
    instance_type="local",
    instance_count=1,
    output_path=output_path,
    hyperparameters=hyperparameters,
)
trainer.fit()
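This seems to work perfectly in my case, although I still do not understand why.
Note that the change also swaps the container: the original code used the PyTorch 1.5.0 image (Python 3.6, as the traceback shows), while this one uses the 1.12.1 / Python 3.8 image, so the newer libraries inside the image may be what makes the difference.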