Intermittent ConnectTimeoutError from within Docker, but only when accessing AWS SSM

Question:

My app uses SSM Parameter Store from within a Docker container, both on Fargate and locally, accessed with Boto3 from Python. Multiple developers on my team, in different countries, have seen an intermittent issue that crops up maybe a few times a day during normal development: for roughly 10 minutes, calls to SSM fail with this error:

botocore.exceptions.ConnectTimeoutError: Connect timeout on endpoint URL: "https://ssm.us-east-2.amazonaws.com/"

As far as I’m aware, the ECS instances don’t see the issue; it only shows up when we access the endpoint from our home networks.
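For context, the failing call is roughly the following (a minimal sketch; the parameter path is a placeholder, not our real path):

    import boto3

    ssm = boto3.client("ssm", region_name="us-east-2")

    def load_parameters(path="/my-app/config/"):
        # Fetch all parameters under a path via paginated GetParametersByPath,
        # decrypting SecureString values.
        parameters = {}
        paginator = ssm.get_paginator("get_parameters_by_path")
        for page in paginator.paginate(Path=path, WithDecryption=True):
            for param in page["Parameters"]:
                parameters[param["Name"]] = param["Value"]
        return parameters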

I have added a connectivity test that fetches four URLs right before the Boto3 call that would fail. The first three always succeed; the fourth only succeeds sometimes:

https://www.apple.com                                status_code=200 len=104235 0.0448sec
https://cognito-idp.us-east-2.amazonaws.com          status_code=400 len=113    0.3786sec
https://xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com status_code=401 len=15     0.3859sec
https://ssm.us-east-2.amazonaws.com                  status_code=404 len=29     0.3849sec

(Don’t be confused by the 40x status codes. Those are just because I haven’t sent a real, authenticated request. The key thing is that I received a timely response.)

At other times, this same request fails:

requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='ssm.us-east-2.amazonaws.com', port=443):
Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection
object at 0xffff8e6af550>, 'Connection to ssm.us-east-2.amazonaws.com timed out.
(connect timeout=3)'))

I set the connect timeout to 3 seconds here, but it has also timed out when I let the connection wait for over two minutes. This is a direct HTTPS fetch with requests, so Boto3 isn’t even involved.
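The connectivity test itself is roughly the following (a sketch using plain requests; the exact output formatting is approximate):

    import time
    import requests

    URLS = [
        "https://www.apple.com",
        "https://cognito-idp.us-east-2.amazonaws.com",
        "https://xxxxxxxxxxxx.dkr.ecr.us-east-2.amazonaws.com",
        "https://ssm.us-east-2.amazonaws.com",
    ]

    def connectivity_check(timeout=3):
        # Hit each endpoint unauthenticated and report status, size, and elapsed time.
        for url in URLS:
            start = time.monotonic()
            try:
                resp = requests.get(url, timeout=timeout)
                elapsed = time.monotonic() - start
                print(f"{url} status_code={resp.status_code} len={len(resp.content)} {elapsed:.4f}sec")
            except requests.exceptions.ConnectTimeout as exc:
                print(f"{url} FAILED after {time.monotonic() - start:.1f}sec: {exc}")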

Some things we’ve tried:

  1. Restarting the Docker container sometimes seems to help, but other times doesn’t.
  2. Reducing the number of calls we make to SSM. We’re now down to at most about 2 calls/second per user, with effectively no other users hitting the API concurrently, so we never get anywhere near the 40 requests/second limit. Looking at the logs, the most I can see is 12 requests in one minute. We’re simply not using the API very aggressively, so throttling doesn’t seem to be a plausible cause. All of our calls are paginated calls to GetParametersByPath with WithDecryption=true (as in the sketch above).
  3. Changing the Boto3 retry mode from legacy to standard (see the sketch after this list). This is probably a good thing to do anyway, but it doesn’t seem to have fixed the problem.
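Here is roughly how the client is configured now (a sketch; the specific connect/read timeouts and max_attempts values are illustrative, not tuned recommendations):

    import boto3
    from botocore.config import Config

    # Standard retry mode instead of legacy, with an explicit connect timeout.
    boto_config = Config(
        region_name="us-east-2",
        connect_timeout=3,
        read_timeout=10,
        retries={"mode": "standard", "max_attempts": 5},
    )

    ssm = boto3.client("ssm", config=boto_config)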

The only reliable solution I’ve come up with is to wait. Eventually, the endpoint comes back and my application begins working again. But this is really an unacceptable level of service interruption, and I feel like I must be doing something wrong.

What would make only the SSM host become unreachable so often? I don’t see how this could be an issue with my Docker container if other URLs work just fine. But equally, if requests AND Boto3 are both failing, then it seems like it has to be either my container or the AWS endpoint itself. And obviously the us-east-2 SSM host isn’t constantly going down for minutes at a time.

I have tried pinging the endpoint during the problem, but from outside the container. And the results are… strange:

PING ssm.us-east-2.amazonaws.com (52.95.21.209): 56 data bytes
64 bytes from 52.95.21.209: icmp_seq=0 ttl=229 time=89.665 ms
64 bytes from 52.95.21.209: icmp_seq=1 ttl=229 time=92.928 ms
64 bytes from 52.95.21.209: icmp_seq=2 ttl=229 time=89.970 ms
64 bytes from 52.95.21.209: icmp_seq=3 ttl=229 time=92.004 ms
64 bytes from 52.95.21.209: icmp_seq=4 ttl=229 time=93.007 ms
64 bytes from 52.95.21.209: icmp_seq=5 ttl=229 time=93.066 ms
64 bytes from 52.95.21.209: icmp_seq=6 ttl=229 time=93.358 ms
64 bytes from 52.95.21.209: icmp_seq=7 ttl=229 time=87.980 ms
64 bytes from 52.95.21.209: icmp_seq=8 ttl=229 time=92.416 ms
64 bytes from 52.95.21.209: icmp_seq=9 ttl=229 time=92.361 ms
64 bytes from 52.95.21.209: icmp_seq=10 ttl=229 time=88.709 ms
64 bytes from 52.95.21.209: icmp_seq=11 ttl=229 time=91.613 ms
64 bytes from 52.95.21.209: icmp_seq=12 ttl=229 time=93.175 ms
64 bytes from 52.95.21.209: icmp_seq=13 ttl=229 time=93.545 ms
Request timeout for icmp_seq 14
Request timeout for icmp_seq 15
Request timeout for icmp_seq 16
Request timeout for icmp_seq 17
64 bytes from 52.95.21.209: icmp_seq=18 ttl=229 time=89.668 ms
64 bytes from 52.95.21.209: icmp_seq=19 ttl=229 time=93.205 ms
64 bytes from 52.95.21.209: icmp_seq=20 ttl=229 time=92.234 ms
64 bytes from 52.95.21.209: icmp_seq=21 ttl=229 time=92.995 ms
64 bytes from 52.95.21.209: icmp_seq=22 ttl=229 time=93.140 ms
64 bytes from 52.95.21.209: icmp_seq=23 ttl=229 time=92.720 ms
64 bytes from 52.95.21.209: icmp_seq=24 ttl=229 time=93.945 ms
64 bytes from 52.95.21.209: icmp_seq=25 ttl=229 time=93.641 ms
64 bytes from 52.95.21.209: icmp_seq=26 ttl=229 time=93.599 ms
64 bytes from 52.95.21.209: icmp_seq=27 ttl=229 time=91.851 ms
64 bytes from 52.95.21.209: icmp_seq=28 ttl=229 time=90.349 ms
64 bytes from 52.95.21.209: icmp_seq=29 ttl=229 time=95.998 ms
64 bytes from 52.95.21.209: icmp_seq=30 ttl=229 time=93.568 ms
64 bytes from 52.95.21.209: icmp_seq=31 ttl=229 time=93.292 ms
64 bytes from 52.95.21.209: icmp_seq=32 ttl=229 time=93.491 ms
64 bytes from 52.95.21.209: icmp_seq=33 ttl=229 time=93.167 ms
Request timeout for icmp_seq 34
64 bytes from 52.95.21.209: icmp_seq=35 ttl=229 time=93.613 ms
64 bytes from 52.95.21.209: icmp_seq=36 ttl=229 time=91.564 ms
64 bytes from 52.95.21.209: icmp_seq=37 ttl=229 time=96.495 ms
64 bytes from 52.95.21.209: icmp_seq=38 ttl=229 time=93.870 ms
64 bytes from 52.95.21.209: icmp_seq=39 ttl=229 time=93.629 ms
64 bytes from 52.95.21.209: icmp_seq=40 ttl=229 time=93.487 ms
64 bytes from 52.95.21.209: icmp_seq=41 ttl=229 time=96.892 ms
64 bytes from 52.95.21.209: icmp_seq=42 ttl=229 time=91.220 ms
64 bytes from 52.95.21.209: icmp_seq=43 ttl=229 time=93.394 ms
64 bytes from 52.95.21.209: icmp_seq=44 ttl=229 time=91.774 ms
64 bytes from 52.95.21.209: icmp_seq=45 ttl=229 time=94.031 ms
Request timeout for icmp_seq 46
64 bytes from 52.95.21.209: icmp_seq=47 ttl=229 time=96.748 ms
64 bytes from 52.95.21.209: icmp_seq=48 ttl=229 time=93.024 ms
64 bytes from 52.95.21.209: icmp_seq=49 ttl=229 time=92.414 ms
64 bytes from 52.95.21.209: icmp_seq=50 ttl=229 time=96.475 ms
64 bytes from 52.95.21.209: icmp_seq=51 ttl=229 time=93.447 ms
64 bytes from 52.95.21.209: icmp_seq=52 ttl=229 time=92.959 ms
64 bytes from 52.95.21.209: icmp_seq=53 ttl=229 time=93.353 ms
64 bytes from 52.95.21.209: icmp_seq=54 ttl=229 time=93.371 ms
64 bytes from 52.95.21.209: icmp_seq=55 ttl=229 time=92.530 ms
64 bytes from 52.95.21.209: icmp_seq=56 ttl=229 time=94.401 ms
64 bytes from 52.95.21.209: icmp_seq=57 ttl=229 time=93.797 ms
64 bytes from 52.95.21.209: icmp_seq=58 ttl=229 time=92.076 ms
Request timeout for icmp_seq 59
64 bytes from 52.95.21.209: icmp_seq=60 ttl=229 time=91.602 ms
64 bytes from 52.95.21.209: icmp_seq=61 ttl=229 time=92.835 ms
Request timeout for icmp_seq 62
64 bytes from 52.95.21.209: icmp_seq=63 ttl=229 time=92.903 ms
64 bytes from 52.95.21.209: icmp_seq=64 ttl=229 time=93.302 ms
64 bytes from 52.95.21.209: icmp_seq=65 ttl=229 time=93.623 ms
64 bytes from 52.95.21.209: icmp_seq=66 ttl=229 time=93.638 ms
64 bytes from 52.95.21.209: icmp_seq=67 ttl=229 time=93.395 ms
64 bytes from 52.95.21.209: icmp_seq=68 ttl=229 time=92.432 ms

The points where the pings time out line up exactly with when the container prints the "ConnectTimeoutError" to the Docker console. I don’t know what to make of this.

Is there a setting I have overlooked?

Asked By: Nick K9

Answers:

This turned out to be a Docker Desktop issue. You can work around it by running an older version of Docker Desktop: 4.5.0 on Mac or 4.5.1 on Windows.

Answered By: Nick K9