Mitigating TCP connection resets in AWS Fargate

Question:

I am using Amazon ECS on AWS Fargate. My instances can access the internet, but idle connections drop after 350 seconds. On average, about 5 out of every 100 calls from my service fail with ConnectionResetError: [Errno 104] Connection reset by peer. I found a couple of suggestions for fixing this in my server-side code (see here and here).

Cause

If a connection that’s using a NAT gateway is idle for 350 seconds or more, the connection times out.

When a connection times out, a NAT gateway returns an RST packet to any resources behind the NAT gateway that attempt to continue the connection (it does not send a FIN packet).

Solution

To prevent the connection from being dropped, you can initiate more traffic over the connection. Alternatively, you can enable TCP keepalive on the instance with a value less than 350 seconds.

Existing Code:

url = "url to call http"
params = {
   "year": year,
   "month": month
}
response = self.session.get(url, params=params)

As a band-aid, I am currently retrying failed calls with tenacity:

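from requests.exceptions import HTTPError
from tenacity import retry, retry_if_not_exception_type, stop_after_attempt, wait_fixed


@retry(
    # Retry on any exception except HTTPError; the resets surface as
    # requests.exceptions.ConnectionError.
    retry=retry_if_not_exception_type(HTTPError),
    reraise=True,
    wait=wait_fixed(2),          # wait 2 seconds between attempts
    stop=stop_after_attempt(5),  # give up after 5 attempts
)
def call_to_api(self, year, month):  # method on the class that owns self.session
    url = "url to call HTTP"
    params = {
        "year": year,
        "month": month
    }
    response = self.session.get(url, params=params)
    return response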

So my basic question is: how can I use Python requests correctly to implement either of the solutions below?

  • Close the connection before 350 seconds of inactivity

  • Enable Keep-Alive for TCP connections

Asked By: Always Sunny

Answers:

Concerning the "Close the connection before 350 seconds of inactivity" problem, session.get() accepts a read timeout through its timeout parameter.

According to the docs, "it’s the number of seconds that the client will wait between bytes sent from the server", which, to me, looks like an inactivity timeout.
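
A minimal sketch of passing that timeout, assuming a hypothetical helper named fetch_report and illustrative timeout values (neither is from the question). requests accepts a (connect, read) tuple for timeout. Note that this bounds how long a single request waits for data; the quoted doc does not say it affects a pooled connection sitting idle between requests, so it is probably best treated as a complement to TCP keepalive rather than a replacement.

```python
import requests

# Assumed values for illustration only; tune them for your API.
CONNECT_TIMEOUT = 5   # seconds allowed to establish the TCP connection
READ_TIMEOUT = 30     # max seconds to wait between bytes sent by the server


def fetch_report(session, url, year, month):
    """GET with explicit (connect, read) timeouts.

    Raises a subclass of requests.exceptions.Timeout if either limit is hit.
    """
    params = {"year": year, "month": month}
    return session.get(url, params=params, timeout=(CONNECT_TIMEOUT, READ_TIMEOUT))
```

It would be called as fetch_report(requests.Session(), url, 2024, 1), with the caller catching requests.exceptions.Timeout.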

Answered By: Giorgio Ruffa

Posting my solution for future users who face this issue while working with AWS Fargate + NAT.

We need to set the TCP keepalive settings to the values dictated by our server-side configuration. This PR helped me a lot in fixing my issue: https://github.com/customerio/customerio-python/pull/70/files

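import socket
from urllib3.connection import HTTPConnection

# Enable TCP keepalive on every connection urllib3 (and therefore requests) opens.
# The first probe is sent after 300 s of idleness, below the NAT gateway's 350 s limit.
HTTPConnection.default_socket_options = HTTPConnection.default_socket_options + [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
    (socket.SOL_TCP, socket.TCP_KEEPIDLE, 300),  # seconds idle before the first probe
    (socket.SOL_TCP, socket.TCP_KEEPINTVL, 60),  # seconds between probes
]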
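
If patching the urllib3 default for the whole process is too broad, the same socket options can probably be scoped to a single requests session through a custom transport adapter. The sketch below is one way to do that; KeepAliveAdapter, keepalive_options, make_session, and the TCP_KEEPCNT value are my own illustrative additions, not part of the PR. The TCP_KEEP* constants are platform-specific, so they are only applied where the socket module exposes them.

```python
import socket

import requests
from requests.adapters import HTTPAdapter
from urllib3.connection import HTTPConnection


def keepalive_options(idle=300, interval=60, count=3):
    """Build urllib3 socket options with TCP keepalive enabled."""
    options = list(HTTPConnection.default_socket_options)
    options.append((socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1))
    # These constants exist on Linux; skip any the platform does not define.
    tuning = (("TCP_KEEPIDLE", idle), ("TCP_KEEPINTVL", interval), ("TCP_KEEPCNT", count))
    for name, value in tuning:
        if hasattr(socket, name):
            options.append((socket.IPPROTO_TCP, getattr(socket, name), value))
    return options


class KeepAliveAdapter(HTTPAdapter):
    """Transport adapter whose pooled connections use TCP keepalive."""

    def init_poolmanager(self, *args, **kwargs):
        # Extra keyword arguments are forwarded to urllib3's PoolManager.
        kwargs["socket_options"] = keepalive_options()
        super().init_poolmanager(*args, **kwargs)


def make_session():
    """Return a requests.Session that uses keepalive for HTTP and HTTPS."""
    session = requests.Session()
    adapter = KeepAliveAdapter()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

With this, self.session = make_session() would leave other libraries that use urllib3 in the same process untouched.
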
Answered By: Always Sunny