Request proxies to access PyPI

Question:

I am trying to screen-scrape PyPI packages using the requests library and Beautiful Soup, but I am met with an indefinite hang. I am able to retrieve HTML from a number of sites with:

session = requests.Session()
session.trust_env = False
response = session.get("http://google.com")
print(response.status_code)

i.e. without providing headers. I read in Python request.get fails to get an answer for a url I can open on my browser that an indefinite hang is likely caused by incorrect headers. So, using the developer tools in Edge, I grabbed my request headers from the Network tab (with the "Doc" filter applied to select the pypi.org request/response) and simply copy-pasted them into the headers dictionary that is passed to the get method:

headers = {'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'session_id=<long string>',
'dnt': '1',
'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Microsoft Edge";v="108"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'document',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'none',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54'}

(and changed the get call to response = session.get("http://pypi.org", headers=headers))

But I get the same hang. So, I think something is wrong with my headers, but I’m not sure what. I’m aware that the requests Session() "handles" cookies, so I tried removing the cookie key/value pair from my request header dictionary, but achieved the same result.

How can I determine the problem with my headers and/or why do my current headers not work (assuming this is even the problem)?

Asked By: Sterling Butters


Answers:

I tried sending a simple HTTP request to see if this server requires any headers for a normal response.

So I opened a TCP socket and connected to the PyPI server, to see how requests would be handled by the server without the intervention of any framework. In addition, the socket is wrapped with the ssl module so that we can send encrypted traffic (HTTPS):

import socket
import ssl

hostname = 'pypi.org'
context = ssl.create_default_context()

payld = ("GET / HTTP/1.1rn"
         f"Host: {hostname}rnrn")
with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        text = payld
        ssock.sendall(text.encode())
        print(ssock.recv(40))

OUTPUT (It is only the first 40 bytes of the response, but we can see the status code, which is 200 OK):

b'HTTP/1.1 200 OK\r\nConnection: keep-alive\r'

As a result, we can conclude that the server does not require any particular headers in order to respond.

I recommend that you try this code.

  • If it works: Upgrade the version of the requests library, then try again (a version check snippet follows this list).
  • If it does not work: I’m guessing it’s a network or SSL verification issue.
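
To see which requests version is currently installed (a trivial check; the upgrade itself is typically done with pip):

import requests

# Print the installed requests version; if it is outdated, upgrade with
# `pip install --upgrade requests` and re-run the failing script.
print(requests.__version__)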
Answered By: Karen Petrosyan

HTTP headers are a possible issue, but not a likely one. A more probable cause is a proxy/firewall. I’ll start by recapping the information I think is relevant from the comments:

  • You are using a system, on which you do not have admin privileges.
  • The system is configured to use a corporate proxy server.
  • http://pypi.org works from your browser.
  • http://pypi.org works from a PowerShell on your system.
  • http://pypi.org hangs with your python code.
  • Your system is running Windows. (probably irrelevant, but might be worth noting)

As both your browser and PowerShell seem to work fine, and assuming you didn’t change their settings, why are you trying to circumvent the proxy using Python? (@vader asked this in the comments; I didn’t see a relevant response.)
If circumventing the proxy is material to your goal, skip ahead to the next section. If it isn’t, since other programs seem to work fine, I suggest trying with the proxy, using the system’s original configuration:

  1. Remove the session.trust_env = False statement from the code (a minimal sketch follows this list).
  2. Test the code now. If it works, our job is done. Otherwise, keep reading.
  3. Revert all system changes you’ve made while trying to make it work.
  4. Reboot your system.
    I myself hate it when someone suggests that to me, but there are two good reasons to do it: something might be stuck in the OS that a reboot will release, and I might not remember all the things I tinkered with, so a reboot might do the reverting for me.
  5. Test again: the script, a browser, and PowerShell (as per @yarin-007’s comment).
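
For reference, a minimal version of the original snippet with the trust_env line removed might look like this (the explicit timeout is my addition, so a hang surfaces as an exception rather than blocking forever):

import requests

session = requests.Session()
# trust_env defaults to True, so requests honors the system proxy
# configuration (HTTP_PROXY/HTTPS_PROXY and, on Windows, the settings
# exposed through urllib's getproxies()).
response = session.get("https://pypi.org", timeout=10)
print(response.status_code)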

If the script still hangs on requests to pypi, further analysis is required. In order to narrow down the options, I suggest the following:

  1. Disable redirects by setting allow_redirects=False. While requests should raise a TooManyRedirects exception if there is a redirect loop, this helps identify a case where a redirect target is hanging. pypi should redirect http to https regardless of user-agent or most other headers, which makes for a consistent, reliable request and limits other possible factors.
  2. Set a request timeout. The type of exception raised on timeout expiration can help identify the cause.

The following code provides a good example. For your code, don’t use the port numbers; the defaults should work. I added the port numbers explicitly, as each one demonstrates a different possible scenario:

#!/usr/bin/env python
import socket
import timeit
import requests

TIMEOUT = (4, 7)    # (connect timeout per IP address, read timeout)

def get_url(url, timeout=TIMEOUT):
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=False)
        print(f"Status code: {response.status_code}", end="")
        if response.status_code in (301, 302):
            print(f", Location: {response.headers.get('location')}", end="")
        print(".")
    except Exception as e:
        print(f"Exception caught: {e!r}")
    finally:
        print(f"Fetching url '{url}' done", end="")

def time_url(url):
    print(f"Trying url '{url}'")
    total = timeit.timeit(f"get_url('{url}')", number=1, globals=globals())
    print(f" in: {str(total)[:4]} seconds")
    print("=============")

def print_expected_conntimeout(server):
    r = socket.getaddrinfo(server, None, socket.AF_UNSPEC, socket.SOCK_STREAM)
    print(f"IP addresses of {server}:n" + "n".join(addr[-1][0] for addr in r))
    print(f"Got {len(r)} addresses, so expecting a a total ConnectTimeout of {len(r) * TIMEOUT[0]}")

def main():
    scheme = "http://"
    server = "pypi.org"
    uri = f"{scheme}{server}:{{port}}".format

    print_expected_conntimeout(server)
    # OK/redirect (301)
    time_url(uri(port=80))
    # READ TIMEOUT after 7s
    time_url(uri(port=8080))
    # CONNECTION TIMEOUT after 4 * ip_addresses
    time_url(uri(port=8082))
    # REJECT
    time_url('http://localhost:80')

if __name__ == "__main__":
    main()

For me, this outputs:

$ ./testnet.py
IP addresses of pypi.org:
151.101.128.223
151.101.0.223
151.101.64.223
151.101.192.223
Got 4 addresses, so expecting a total ConnectTimeout of 16
Trying url 'http://pypi.org:80'
Status code: 301, Location: https://pypi.org/.
Fetching url 'http://pypi.org:80' done in: 0.66 seconds
=============
Trying url 'http://pypi.org:8080'
Exception caught: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='pypi.org', port=8080): Read timed out. (read timeout=7)"))
Fetching url 'http://pypi.org:8080' done in: 7.21 seconds
=============
Trying url 'http://pypi.org:8082'
Exception caught: ConnectTimeout(MaxRetryError("HTTPConnectionPool(host='pypi.org', port=8082): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x103ec4730>, 'Connection to pypi.org timed out. (connect timeout=4)'))"))
Fetching url 'http://pypi.org:8082' done in: 16.0 seconds
=============
Trying url 'http://localhost:80'
Exception caught: ConnectionError(MaxRetryError("HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x103ec44c0>: Failed to establish a new connection: [Errno 61] Connection refused'))"))
Fetching url 'http://localhost:80' done in: 0.00 seconds
=============

Now to explain the four cases:

  1. A successful request to http://pypi.org returns a 301 redirect – to use https.
    This is what you should get. If it is what you do get after adding allow_redirects=False, then the prime suspect is the redirect chain, and I suggest checking each Location header’s value for every redirect response you receive, until you find the URL that hangs (a hop-by-hop sketch follows this list).
  2. Connection to port 8080 is successful (successful 3-way handshake), but the server does not return a proper response, and "hangs". requests raises a ReadTimeout exception.
    If your script raises this exception, it is likely that you are connecting to some sort of proxy which does not properly relay (or actively blocks) the request or the response. There might be some system setting other than trust_env controlling this, or some appliance attached to the network infrastructure.
  3. Connection to port 8082 is not successful; a 3-way handshake could not be established, and requests raises a ConnectTimeout exception. Note that a connection would be attempted to each IP address found, so the timeout of 4 seconds would be multiplied by the amount of addresses, overall.
    If this is what you see, it is likely that there is some firewall between your machine and pypi, which either prevents your SYN packets getting to their destination, or prevents the SYN+ACK packet getting back from the server to your machine.
  4. The fourth case is provided as an example, which I don’t believe you’ll encounter, but in case you do it is worth explaining.
    In this case, the SYN packet either reached a server which does not listen on the desired port (which would be weird, possibly meaning you didn’t really reach pypi), or that a firewall REJECTed your SYN packet (vs. simply DROPping it).
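
For case 1, here is a sketch of walking the redirect chain one hop at a time, so that a hanging hop is immediately visible (walk_redirects and the hop limit are my own names, not something from the question):

from urllib.parse import urljoin
import requests

def walk_redirects(url, max_hops=10, timeout=(4, 7)):
    """Follow redirects one hop at a time so a hanging hop is easy to spot."""
    for _ in range(max_hops):
        response = requests.get(url, allow_redirects=False, timeout=timeout)
        print(f"{response.status_code} {url}")
        if response.status_code not in (301, 302, 303, 307, 308):
            return response
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, response.headers["location"])
    raise RuntimeError(f"more than {max_hops} redirects")

walk_redirects("http://pypi.org")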

Another thing worth paying attention to is pypi’s IP addresses, as printed by the provided script. While IPv4 addresses are not guaranteed to keep their assignment, if you find they are significantly different, that would suggest you are not actually connecting to the real pypi servers, so the responses are unpredictable (including hangs). Following are pypi’s IPv4 and IPv6 addresses:

pypi.org has address 151.101.0.223
pypi.org has address 151.101.64.223
pypi.org has address 151.101.128.223
pypi.org has address 151.101.192.223
pypi.org has IPv6 address 2a04:4e42::223
pypi.org has IPv6 address 2a04:4e42:200::223
pypi.org has IPv6 address 2a04:4e42:400::223
pypi.org has IPv6 address 2a04:4e42:600::223

Finally, as we’ve touched on the different IP protocol versions, it is also possible that when initiating a connection, your system attempts to use a protocol which has a faulty route to the destination (e.g. trying IPv6, but one of the gateways mishandles that traffic). Usually a router would reply with an ICMP failure message, but I’ve seen cases where that doesn’t happen (or isn’t properly relayed back). I wasn’t able to determine the root cause, as the route was out of my control, but forcing a specific protocol solved that specific issue for me (a sketch follows).
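
If you suspect a faulty IPv6 route, one common workaround (an assumption of mine, not something verified against the asker's setup) is to monkey-patch urllib3's allowed_gai_family so that requests, which uses urllib3 under the hood, resolves IPv4 addresses only:

import socket
import requests
import urllib3.util.connection as urllib3_connection

def allowed_gai_family():
    # Restrict name resolution to IPv4; return socket.AF_INET6 to force IPv6.
    return socket.AF_INET

# urllib3 consults this function when opening new connections.
urllib3_connection.allowed_gai_family = allowed_gai_family

print(requests.get("https://pypi.org", timeout=10).status_code)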

Hoping this provides some good debugging vectors. If it helps, please add a comment, as I’m curious what you find.

Answered By: micromoses

Got it!

I just had to set the proxy variable in the get method:

import requests

headers = {'User-Agent': 'Chrome'}

proxies = {
  'http': 'xxxxxx:80',
  'https': 'xxxxxx:80',
}

def get_url(url):
    try:
        response = requests.get(url, timeout=10, allow_redirects=True, headers=headers, proxies=proxies)
        print(response.headers)
        print(response.text)
        print(response.history)
        print(f"Status code: {response.status_code}")
        if response.status_code in (301, 302):
            print(f"Location: {response.headers.get('location')}")
    except Exception as e:
        print(f"Exception caught: {e!r}")
    finally:
        print(f"Fetching url '{url}' done")
        

url = "http://pypi.org"
get_url(url)
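
Equivalently, since requests reads proxy settings from the standard environment variables when trust_env is left at its default, the proxy does not have to be hard-coded into every call (the xxxxxx placeholder stands for the real proxy host, as above):

import os
import requests

# Picked up automatically by requests when trust_env is True (the default).
os.environ["HTTP_PROXY"] = "http://xxxxxx:80"
os.environ["HTTPS_PROXY"] = "http://xxxxxx:80"

print(requests.get("http://pypi.org", timeout=10).status_code)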
Answered By: Sterling Butters

Are you sure?
Only the homepage of pypi raises that error, and you cannot scrape it in any case.
Do you have a firewall, or an HTTPS or SOCKS proxy?

I have taken the URL for all Python 3 packages, and this code works just fine:

import requests
from bs4 import BeautifulSoup
hdr={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5"}
session = requests.Session()
url="https://pypi.org/search/?c=Programming+Language+%3A%3A+Python+%3A%3A+3" # All Python 3 libraries
response = session.get(url,allow_redirects=True)
print(response.status_code)

> 200

Just to make sure, let’s scrape the package names to verify:

soup=BeautifulSoup(response.content,"lxml")
pkgs=soup.findAll('span',attrs={'class':'package-snippet__name'})
for i in pkgs:
    print(i.text)
>

yingyu-yueyueyue-201812-201909
github-actions-cicd-example
xurl
unkey
fluvio
LogicCircuit
knarrow
riyu-zhuanye-kaoyan-202203-202206
permutation
aliases
sangsangjun-202011-202101
resultify
subnuker
keke-yingyu-202101-202104
xuezhaofeng-beida-jingjixue
jingtong-jiaoben-heike
mrbenn-toolbar-plugin
liuwei-yasi-pindao-201811-201908
mypy-boto3-service-quotas
trender

These are the names of the Python 3 packages on page 1.

Answered By: geekay