Python Socket only returns Response header instead of HTML

Question:

I want to extract links from a website’s JavaScript. I’m trying to fetch the JS file using sockets, but I only ever get the response headers and not the actual JS/HTML. Here’s what I’m using:

import socket
import ssl

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cont = ssl.create_default_context()
sock.connect(('blog.clova.line.me', 443))
sock = cont.wrap_socket(sock, server_hostname = 'blog.clova.line.me')
sock.sendall('GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n'.encode())
resp = sock.recv(2048)
print(resp.decode('utf-8'))

It returns only the response headers:

HTTP/1.1 200 OK
Date: Tue, 06 Sep 2022 12:02:38 GMT
Content-Type: application/javascript
Transfer-Encoding: chunked
Connection: keep-alive
CF-Ray: 74670e8b9b594c2f-SIN
Age: 3444278
...

I have tried the following:

  1. Setting a Content-Type: text/plain; charset=utf-8 header
  2. Changing the request line to GET https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1

I’ve searched for related questions, and it seems that other people are able to receive the HTML data after the response headers, but I only receive the headers and not the HTML data. It does work with requests:

resp = requests.get('https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js')
print(resp.text)

How can I achieve a similar result using sockets? Honestly, I don’t like using 3rd-party modules; that’s why I’m not using requests.

Asked By: Xav


Answers:

The response is just truncated: sock.recv(2048) returns at most 2048 bytes per call, so a single call only gives you the first part of the response. If you read more bytes, you will see the body after the headers.
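
As a minimal sketch of what "reading more bytes" could look like: the two helpers below (read_until_closed and split_response are hypothetical names, not part of any library) keep calling recv until the peer closes the connection, then split the headers from the body at the first blank line. This assumes the request also carries a Connection: close header so that the server actually closes the socket when it is done; with keep-alive, the loop would block until a timeout.

```python
import socket


def read_until_closed(sock: socket.socket, bufsize: int = 4096) -> bytes:
    """Collect everything the peer sends until it closes the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # b"" means the peer closed its side
            break
        chunks.append(chunk)
    return b"".join(chunks)


def split_response(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw HTTP response into (headers, body) at the first blank line."""
    headers, _, body = raw.partition(b"\r\n\r\n")
    return headers, body
```

Note that if the server still answers with Transfer-Encoding: chunked, the body returned here includes the chunk-size lines and is not yet the plain file content.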

Anyway, I wouldn’t recommend doing that with such a low-level library.

“Honestly, I don’t like using 3rd-party modules; that’s why I’m not using requests.”

If your point is to stick to the Python standard library, you can use urllib.request, which provides more abstraction than socket:

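import urllib.request

# Import the submodule explicitly; a bare `import urllib` may not load urllib.request.
with urllib.request.urlopen('…') as resp:
    print(resp.read())  # the response body, as bytes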
Answered By: etuardu

From the Python documentation (Socket Programming HOWTO):

Now we come to the major stumbling block of sockets – send and recv
operate on the network buffers. They do not necessarily handle all the
bytes you hand them (or expect from them), because their major focus
is handling the network buffers. In general, they return when the
associated network buffers have been filled (send) or emptied (recv).
They then tell you how many bytes they handled. It is your
responsibility to call them again until your message has been
completely dealt with.

I’ve rewritten your code and added a receive_all function, which keeps calling recv until the whole response has arrived (of course, it’s a naive implementation):

import socket
import ssl

request_text = (
    "GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js "
    "HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n"
)

host_name = "blog.clova.line.me"


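def receive_all(sock):
    chunks: list[bytes] = []
    while True:
        chunk = sock.recv(2048)
        chunks.append(chunk)
        # Stop on the terminating zero-length chunk of a chunked response,
        # or when the peer closes the connection (recv returns b"").
        if not chunk or chunk.endswith(b"0\r\n\r\n"):
            break
    return b"".join(chunks)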



cont = ssl.create_default_context()
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    with cont.wrap_socket(sock, server_hostname=host_name) as ssock:
        ssock.connect((host_name, 443))
        ssock.sendall(request_text.encode())

        resp = receive_all(ssock)
        print(resp.decode("utf-8"))
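
One thing to keep in mind: because the server replies with Transfer-Encoding: chunked, the body printed above still contains the hexadecimal chunk-size lines between the pieces of the file. The following is a rough sketch of how that could be decoded. decode_chunked is a hypothetical helper that expects only the body bytes (everything after the blank line that ends the headers); it skips chunk extensions and ignores trailers, so treat it as an illustration rather than a complete parser.

```python
def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body (trailers are ignored)."""
    decoded = []
    pos = 0
    while True:
        # Each chunk starts with its size in hex, terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        size_field = body[pos:line_end].split(b";", 1)[0]  # drop chunk extensions
        size = int(size_field, 16)
        if size == 0:  # a zero-sized chunk marks the end of the body
            break
        start = line_end + 2
        decoded.append(body[start:start + size])
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return b"".join(decoded)
```

For example, the headers and body could first be separated with resp.partition(b"\r\n\r\n"), and the third element of the result passed to decode_chunked before decoding it as UTF-8.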
Answered By: S.B