Python Socket only returns Response header instead of HTML
Question:
I want to extract links from a website’s JS. Using sockets, I’m trying to fetch the JS file, but I only ever get the response headers and not the actual JS/HTML. Here’s what I’m using:
import socket
import ssl
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cont = ssl.create_default_context()
sock.connect(('blog.clova.line.me', 443))
sock = cont.wrap_socket(sock, server_hostname = 'blog.clova.line.me')
sock.sendall('GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n'.encode())
resp = sock.recv(2048)
print(resp.decode('utf-8'))
It returns only the response headers:
HTTP/1.1 200 OK
Date: Tue, 06 Sep 2022 12:02:38 GMT
Content-Type: application/javascript
Transfer-Encoding: chunked
Connection: keep-alive
CF-Ray: 74670e8b9b594c2f-SIN
Age: 3444278
...
I have tried the following:
- Setting a Content-Type: text/plain; charset=utf-8 header
- Changing the request line to GET https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js HTTP/1.1
I have searched for related questions, and it seems other people are able to receive the HTML data after the response headers, but I only receive the headers and not the HTML data. It does work with requests:
resp = requests.get('https://blog.clova.line.me/hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js')
print(resp.text)
How can I achieve a similar result using socket? Honestly, I don’t like using 3rd-party modules, which is why I’m not using requests.
Answers:
The response is just truncated: sock.recv(2048) reads at most 2048 bytes, and a single call returns only what has arrived so far. If you keep reading, you will see the body after the headers.
Anyway, I wouldn’t recommend doing that with such a low-level library.
“Honestly, I don’t like using 3rd-party module that’s why I’m not using requests.”
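If your point is to stick to the Python standard library, you can use urllib.request, which provides more abstraction than socket:
# Import the submodule explicitly; a bare "import urllib" does not guarantee urllib.request is loaded.
import urllib.request
req = urllib.request.urlopen('…')
print(req.read())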
From the documentation (Python’s Socket Programming HOWTO):
Now we come to the major stumbling block of sockets – send and recv
operate on the network buffers. They do not necessarily handle all the
bytes you hand them (or expect from them), because their major focus
is handling the network buffers. In general, they return when the
associated network buffers have been filled (send) or emptied (recv).
They then tell you how many bytes they handled. It is your
responsibility to call them again until your message has been
completely dealt with.
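To make the quoted point concrete, here is a minimal offline sketch that uses a local socket pair instead of a real server. The names recv_until_closed and demo are made up for illustration, and the 3000-byte payload and 1024-byte buffer are arbitrary choices. It assumes the sender signals the end of the message by shutting down its side, which is not what the keep-alive server in the question does.

```python
import socket


def recv_until_closed(sock: socket.socket, bufsize: int = 1024) -> bytes:
    """Call recv repeatedly until the peer closes its side of the connection."""
    chunks = []
    while True:
        chunk = sock.recv(bufsize)
        if not chunk:  # b"" means the peer has finished sending
            break
        chunks.append(chunk)
    return b"".join(chunks)


def demo() -> tuple[bytes, bytes, bytes]:
    """Send a payload over a local socket pair and read it back in two steps."""
    payload = b"x" * 3000
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)  # tell the receiver no more data follows
        first = receiver.recv(1024)  # one call, as in the question: at most 1024 bytes
        rest = recv_until_closed(receiver)  # keep reading to get the remainder
    return payload, first, rest


if __name__ == "__main__":
    payload, first, rest = demo()
    print(len(first), len(rest), first + rest == payload)
```

The single recv call hands back only part of the payload; the rest arrives only because the loop keeps calling recv until it returns an empty bytes object.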
I’ve rewritten your code and added a receive_all function that keeps calling recv until the whole response has arrived (of course, it’s a naive implementation):
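import socket
import ssl

request_text = (
    "GET /hs/hsstatic/HubspotToolsMenu/static-1.138/js/index.js "
    "HTTP/1.1\r\nHost: blog.clova.line.me\r\n\r\n"
)
host_name = "blog.clova.line.me"


def receive_all(sock):
    """Read until the terminating chunk of a chunked response arrives."""
    chunks: list[bytes] = []
    while True:
        chunk = sock.recv(2048)
        if not chunk:
            # The server closed the connection.
            break
        # Keep every read, including the one that carries the terminator.
        chunks.append(chunk)
        # Naive end-of-body check: a chunked body ends with "0\r\n\r\n".
        # Only the latest read is inspected, so a terminator split across
        # two reads is missed and the loop ends with a socket timeout.
        if chunk.endswith(b"0\r\n\r\n"):
            break
    return b"".join(chunks)


cont = ssl.create_default_context()
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    with cont.wrap_socket(sock, server_hostname=host_name) as ssock:
        ssock.connect((host_name, 443))
        ssock.sendall(request_text.encode())
        resp = receive_all(ssock)
        print(resp.decode("utf-8"))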
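Because the server answers with Transfer-Encoding: chunked, the bytes collected above still contain the headers and the hexadecimal chunk-size lines mixed in with the JS. The following is a rough sketch of how they could be stripped. The helper names split_head_and_body and decode_chunked are hypothetical, and the sketch assumes a complete, well-formed response; it ignores trailers and does no error handling.

```python
def split_head_and_body(raw: bytes) -> tuple[bytes, bytes]:
    """Split a raw HTTP response at the blank line that ends the headers."""
    head, _, body = raw.partition(b"\r\n\r\n")
    return head, body


def decode_chunked(body: bytes) -> bytes:
    """Remove chunked transfer framing and return the payload bytes."""
    decoded = []
    pos = 0
    while True:
        line_end = body.index(b"\r\n", pos)
        # The size line is hexadecimal and may carry ";extension" suffixes.
        size = int(body[pos:line_end].split(b";", 1)[0], 16)
        if size == 0:
            break  # a zero-length chunk marks the end of the body
        start = line_end + 2
        decoded.append(body[start:start + size])
        pos = start + size + 2  # skip the CRLF that follows the chunk data
    return b"".join(decoded)


if __name__ == "__main__":
    # Hand-written sample response, not real server output.
    sample = (
        b"HTTP/1.1 200 OK\r\n"
        b"Transfer-Encoding: chunked\r\n"
        b"\r\n"
        b"5\r\nhello\r\n"
        b"7\r\n, world\r\n"
        b"0\r\n\r\n"
    )
    head, body = split_head_and_body(sample)
    print(decode_chunked(body))  # prints b'hello, world'
```

With the code above, the idea would be to pass resp through split_head_and_body and then decode_chunked before decoding it as UTF-8. This bookkeeping is what urllib.request and requests handle for you.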