How to parse raw HTTP request in Python 3?
Question:
I am looking for a native way to parse an http request in Python 3.
This question shows a way to do it in Python 2, but it relies on now-deprecated modules (and on Python 2 itself), and I am looking for a way to do it in Python 3.
I would mainly like to figure out which resource is requested and parse the headers from a simple request, e.g.:
GET /index.html HTTP/1.1
Host: localhost
Connection: keep-alive
Cache-Control: max-age=0
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: en-US,en;q=0.8
Can someone show me a basic way to parse this request?
Answers:
Each of those header lines should be terminated by a carriage return followed by a newline (\r\n), and within each line the field name and value are separated by a colon. So assuming you already have the request as a string (called resp here), it should be as easy as:
fields = resp.split("\r\n")
fields = fields[1:]  # ignore the request line (GET / HTTP/1.1)
output = {}
for field in fields:
    key, value = field.split(':', 1)  # split each line into HTTP field name and value
    output[key] = value
Update 4/13
Using the example HTTP request from the linked post:
resp = ('GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
        'Host: www.google.com\r\n'
        'Connection: keep-alive\r\n'
        'Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\n'
        'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\n'
        'Accept-Encoding: gzip,deflate,sdch\r\n'
        'Avail-Dictionary: GeNLY2f-\r\n'
        'Accept-Language: en-US,en;q=0.8\r\n')
fields = resp.split("\r\n")
fields = fields[1:]  # ignore the request line (GET / HTTP/1.1)
output = {}
for field in fields:
    if not field:  # skip the empty string left by the trailing \r\n
        continue
    key, value = field.split(':', 1)
    output[key] = value
print(output)
An additional check to make sure field is not empty is needed, because the trailing \r\n leaves an empty string at the end of the split. Output:
{'Host': ' www.google.com', 'Connection': ' keep-alive', 'Accept': ' application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5', 'User-Agent': ' Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13', 'Accept-Encoding': ' gzip,deflate,sdch', 'Avail-Dictionary': ' GeNLY2f-', 'Accept-Language': ' en-US,en;q=0.8'}
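The loop above skips the request line, although the question also asks which resource is requested, and it leaves a leading space in every value. The following is only a minimal sketch of one way to cover both, assuming a well-formed request; the function name parse_request and its return shape are made up for illustration and are not part of any library.

```python
def parse_request(raw):
    """Hypothetical helper: split a raw HTTP request string into its parts."""
    # The header section ends at the first blank line; anything after it is the body.
    head, _, body = raw.partition("\r\n\r\n")
    request_line, *header_lines = head.split("\r\n")

    # The request line has the form: METHOD SP request-target SP HTTP-version
    method, target, version = request_line.split(" ", 2)

    headers = {}
    for line in header_lines:
        if not line:
            continue  # tolerate a trailing empty line
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so store them lower-cased,
        # and drop the optional whitespace around the value.
        headers[name.strip().lower()] = value.strip()
    return method, target, version, headers, body


raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Connection: keep-alive\r\n"
    "\r\n"
)
method, target, version, headers, body = parse_request(raw)
print(method, target, version)
print(headers["host"])
```

Note that a plain dict keeps only the last value when a field name appears more than once, so this is not a substitute for a real parser.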
You could use the email.message.Message class from the email module in the standard library.
Adapting the answer from the question you linked, below is a Python 3 example of parsing HTTP headers.
Suppose you wanted to create a dictionary containing all of your header fields:
import email
import pprint
request_string = 'GET / HTTP/1.1\r\nHost: localhost\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, sdch\r\nAccept-Language: en-US,en;q=0.8'
# pop the first line so we only process headers
_, headers = request_string.split('\r\n', 1)
# construct a message from the header block. note: the return is already a dict-like object.
message = email.message_from_string(headers)
# construct a dictionary containing the headers
headers = dict(message.items())
# pretty-print the dictionary of headers
pprint.pprint(headers, width=160)
If you ran this at a Python prompt, the result would look like:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Encoding': 'gzip, deflate, sdch',
 'Accept-Language': 'en-US,en;q=0.8',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'localhost',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
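Since the parsed message is already dict-like, converting it to a dict is optional, and keeping the message has two advantages: lookups ignore the case of the field name, and get_all returns every value of a repeated field. The sketch below illustrates this and also keeps the request line instead of discarding it; the sample request with a repeated Accept-Encoding field is invented for illustration.

```python
import email

# Sample request invented for this sketch; Accept-Encoding is repeated on purpose.
raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: localhost\r\n"
    "Accept-Encoding: gzip\r\n"
    "Accept-Encoding: deflate\r\n"
    "\r\n"
)

# Separate the request line from the header block, then split it into its three parts.
request_line, header_block = raw.split("\r\n", 1)
method, target, version = request_line.split(" ", 2)

message = email.message_from_string(header_block)

# Lookups on the message ignore the case of the field name.
host = message["host"]
# get_all returns every value of a repeated field, in order of appearance.
encodings = message.get_all("Accept-Encoding")

print(method, target, version)
print(host)
print(encodings)
```

By contrast, dict(message.items()) would keep only one of the two Accept-Encoding values.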
Here are some Python packages aimed at proper HTTP protocol parsing:
- https://dpkt.readthedocs.io/en/latest/api/api_auto.html#module-dpkt.http
- https://h11.readthedocs.io/en/latest/
- https://github.com/benoitc/http-parser/ (C backend)
- https://github.com/MagicStack/httptools (based on NodeJS’s C backend)
- https://github.com/silentsignal/netlib-offline (shameless plug)
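If adding a dependency is not an option, the standard library's http.server module may be enough for simple cases. The sketch below reuses the request parsing of BaseHTTPRequestHandler without opening a socket; the class name HTTPRequest and the error_code and error_message attributes are assumptions made for this example, not part of the library.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    """Hypothetical helper: parse raw request bytes without a live connection."""

    def __init__(self, request_bytes):
        # Deliberately skip the parent constructor, which expects a socket.
        self.rfile = BytesIO(request_bytes)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        # Record the problem instead of writing an error response to a client.
        self.error_code = code
        self.error_message = message


request = HTTPRequest(
    b"GET /index.html HTTP/1.1\r\n"
    b"Host: localhost\r\n"
    b"Connection: keep-alive\r\n"
    b"\r\n"
)
print(request.command, request.path, request.request_version)
print(request.headers["Host"])
print(request.error_code)
```

After parsing, command, path, request_version and headers are the attributes that parse_request sets; error_code stays None unless the request line was rejected.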