Get html using Python requests?
Question:
I am trying to teach myself some basic web scraping. Using Python’s requests module, I was able to grab html for various websites until I tried this:
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
Instead of the basic html that is the source for this page, I get:
>>> r.text
'\x1f\ufffd\x08\x00\x00\x00\x00\x00\x00\x03\ufffd]o\u06f8\x12\ufffd\ufffd\ufffd+\ufffd]...
>>> r.content
b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\x9d]o\xdb\xb8\x12\x86\xef\xfb+\x88]\x14h...
I have tried many combinations of get/post with every syntax I can guess from the documentation and from SO and other examples. I don’t understand what I am seeing above, haven’t been able to turn it into anything I can read, and can’t figure out how to get what I actually want. My question is, how do I get the html for the above page?
Answers:
The server in question is giving you a gzipped response. The server is also very broken; it sends the following headers:
$ curl -D - -o /dev/null -s -H 'Accept-Encoding: gzip, deflate' http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F
HTTP/1.1 200 OK
Date: Tue, 06 Jan 2015 17:46:49 GMT
Server: Apache
<!DOCTYPE HTML PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"><html lang="en-US">
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 3659
Content-Type: text/html
The <!DOCTYPE..> line there is not a valid HTTP header, so all headers past Server are ignored. Why the server interjects it is unclear; in all likelihood WRCCWrappers.py is a CGI script that doesn't output headers but does include a double newline after the doctype line, duping the Apache server into inserting additional headers there.
Because the Content-Encoding header is lost, requests also doesn't detect that the data is gzip-encoded. The data is all there; you just have to decode it. Or you could, if it weren't rather incomplete.
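If you did want to unpack such a response by hand, a zlib decompressobj can inflate whatever part of the gzip stream actually arrived. This is a minimal sketch against the server as it behaved at the time; plain zlib.decompress() would raise an error on the truncated stream:
import zlib
import requests

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = requests.get(url)

# The bogus doctype line swallowed the later headers, so requests saw
# no Content-Encoding and left the body compressed.
if 'content-encoding' not in r.headers:
    # wbits=47 (32 + 15) tells zlib to auto-detect the gzip header;
    # a decompressobj returns whatever it can from an incomplete stream.
    d = zlib.decompressobj(wbits=47)
    print(d.decompress(r.content).decode('utf-8', errors='replace')[:100])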
The work-around is to tell the server not to bother with compression:
import requests

headers = {'Accept-Encoding': 'identity'}  # ask the server not to compress
r = requests.get(url, headers=headers)
and an uncompressed response is returned.
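If you are going to make several requests to the same host, a requests.Session can carry that header for every call. A small generic sketch, not specific to this server:
import requests

session = requests.Session()
session.headers['Accept-Encoding'] = 'identity'  # sent with every request

url = ('http://www.wrcc.dri.edu/WRCCWrappers.py'
       '?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
r = session.get(url)
print(r.text[:100])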
Incidentally, on Python 2 the HTTP header parser is not so strict and manages to declare the doctype a header:
>>> pprint(dict(r.headers))
{'<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "dtd/xhtml1-transitional.dtd"><html lang="en-US">': '',
'connection': 'Keep-Alive',
'content-encoding': 'gzip',
'content-length': '3659',
'content-type': 'text/html',
'date': 'Tue, 06 Jan 2015 17:42:06 GMT',
'keep-alive': 'timeout=5, max=100',
'server': 'Apache',
'vary': 'Accept-Encoding'}
and the content-encoding information survives, so there requests decodes the content for you, as expected.
The HTTP headers for this URL have now been fixed.
>>> import requests
>>> print requests.__version__
2.5.1
>>> r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
>>> r.text[:100]
u'\n<!DOCTYPE html>\n<HTML>\n<HEAD><TITLE>Monthly Average of Precipitation, Station id: 028815</TITLE></H'
>>> r.headers
{'content-length': '3672', 'content-encoding': 'gzip', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'server': 'Apache', 'connection': 'Keep-Alive', 'date': 'Thu, 12 Feb 2015 18:59:37 GMT', 'content-type': 'text/html; charset=utf-8'}
I'd solve that problem in a simpler way: just import the html library to decode HTML special characters:
import html
r = requests.get('http://www.wrcc.dri.edu/WRCCWrappers.py?sodxtrmts+028815+por+por+pcpn+none+mave+5+01+F')
print(html.unescape(r.text))
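For reference, html.unescape converts HTML entities such as &amp; and &lt; back into literal characters:
import html

print(html.unescape('Monthly &amp; Daily &lt;averages&gt;'))
# Monthly & Daily <averages>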
Here is an example using the BeautifulSoup library. It "makes it easy to scrape information from web pages."
from bs4 import BeautifulSoup
import requests
# request web page
resp = requests.get("http://example.com")
# get the response text. in this case it is HTML
html = resp.text
# parse the HTML
soup = BeautifulSoup(html, "html.parser")
# print the HTML as text
print(soup.body.get_text().strip())
and the result:
Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
More information...
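Once the page is parsed, you can also pull out individual elements instead of all the text. For instance, to read the first link on the page (continuing with the same soup object):
# find the first <a> tag and read its text and href attribute
link = soup.find('a')
print(link.get_text())   # More information...
print(link['href'])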