Python web-scraping error – TypeError: can't use a string pattern on a bytes-like object
Question:
I want to build a web scraper. Currently, I’m learning Python. This is the very basics!
Python Code
import urllib.request
import re
htmlfile = urllib.request.urlopen("http://basketball.realgm.com/")
htmltext = htmlfile.read()
title = re.findall('<title>(.*)</title>', htmltext)
print (htmltext)
Error:
File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Answers:
Use a bytes literal as the pattern:
title = re.findall(b'<title>(.*)</title>', htmltext)
Or decode the retrieved data to a string first:
title = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))
(Replace utf-8 with the actual encoding of the document.)
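To see the difference between the two options without hitting the network, here is a minimal sketch. The sample bytes are a made-up stand-in for whatever htmlfile.read() returns, not the real page content.

```python
import re

# Hypothetical stand-in for the bytes returned by htmlfile.read()
htmltext = b"<html><head><title>Sample Page</title></head></html>"

# Option 1: bytes pattern against bytes data; the matches are bytes
title_bytes = re.findall(b'<title>(.*)</title>', htmltext)

# Option 2: decode first, then use an ordinary str pattern; the matches are str
title_str = re.findall('<title>(.*)</title>', htmltext.decode('utf-8'))

print(title_bytes)  # [b'Sample Page']
print(title_str)    # ['Sample Page']
```

The rule is that the pattern and the data must be the same type: both bytes or both str. Decoding is usually the more convenient choice if you plan to do further text processing on the result.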
You have to decode your data. Since the website in question declares charset=iso-8859-1, use that encoding; utf-8 won't work in this case.
htmltext = htmlfile.read().decode('iso-8859-1')
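Rather than hard-coding the encoding, you can often read it from the Content-Type response header. The following is only a sketch: decode_body and fetch_title are hypothetical helper names, and falling back to iso-8859-1 when the server declares no charset is an assumption that suits this site, not a general rule.

```python
import re
import urllib.request


def decode_body(raw, charset=None, fallback='iso-8859-1'):
    """Decode raw bytes with the declared charset, or the fallback if none is given."""
    # errors='replace' substitutes undecodable bytes instead of raising
    return raw.decode(charset or fallback, errors='replace')


def fetch_title(url):
    """Fetch a page and return the list of <title> matches as str."""
    with urllib.request.urlopen(url) as response:
        # get_content_charset() returns None when the header declares no charset
        charset = response.headers.get_content_charset()
        text = decode_body(response.read(), charset)
    return re.findall('<title>(.*)</title>', text)
```

You would call it as fetch_title("http://basketball.realgm.com/"). Note that some pages declare their encoding only in a meta tag inside the HTML, in which case the header lookup returns None and the fallback is used.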