Why do I get a "Connection aborted" error when trying to crawl a specific website?
Question:
I wrote a web crawler in Python 2.7, but one specific site cannot be downloaded even though it can be viewed in a browser.
My code is as follows:
# -*- coding: utf-8 -*-
import requests

# OK
url = 'http://blog.ithome.com.tw/'
url = 'http://7club.ithome.com.tw/'
url = 'https://member.ithome.com.tw/'
url = 'http://ithome.com.tw/'
url = 'http://weekly.ithome.com.tw'
# NOT OK
url = 'http://download.ithome.com.tw'
url = 'http://apphome.ithome.com.tw/'
url = 'http://ithelp.ithome.com.tw/'

try:
    response = requests.get(url)
    print 'OK!'
    print 'response.status_code: %s' % (response.status_code)
except Exception, e:
    print 'NOT OK!'
    print 'Error: %s' % (e)
print 'DONE!'
print 'response.status_code: %s' % (response.status_code)
Each time I try, I get this error:
C:\Python27\python.exe "E:/python crawler/test_ConnectionFailed.py"
NOT OK!
Error: ('Connection aborted.', BadStatusLine("''",))
DONE!
Traceback (most recent call last):
File "E:/python crawler/test_ConnectionFailed.py", line 29, in <module>
print 'response.status_code: %s' %(response.status_code)
NameError: name 'response' is not defined
Process finished with exit code 1
Why is this happening and how can I fix it?
SOLVED! I switched to different proxy software and it worked.
Answers:
The hostnames for those domains could not be resolved; a normal ping against them shows this.
Command to run (note that ping takes a bare hostname, not a URL):
ping download.ithome.com.tw
Result:
The host could not be resolved
With no response there is no status line, which in normal cases would contain the status code; that is exactly what the BadStatusLine error reports.
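A side effect visible in the question's traceback: when the request fails, the final print also raises a NameError, because response is only ever bound inside the try block. A minimal sketch of how to guard against that, using a hypothetical fetch function that always fails (standing in for requests.get against the NOT OK hosts):

```python
def fetch(url):
    # hypothetical stand-in for requests.get; it always fails,
    # the way the NOT OK hosts do in the question
    raise IOError("('Connection aborted.', BadStatusLine(\"''\",))")

url = 'http://download.ithome.com.tw'
response = None  # bind the name up front so code after the try can test it safely

try:
    response = fetch(url)
except Exception as e:
    print('NOT OK! Error: %s' % e)

# guard instead of reading response.status_code unconditionally
if response is None:
    print('No response object; the request never completed.')
```

With this pattern the script still reports the failure but exits cleanly instead of crashing on the final print.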
I found that the urllib2 library works better than requests here.
import urllib2

def get_page(url):
    request = urllib2.Request(url)
    response = urllib2.urlopen(request)
    data = response.read()
    return data

url = "http://blog.ithome.com.tw/"
print get_page(url)
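Note that urllib2 exists only on Python 2; on Python 3 the same API lives in urllib.request. A rough equivalent sketch (the User-Agent header is my addition, since some servers close the connection on clients that send none, which can surface as exactly this kind of aborted-connection error):

```python
from urllib.request import Request, urlopen

def get_page(url):
    # send a browser-like User-Agent; some servers drop bare clients
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    with urlopen(req, timeout=10) as resp:
        return resp.read()
```

get_page returns raw bytes, so decode them (for example with resp.headers.get_content_charset()) if you need text.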