Python web scraping gives wrong source code
Question:
I want to extract some data from Amazon (the link is in the code below).
Here is my code:
import urllib2
url="http://www.amazon.com/s/ref=sr_nr_n_11?rh=n%3A283155%2Cn%3A%2144258011%2Cn%3A2205237011%2Cp_n_feature_browse-bin%3A2656020011%2Cn%3A173507&bbn=2205237011&sort=titlerank&ie=UTF8&qid=1393984161&rnid=1000"
webpage=urllib2.urlopen(url).read()
doc=open("test.html","w")
doc.write(webpage)
doc.close()
When I open test.html, its content is different from what the website shows in a browser.
Answers:
The page relies on JavaScript execution. urllib2.urlopen(...).read() simply reads the raw content returned for the URL and does not run any scripts, so the saved file differs from what the browser displays.
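As a rough illustration of that difference, the Python 3 sketch below (where urllib2 has become urllib.request) parses a made-up raw response with the standard-library HTMLParser. The RAW_HTML string, the ScriptCounter class and the summarize helper are hypothetical names for this example and are not taken from the Amazon page. The point is only that the script arrives as plain text and the container it would fill stays empty.

```python
from html.parser import HTMLParser

# Hypothetical raw response, standing in for what a plain urlopen(url).read()
# would return for a page that builds its content with JavaScript.
RAW_HTML = """<html><body>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<p>Item 1</p>";
</script>
</body></html>"""


class ScriptCounter(HTMLParser):
    """Counts <script> tags and collects the text found outside them."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.in_script = False
        self.visible_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Script bodies are delivered as data but never executed by the parser.
        if not self.in_script and data.strip():
            self.visible_text.append(data.strip())


def summarize(raw_html):
    """Return (number of script tags, list of text chunks outside scripts)."""
    parser = ScriptCounter()
    parser.feed(raw_html)
    parser.close()
    return parser.script_count, parser.visible_text


script_count, visible_text = summarize(RAW_HTML)
print(script_count, visible_text)  # prints: 1 []
```

One script and no visible text suggests that the content is filled in by the browser, so a plain fetch will not match what you see on screen.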
To get the same content, you need a library that can execute JavaScript. For example, the following code uses Selenium:
from selenium import webdriver
url = 'http://www.amazon.com/s/ref=sr_nr_n_11?...161&rnid=1000'
driver = webdriver.Firefox()
driver.get(url)
with open('test.html', 'w') as f:
    f.write(driver.page_source.encode('utf-8'))
driver.quit()
To complete falsetru’s answer: another solution is python-ghost, which is based on Qt. It is much heavier to install, so I would also recommend Selenium.
Using Firefox opens a browser window every time the script runs. To keep it out of your way, use PhantomJS:
apt-get install nodejs # you get npm, the Node Package Manager
npm install -g phantomjs # install globally
[…]
driver = webdriver.PhantomJS()
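If PhantomJS is not an option (newer Selenium releases no longer include a PhantomJS driver), Firefox itself can run without a window. The sketch below swaps in headless Firefox for PhantomJS. It assumes Python 3 and Selenium 4 with Firefox and geckodriver available. render_page and save_html are hypothetical helper names, not part of Selenium.

```python
import io


def render_page(url):
    """Load url in headless Firefox and return the rendered HTML as text."""
    # Imported here so save_html can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


def save_html(html, path):
    """Write the page text to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(html)
```

With those two helpers, save_html(render_page(url), "test.html") would play the same role as the Firefox example above, without a visible browser.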