screen-scraping

Executing Javascript from Python

Question: I have HTML webpages that I am crawling using XPath. The etree.tostring of a certain node gives me this string: <script> <!-- function escramble_758(){ var a,b,c a='+1 ' b='84-' a+='425-' b+='7450' c='9' document.write(a+c+b) } escramble_758() //--> </script> I just need the output of escramble_758(). I can write a regex to …

Total answers: 7
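
For the escramble_758() excerpt above, a minimal sketch of the regex route the asker mentions, assuming the script only ever uses quoted string assignments (= and +=) followed by a single document.write call. The function name evaluate_escramble and the SCRIPT constant are hypothetical, introduced here for illustration; a real JavaScript engine would be needed for anything more complex.

```python
import re

# The script body from the question, with plain ASCII quotes.
SCRIPT = """
function escramble_758(){
var a,b,c
a='+1 '
b='84-'
a+='425-'
b+='7450'
c='9'
document.write(a+c+b)
}
escramble_758()
"""


def evaluate_escramble(script):
    """Replay simple string assignments and return what document.write would emit.

    Hypothetical helper: it only understands name='text' and name+='text'
    statements and one document.write(x+y+...) call.
    """
    variables = {}
    # Walk the assignments in source order so that += sees earlier values.
    for name, op, value in re.findall(r"(\w+)\s*(\+?=)\s*'([^']*)'", script):
        if op == "+=":
            variables[name] = variables.get(name, "") + value
        else:
            variables[name] = value

    match = re.search(r"document\.write\(([^)]*)\)", script)
    if match is None:
        raise ValueError("no document.write call found")

    # Concatenate the variables in the order they appear in the call.
    return "".join(variables[part.strip()] for part in match.group(1).split("+"))


if __name__ == "__main__":
    print(evaluate_escramble(SCRIPT))
```

This approach is brittle by design: it breaks as soon as the page's script uses anything beyond plain string concatenation.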

Can scrapy be used to scrape dynamic content from websites that are using AJAX?

Question: I have recently been learning Python and am dipping my hand into building a web-scraper. It’s nothing fancy at all; its only purpose is to get the data off of a betting website and have this data put into Excel. …

Total answers: 10

Scraping and parsing Google search results using Python

Question: I asked a question on realizing a general idea to crawl and save webpages. Part of the original question is: how to crawl and save a lot of “About” pages from the Internet. With some further research, I got some choices to go ahead with both …

Total answers: 10

Headless Browser for Python (Javascript support REQUIRED!)

Question: I need a headless browser which is fairly easy to use (I am still fairly new to Python and programming in general) which will allow me to navigate to a page, log into a form that requires Javascript, and then scrape the resulting web page by searching …

Total answers: 6

Best way for a beginner to learn screen scraping by Python

Question: This might be one of those questions that are difficult to answer, but here goes: I don’t consider myself a programmer – but I would like to 🙂 I’ve learned R, because I was sick and tired of SPSS, and because a friend …

Total answers: 6

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Question: Is there a way to get around the following? httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt Is the only way around this to contact the site owner (barnesandnoble.com)? I’m building a site that would bring them more sales, not sure why they would …

Total answers: 8

Why is python decode replacing more than the invalid bytes from an encoded string?

Question: Trying to decode an invalidly encoded UTF-8 HTML page gives different results in Python, Firefox and Chrome. The invalid encoded fragment from the test page looks like 'PREFIX\xe3\xabSUFFIX' >>> fragment = 'PREFIX\xe3\xabSUFFIX' >>> fragment.decode('utf-8', 'strict') … UnicodeDecodeError: 'utf8' codec can't decode …

Total answers: 4
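
To make the decoding excerpt above concrete, here is a small sketch that runs the same fragment through the strict, replace and ignore error handlers. It is written for Python 3, where the fragment has to be a bytes literal, whereas the question's session appears to be Python 2, so the exact replacement behaviour may differ from what the asker saw. The helper name decode_report is hypothetical.

```python
# The invalid fragment from the question: \xe3 opens a three-byte UTF-8
# sequence, but only one continuation byte (\xab) follows before "SUFFIX".
FRAGMENT = b"PREFIX\xe3\xabSUFFIX"


def decode_report(data):
    """Return (strict_error_span, replaced_text, ignored_text) for UTF-8 data.

    Hypothetical helper for comparing the standard codec error handlers.
    """
    try:
        data.decode("utf-8", "strict")
        strict_error = None
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the decoder rejected.
        strict_error = (exc.start, exc.end)

    replaced = data.decode("utf-8", "replace")  # bad bytes become U+FFFD
    ignored = data.decode("utf-8", "ignore")    # bad bytes are dropped
    return strict_error, replaced, ignored


if __name__ == "__main__":
    span, replaced, ignored = decode_report(FRAGMENT)
    print("strict failed at bytes:", span)
    print("replace:", ascii(replaced))
    print("ignore: ", ascii(ignored))
```

Printing with ascii() keeps the replacement character visible as an escape, which makes it easier to see exactly which bytes each handler consumed.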

Web scraping with Python

Question: I’d like to grab daily sunrise/sunset times from a web site. Is it possible to scrape web content with Python? What modules are used? Is there any tutorial available? Asked by: eozzy. Answers: You can use urllib2 to make the HTTP requests, and then you’ll have web …

Total answers: 10

How to download any(!) webpage with correct charset in python?

Question: Problem: When screen-scraping a webpage using Python one has to know the character encoding of the page. If you get the character encoding wrong, then your output will be messed up. People usually use some rudimentary technique to detect the encoding. They either use …

Total answers: 7
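
As an illustration of the charset problem in the excerpt above, here is a minimal sketch of one simple detection strategy: check the HTTP Content-Type header first, then fall back to a <meta> declaration near the top of the document. This is not necessarily the technique the truncated question goes on to describe. The function name guess_charset and the UTF-8 default are assumptions made for this example.

```python
import re


def guess_charset(content_type, body, default="utf-8"):
    """Guess a page's encoding from its Content-Type header or a <meta> tag.

    Hypothetical helper: content_type is the raw header value (or None),
    body is the undecoded response as bytes.
    """
    # 1. An explicit charset parameter in the HTTP header wins.
    match = re.search(r"charset=[\"']?([\w-]+)", content_type or "", re.IGNORECASE)
    if match:
        return match.group(1).lower()

    # 2. Otherwise look for a charset declaration in an early <meta> tag.
    #    This covers both <meta charset="..."> and the http-equiv form.
    match = re.search(
        rb"""<meta[^>]+charset=["']?([\w-]+)""", body[:2048], re.IGNORECASE
    )
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. Nothing declared: fall back to the assumed default.
    return default


if __name__ == "__main__":
    page = b'<html><head><meta charset="ISO-8859-1"></head><body>caf\xe9</body></html>'
    encoding = guess_charset("text/html", page)
    print(encoding, page.decode(encoding))
```

Declared encodings can be wrong, so a more robust scraper would usually combine this with a statistical detector or a decode-and-verify step.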