Scraping a JavaScript-rendered HTML page in Python

Question:

I am scraping a website using Python, but the website is rendered with JavaScript and all the links are generated by JavaScript. So when I use requests.get(url), it only gives me the static source code, not the links that are generated with JavaScript. Is there any way to scrape those links automatically?
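
To illustrate the problem, here is a minimal sketch (the URL and link text are hypothetical): the raw response from requests contains only the HTML the server sent, so anything the page builds client-side with JavaScript never appears in it.

import requests

# Fetch only the static source; no JavaScript is executed here,
# so links the page injects client-side will be missing.
html = requests.get('https://example.com').text
print('js-generated-link' in html)  # False: the JS never ran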

I also tried something like what’s described here: Ultimate guide for scraping JavaScript rendered web pages. But that approach is too slow.

So is there a faster way, using Mechanize, PhantomJS, or some other library?
(Note: I have already tried PyQt4, but that was also too slow; I’m looking for a faster solution.)

Asked By: user6184405


Answers:

You can try PhantomJS or CasperJS.

There are several Node wrappers written over PhantomJS and CasperJS; one of the most efficient and scalable is “ghost town”.
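
For example, PhantomJS can be driven from Python through Selenium. A minimal sketch (this assumes the phantomjs binary is on your PATH; note that PhantomJS support was removed in Selenium 4, so this requires Selenium 3.x or earlier):

from selenium import webdriver

driver = webdriver.PhantomJS()  # headless WebKit browser, no GUI needed
driver.get('https://example.com')
rendered_html = driver.page_source  # HTML after the JavaScript has run
driver.quit()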

Answered By: Harshit Anand

One approach that may not be the fastest, but is most likely to succeed, is to use Selenium. The following function should do the job: given a URL that holds JavaScript-generated content, it retrieves the dynamic website and returns its rendered HTML. Note that instead of Chrome you can use any other supported browser (e.g., Firefox, Safari or IE). Have a look at the docs:

https://www.selenium.dev/selenium/docs/api/py/api.html#

import time

from selenium import webdriver
from selenium.webdriver.chrome.service import Service


def retrieve_html_from_js_website(url, path_to_chromedriver, threshold_waiting_time=4):
    """Load `url` in Chrome, wait for the JavaScript to run, and return the rendered HTML."""
    user_agent = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/50.0.2661.102 Safari/537.36')
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option('excludeSwitches', ['enable-automation'])

    with webdriver.Chrome(service=Service(path_to_chromedriver), options=options) as driver:
        # Note that there are many creative websites that use mechanisms
        # to prevent browsers instantiated with Selenium from crawling
        # their content. Some mechanisms are listed in the following:
        # https://piprogramming.org/articles/How-to-make-Selenium-undetectable-and-stealth--7-Ways-to-hide-your-Bot-Automation-from-Detection-0000000017.html
        driver.get(url)
        time.sleep(threshold_waiting_time)  # give the page's JavaScript time to render
        return driver.page_source
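
A call might look like this (the chromedriver path below is just an illustrative assumption; use the path where your driver is actually installed):

rendered_html = retrieve_html_from_js_website(
    'https://example.com',
    path_to_chromedriver='/usr/local/bin/chromedriver',
)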

From here you can perform any parsing operation, such as extracting the JavaScript-generated URLs. For this particular task, I prefer using Beautiful Soup, although Selenium can do the job as well:

https://beautiful-soup-4.readthedocs.io/en/latest/
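
As a minimal sketch (assuming the links of interest end up in ordinary <a href=...> tags once the page has rendered, and reusing the rendered_html from above):

from bs4 import BeautifulSoup

soup = BeautifulSoup(rendered_html, 'html.parser')
# Collect the href of every anchor tag in the rendered page.
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links)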

Answered By: NeuroMorphing