Scraping javascript rendered HTML page in python
Question:
I am scraping a website using Python, but the site is rendered with JavaScript and all of its links are generated by JavaScript. So when I use requests.get(url), it only gives me the initial source code, not the links that are generated with JavaScript. Is there any way to scrape those links automatically?
I also tried something like what's described here: Ultimate guide for scraping JavaScript rendered web pages. But that approach is too slow.
So is there any faster way, using Mechanize, PhantomJS, or some other library?
(Note: I have already tried PyQt4, but that is too slow as well; I'm looking for a faster solution.)
Answers:
You can try PhantomJS or CasperJS.
There are also Node wrappers built on top of PhantomJS and CasperJS; one of the most efficient and scalable is "ghost-town".
One approach that may not be the fastest, but is the most likely to succeed, is to use Selenium. The following function should do the job: given a URL that holds JavaScript-generated content, it retrieves the dynamic website and returns its rendered HTML. Note that instead of Chrome you can use any other supported browser (e.g., Firefox, Safari, or IE). Have a look at the docs:
https://www.selenium.dev/selenium/docs/api/py/api.html#
def retrieve_html_from_js_website(url, path_to_chrome_binary, threshold_waiting_time=4):
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Spoof a regular browser user agent so requests look less like a bot.
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={user_agent}')
    options.add_argument('--disable-blink-features=AutomationControlled')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("detach", True)
    # Note: Service expects the path to the chromedriver executable,
    # not the Chrome browser binary itself.
    with webdriver.Chrome(service=Service(path_to_chrome_binary), options=options) as driver:
        # Note that there are many creative websites that use mechanisms
        # to prevent browsers instantiated with Selenium from crawling
        # their content. Some mechanisms are listed in the following:
        # https://piprogramming.org/articles/How-to-make-Selenium-undetectable-and-stealth--7-Ways-to-hide-your-Bot-Automation-from-Detection-0000000017.html
        driver.get(url)
        # Give the JavaScript time to render before grabbing the page source.
        time.sleep(threshold_waiting_time)
        return driver.page_source
From here you can perform any parsing operation, such as extracting the JavaScript-generated URLs. For this particular task I prefer Beautiful Soup, although Selenium can do the job as well.
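As a sketch of that last step, here is a minimal link extractor. It uses Python's built-in html.parser so it runs without extra dependencies; with Beautiful Soup the equivalent would be soup.find_all('a', href=True). The sample HTML string below is just a stand-in for the page_source returned by the function above.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)

# Stand-in for the rendered HTML returned by retrieve_html_from_js_website().
rendered_html = '''
<html><body>
  <a href="/page1">One</a>
  <a href="https://example.com/page2">Two</a>
  <a name="anchor-without-href">Three</a>
</body></html>
'''

parser = LinkExtractor()
parser.feed(rendered_html)
print(parser.links)  # ['/page1', 'https://example.com/page2']
```

Anchors without an href (like the third tag above) are skipped, so the result contains only followable links.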