How to scrape websites that have loaders?

Question:

I’m trying to scrape a website that shows a loading screen. When I open it in a browser, it displays "loading.." for a second and then the content appears. But when I try to scrape it using Scrapy, I get nothing back (probably because of that loading step). Can I solve this with Scrapy, or should I use some other tool?
Here’s the link to the website if you want to see it: https://www.graana.com/project/601/lotus-lake-towers

Answers:

The page sends a GET request to fetch the property information, so you should mimic that same request in your code. (You can observe the GET call in the browser's developer tools under Network -> XHR.)

    # -*- coding: utf-8 -*-
    import json

    import scrapy


    class GranaSpider(scrapy.Spider):
        name = 'grana'
        # allowed_domains must be a list of domain names, not a bare string
        allowed_domains = ['www.graana.com']
        # Request the JSON endpoint the page calls, not the HTML page itself
        start_urls = ['https://www.graana.com/api/area/slug/601']

        def parse(self, response):
            # Scrapy has already issued the GET for each start URL,
            # so the response body here is the JSON payload.
            data = json.loads(response.text)
            print(data)
            # save `data` to your storage system here

The output is in JSON format; convert it to whatever structure suits you.
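As one possible way to do that conversion, the sketch below decodes a JSON string with the standard json module and flattens nested objects into a single-level dict, which maps easily onto a CSV row or a database record. The sample payload and its field names are invented placeholders, since the real API's response shape is not shown here, and flatten is a hypothetical helper, not part of Scrapy.

```python
import json

# Placeholder payload for illustration only; the real response will differ.
SAMPLE = '{"id": 601, "name": "example", "location": {"city": "example-city"}, "tags": ["a", "b"]}'


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys; lists are kept as values."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted key path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = flatten(json.loads(SAMPLE))
print(record)
```

Inside the spider, the same helper could be applied to the decoded response before storing it, assuming the endpoint returns a JSON object rather than a list.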


Answered By: Deepa MG

I know this question is old and already answered, but I wanted to share my solution after encountering a similar problem. The accepted answer was not helpful to me because I was not using Scrapy.

I wanted to scrape a website that first displays a loading page and then displays the actual page content.

Here’s an example of such a website:
[GIF showing the loading page animation of a website]

The requests library will not work for such websites. In my experience, requests.get(URL, headers=HEADERS) simply times out.

Solution

Use Selenium.

  • First you need to know approximately how long the loading page animation lasts. In the above website, it takes around 3 seconds.
  • The trick is to simply sleep your program for the duration of the animation after navigating to the website with driver.get(URL).
  • By the time the program finishes sleeping, the loading animation will be over, so we can safely extract the HTML of the actual page content using driver.page_source.

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    import time

    # the following options are only for setup purposes
    chrome_options = Options()
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')

    driver = webdriver.Chrome(options=chrome_options)

    URL = "https://www.myjob.mu/ShowResults.aspx?Keywords=&Location=&Category=39&Recruiter=Company&SortBy=MostRecent"

    driver.get(URL)
    time.sleep(5) # any number > 3 should work fine
    html = driver.page_source
    driver.quit() # close the browser once the HTML has been captured
    print(html)

The BeautifulSoup library can then be used to parse the HTML.
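A fixed sleep is a guess: too short and the content is missing, too long and time is wasted. A possible refinement, sketched below, is to poll driver.page_source until some text that only appears in the loaded page shows up. Here wait_for_marker and the marker string are hypothetical names, not part of Selenium; Selenium's own WebDriverWait class provides explicit waits along the same lines.

```python
import time


def wait_for_marker(driver, marker, timeout=10.0, poll=0.5):
    """Poll driver.page_source until `marker` appears or `timeout` seconds pass.

    `driver` only needs a `page_source` attribute, as a Selenium WebDriver has.
    Returns the HTML that contains the marker; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        html = driver.page_source
        if marker in html:
            return html
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} s")
        # Give the page a moment to finish loading before checking again
        time.sleep(poll)
```

With the driver from the code above, a call such as html = wait_for_marker(driver, "text that appears only after loading") could stand in for the time.sleep(5) line; the marker has to be chosen by inspecting the loaded page.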

Answered By: Bunny