Python BeautifulSoup web extraction loads data before the actual page tags load

Question:

I'm using this code to scrape some data from the link https://website.grader.com/results/www.dubizzle.com

The code is as below

#!/usr/bin/python
import urllib

from bs4 import BeautifulSoup

def getting_urls_of_all_pages():
    url_rent_flat = 'https://website.grader.com/results/dubizzle.com'
    every_property_in_a_page_data_extraction(url_rent_flat)

def every_property_in_a_page_data_extraction(url):
    htmlfile = urllib.urlopen(url).read()
    soup = BeautifulSoup(htmlfile, "html.parser")

    print soup

    try:
        # Locate the <span> that holds the page-size value
        sizeofweb = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"})
        print sizeofweb.get_text().encode("utf-8")

    except StandardError as e:
        print "Error was {0}".format(e)

getting_urls_of_all_pages()

The part of the HTML I am extracting is below

Snap:
https://www.dropbox.com/s/7dwbaiyizwa36m6/5.PNG?dl=0

Code:

<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">1.1</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">MB</span>
</div>

Problem:
The website takes around 10-15 seconds to render the HTML source that contains the tags I want to extract, as referenced in the code.

When the code fetches the page with htmlfile = urllib.urlopen(url).read(), I think it gets the pre-load HTML that is served during those first 10-15 seconds, before the real content appears.

How can I make the code pause and load the data after 15+ seconds, so that the right HTML, with the tags I want to extract, is what loads into the program?
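(For reference, the extraction itself works once the rendered HTML is actually present. A minimal check against the snippet below, using the fragment as static input; the "html.parser" backend is just one parser choice:)

```python
from bs4 import BeautifulSoup

# The rendered fragment from the page, used here as static input
html = '''
<div class="result-value" data-reactid=".0.0.3.0.0.3.$0.1.1">
<span data-reactid=".0.0.3.0.0.3.$0.1.1.0">1.1</span>
<span class="result-value-unit" data-reactid=".0.0.3.0.0.3.$0.1.1.1">MB</span>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
size = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"})
unit = soup.find('span', {'class': 'result-value-unit'})
print(size.get_text(), unit.get_text())  # 1.1 MB
```

(So the selector is fine; the issue is that urllib never receives this markup, because the values are filled in by JavaScript after the initial page load.)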

Asked By: info


Answers:

Using Selenium WebDriver will solve your problem. Specifically, it can wait up to a specified number of seconds for an element to appear before processing further. Something like the following should work.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get(baseurl)

try:
    # Wait up to 60 seconds for the result value to be rendered;
    # the selector is based on the markup in the question
    wait = WebDriverWait(driver, 60)
    element = wait.until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.result-value span"))
    )
finally:
    driver.quit()
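Once the wait completes, driver.page_source holds the rendered HTML, which can be handed to BeautifulSoup for the same extraction as in the question. A sketch of the full flow; the helper names and the 60-second timeout are illustrative, not part of any library API:

```python
from bs4 import BeautifulSoup

def extract_page_size(html):
    """Pull the page-size value out of rendered HTML using the
    same data-reactid selector as in the question."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find('span', {'data-reactid': ".0.0.3.0.0.3.$0.1.1.0"})
    return span.get_text() if span else None

def fetch_rendered_html(url, timeout=60):
    """Load the page in a real browser and return its HTML only
    after the result values have been rendered."""
    # Selenium imports are kept local so extract_page_size stays
    # usable without a browser installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "div.result-value span"))
        )
        return driver.page_source
    finally:
        driver.quit()

# Usage (requires Firefox and geckodriver):
# html = fetch_rendered_html("https://website.grader.com/results/dubizzle.com")
# print(extract_page_size(html))
```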
Answered By: user6399774