Neither selenium nor bs4 can find div in page
Question:
I am trying to scrape a Craigslist results page, and neither bs4 nor selenium can find the elements in the page even though I can see them on inspection using dev tools. The results are in list items with class cl-search-result, but it seems the soup returned has none of the results.
This is my script so far. It looks like even the soup that is returned is not the same as the HTML I see when I inspect with dev tools. I am expecting this script to return 42 items, which is the number of search results.
Here is the script:
import time
import datetime
from collections import namedtuple
import selenium.webdriver as webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import Select
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import ElementNotInteractableException
from bs4 import BeautifulSoup
import pandas as pd
import os
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0'
firefox_driver_path = os.path.join(os.getcwd(), 'geckodriver.exe')
firefox_service = Service(firefox_driver_path)
firefox_option = Options()
firefox_option.set_preference('general.useragent.override', user_agent)
browser = webdriver.Firefox(service=firefox_service, options=firefox_option)
browser.implicitly_wait(7)
url = 'https://baltimore.craigslist.org/search/sss#search=1~list~0~0'
browser.get(url)
soup = BeautifulSoup(browser.page_source, 'html.parser')
print(soup)
posts_html = soup.find_all('li', {'class': 'cl-search-result'})
print('Collected {0} listings'.format(len(posts_html)))
Answers:
The following code worked for me. It printed: Collected 120 listings
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver

url = 'https://baltimore.craigslist.org/search/sss#search=1~list~0~0'
browser = webdriver.Chrome()
browser.get(url)
sleep(3)  # crude fixed wait for the AJAX-rendered results to load
soup = BeautifulSoup(browser.page_source, 'html.parser')
posts_html = soup.find_all('li', {'class': 'cl-search-result'})
print('Collected {0} listings'.format(len(posts_html)))
Edit 1: The get method's wait flaw
As per the Selenium documentation, the webdriver get method "will wait until the page has fully loaded (that is, the 'onload' event has fired)", but "if your page uses a lot of AJAX on load then WebDriver may not know when it has completely loaded". Because of this, it's generally recommended to use either time.sleep() or the WebDriverWait class to give enough time for all of the asynchronous requests to complete.