Why is the HTML content I got from the inspector different from what I got from Requests?

Question:

Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines

import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')

This is the code I have so far.

It seems like the HTML content I got from the inspector is different from what I got from BeautifulSoup.
My guess is that they are preventing me from getting their data as they detected I am not accessing the site with a browser. If so, is there any way to bypass that?


(Update) Attempt with selenium:

from selenium import webdriver
import time
path = r"C:\Program Files (x86)\chromedriver.exe"
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()

Update 2 (page as loaded in DevTools): [screenshot]

Asked By: DJ-coding

||

Answers:

Any content that is loaded after the initial page load is unavailable to BS4 with your current method. This is because the content is loaded by an AJAX call made from JavaScript, and the requests library only downloads the initial HTML; it does not run JavaScript.
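To illustrate the point, here is a minimal offline sketch. The two HTML strings and the product-title class are made up for the example (they are not taken from the real site), and the standard library's HTMLParser stands in for BS4 so the snippet has no dependencies. The first string mimics what requests would receive, the second what the inspector shows after the JavaScript has run.

```python
from html.parser import HTMLParser


class ClassTextCollector(HTMLParser):
    """Collect the text of every element carrying a given CSS class.

    Simplified on purpose: it does not handle void tags such as <br>
    inside a matching element.
    """

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.css_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.texts.append(data.strip())


def titles(html):
    """Return the text of all elements with the (hypothetical) product-title class."""
    parser = ClassTextCollector("product-title")
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical markup: the server only sends an empty container plus a script.
STATIC_HTML = '<html><body><div id="app"></div><script src="/bundle.js"></script></body></html>'

# Hypothetical markup: the DOM after the script has fetched and inserted the data.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<h2 class="product-title">Example Red</h2>'
    '<h2 class="product-title">Another Red</h2>'
    '</div></body></html>'
)

print(titles(STATIC_HTML))    # nothing to find in the initial HTML
print(titles(RENDERED_HTML))  # the titles exist only after rendering
```

The same parser finds nothing in the first string and two titles in the second, which is the difference you are seeing between requests and the inspector.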

To get that content you will have to look at something like Selenium, which controls a real browser from Python or other languages. Each browser (Firefox, Chrome, etc.) needs its own separate driver.

Personally I use Chrome, so the drivers can be found here:

https://chromedriver.chromium.org/downloads

  1. Download the correct driver for your version of Chrome.
  2. Install selenium via pip.
  3. Create a scrape.py file and put the driver in the same folder.

Then, to get the HTML string to use with BS4:

from selenium import webdriver
import time

# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()

You should then be able to use the html variable with BS4.
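As a rough sketch of that last step, the snippet below feeds an HTML string to BS4 and pulls out the text of matching elements. The sample markup and the h2.product-title selector are assumptions for illustration only; use the inspector to find the real selector on the rendered page. It uses the built-in "html.parser" backend so lxml is not required.

```python
from bs4 import BeautifulSoup


def extract_texts(html, css_selector):
    """Parse an HTML string and return the stripped text of each matching element."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.select(css_selector)]


# Stand-in for browser.page_source; the markup here is hypothetical.
sample_html = (
    '<div id="app">'
    '<h2 class="product-title"> Example Red </h2>'
    '<h2 class="product-title">Another Red</h2>'
    '</div>'
)

print(extract_texts(sample_html, "h2.product-title"))
```

In the Selenium script you would call extract_texts(html, ...) with the html variable instead of sample_html.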

Answered By: Lewis Morris

I’ll turn my comment into an answer because it is a solution to your problem:

As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, look at the "Network" tab of your dev tools and filter for "Fetch" requests.

This one could be particularly interesting for you: [screenshot]

You don’t need Selenium or BeautifulSoup at all; you can just use requests and parse the JSON, if you are good enough 😉

Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'

You get an error if you don’t add the tenant header.
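A possible Python translation of that cURL command is sketched below. It uses only the standard library so it runs without extra packages; the function names build_request and fetch_products are made up for this example, and the shape of the returned JSON is not shown here, so inspect it before relying on any particular key.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://api.commerce7.com/v1/product/for-web"


def build_request(collection_slug, tenant):
    """Build the same GET request as the cURL command, without sending it."""
    query = urllib.parse.urlencode({"collectionSlug": collection_slug})
    # The API rejects the call unless the tenant header is present.
    return urllib.request.Request(f"{API_URL}?{query}", headers={"tenant": tenant})


def fetch_products(collection_slug="red-wines", tenant="one-stop-wine-shop"):
    """Send the request and return the decoded JSON body."""
    request = build_request(collection_slug, tenant)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling fetch_products() performs the network request. With the requests library the equivalent call would be requests.get(API_URL, params={"collectionSlug": "red-wines"}, headers={"tenant": "one-stop-wine-shop"}).json().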

And that’s it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.

Answered By: Loïc