How to extract hidden HTML content with Scrapy?

Question:

I'm using Scrapy (on PyCharm v2020.1.3) to build a spider that crawls this webpage: "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas". I want to extract the product names and the breadcrumb in a list format, and save the results in a CSV file.
I tried the following code, but it returns empty brackets []. After I inspected the HTML I discovered that the content is rendered client-side by AngularJS.
If someone has a solution for that it would be great.
Thank you

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas']

    def parse(self, response):
        product = response.css('a.shelfProductTile-descriptionLink::text').extract()
        yield {'productnames': product}
Asked By: Leon Ben


Answers:

You won't be able to get the desired products by parsing the HTML alone. The page is heavily JavaScript oriented, so Scrapy won't see the rendered content.

The simplest way to get the product names (I'm not sure what you mean by breadcrumbs) is to re-engineer the HTTP requests. The Woolworths website fetches the product details from an API. If we can mimic the request the browser makes to obtain that product information, we can get the data in a nice, neat format.

First you have to set ROBOTSTXT_OBEY = False in settings.py. Be careful about protracted scrapes of this data, because your IP will probably get banned at some point.
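
For reference, that's a one-line change in the project's settings.py (everything else generated by scrapy startproject can stay as it is):

# settings.py
# Scrapy obeys robots.txt by default; the answer above requires turning that
# off so the requests to the Woolworths API are not dropped.
ROBOTSTXT_OBEY = False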

Code Example

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['woolworths.com.au']

    # Query parameters the browser sends along with the API request.
    data = {
        'excludeUnavailable': 'true',
        'source': 'RR-Best Sellers',
    }

    def start_requests(self):
        # API endpoint the iced-teas category page calls for its product data.
        url = 'https://www.woolworths.com.au/apis/ui/products/58520,341057,305224,70660,208073,69391,69418,65416,305227,305084,305223,427068,201688,427069,341058,305195,201689,317793,714860,57624'
        yield scrapy.Request(url=url, meta=self.data, callback=self.parse)

    def parse(self, response):
        # The endpoint returns JSON: a list with one dictionary per product.
        data = response.json()

        for a in data:
            yield {
                'name': a['Name'],
            }

Explanation

We start off with our defined URL in start_requests. This URL is the specific API endpoint Woolworths uses to obtain the information for iced teas. For any other link on Woolworths, the part of the URL after /products/ will be specific to that part of the website.

The reason we're using this is that driving a browser is slow and brittle. Hitting the API directly is fast, and the information we get back is usually highly structured, which is much better for scraping.

So how do we get the URL, you may be asking? You need to inspect the page and find the correct request. Open the browser's network tools and reload the website; you'll see a bunch of requests. Usually the largest request is the one with all the data. Clicking on it and then on Preview gives you a panel on the right-hand side with all the details of the products.

[Screenshot: the network requests]

In this next image, you can see a preview of the product data.

[Screenshot: preview of the product data]

We can then get the request URL and anything else we need from this request.

[Screenshot: the request URL]

I will often copy this request as cURL (a Bash command), as seen here:

[Screenshot: copying the request as cURL]

I then paste it into curl.trillworks.com, which converts cURL commands to Python, giving you nicely formatted headers and any other data needed to mimic the request.

Putting this into Jupyter and playing about, it turns out you only need the params, NOT the headers, which makes things much simpler.

So, back to the code. We make a request and use the meta argument to pass the data along; because the dictionary is defined outside the function, we have to refer to it as self.data. We then specify parse as the callback.

We can use the response.json() method to convert the JSON body into a list of Python dictionaries, one per product. You MUST have Scrapy v2.2 or later to use this method. Otherwise you can use data = json.loads(response.text), but you'll have to put import json at the top of the script.
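
On older Scrapy versions, that fallback looks roughly like this; only the parse method changes, the rest of the spider stays the same:

import json  # goes at the top of the script

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    # ... allowed_domains, data and start_requests exactly as above ...

    def parse(self, response):
        # json.loads on the raw body is the pre-2.2 equivalent of response.json().
        data = json.loads(response.text)
        for a in data:
            yield {'name': a['Name']}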

From the preview, and from playing about with the JSON via requests, we can see that these product dictionaries sit inside a list, so we can use a for loop to go over each product, which is what the code above does.

We then yield a dictionary to extract the data: a refers to each product, which is its own dictionary, and a['Name'] looks up the key 'Name' in that dictionary, giving us the product name. To get a better feel for this, I always use the requests package in Jupyter to figure out the correct way to get the data I want.
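
As a rough sketch of that kind of exploratory session (the endpoint and query params are copied from the spider above, with the product-ID list shortened for readability):

import requests

url = ('https://www.woolworths.com.au/apis/ui/products/'
       '58520,341057,305224,70660,208073')
params = {'excludeUnavailable': 'true', 'source': 'RR-Best Sellers'}

# As noted above, only the params are needed here -- no headers.
products = requests.get(url, params=params).json()

print(type(products))              # list -- one entry per product
print(sorted(products[0].keys()))  # discover the available fields, e.g. 'Name'
print(products[0]['Name'])         # the value the spider yields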

The only thing left to do is to use scrapy crawl test -o products.csv to output this to a CSV file.
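
If you would rather not pass -o on every run, a feed export can also be declared once in settings.py (supported from Scrapy 2.1; the filename here is just an example):

# settings.py
FEEDS = {
    'products.csv': {'format': 'csv'},
}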

I can't really help you more than this until you specify any other data you want from this page. Please remember that you're going against what the site wants you to scrape. For any other pages on that website you will need to find the specific API URL that returns those products; I have shown you the way to do this, and if you want to automate it, it's worth your while struggling with it yourself. We are here to help, but an attempt on your part is how you're going to learn to code.

Additional Information on Handling Dynamic Content

There is a wealth of information on this topic. Here are some guidelines to think about when looking at JavaScript-oriented websites. The default should be to try to re-engineer the requests the browser makes to load the page's information. This is what the JavaScript on this site and many others is doing: it displays new information without reloading the page by making an HTTP request in the background. If we can mimic that request, we can get the information we want. This is the most efficient way to get dynamic content.

In order of preference

  1. Re-engineering the HTTP requests
  2. scrapy-splash
  3. scrapy-selenium
  4. Importing the selenium package directly into your scripts

scrapy-splash is slightly better than using the selenium package directly, as it pre-renders the page, giving you access to selectors that actually contain the data. Selenium is slow and prone to errors, but it will let you mimic browser activity.
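
For reference, a minimal scrapy-splash sketch might look like the following; it assumes a Splash instance running on the default local port, leaves out the middleware settings the docs describe, and the spider name is just an example:

import scrapy
from scrapy_splash import SplashRequest

# settings.py also needs SPLASH_URL = 'http://localhost:8050'
# plus the scrapy_splash middlewares listed in the docs.


class IcedTeaSplashSpider(scrapy.Spider):
    name = 'iced_tea_splash'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # Let Splash render the JavaScript before handing the HTML back to Scrapy.
        yield SplashRequest(url, callback=self.parse, args={'wait': 2})

    def parse(self, response):
        # With the page pre-rendered, the question's original selector has a chance to match.
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name}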

There are multiple ways to include selenium in your scripts; see below for an overview.

Recommended Reading/Research

  1. Look at the Scrapy documentation with regard to dynamic content here.
    This will give you an overview of the steps for handling dynamic content. Generally speaking, selenium should be thought of as a last resort; it's pretty inefficient for larger-scale scraping.

  2. If you are considering adding the selenium package into your script, this might be the lowest barrier to entry for getting your script working, but it's not necessarily efficient. At the end of the day Scrapy is a framework, but there is a lot of flexibility in adding third-party packages. The spider scripts are just Python classes with the Scrapy architecture imported in the background. As long as you're mindful of the response object and translate some of the selenium calls to work with Scrapy, you should be able to drop selenium into your scripts. I would say this is probably the least efficient solution, though.

  3. Consider using scrapy-splash: Splash pre-renders the page and allows you to add in JavaScript execution. Docs are here, and there's a good article from Scrapinghub here.

  4. scrapy-selenium is a package with a custom Scrapy downloader middleware that allows you to perform selenium actions and execute JavaScript. Docs here. You'll need to have a play around to get the procedure down from this, as it doesn't have the same level of detail as the selenium package itself; a minimal sketch follows below.
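
As a rough illustration of that last option, a scrapy-selenium spider could be sketched as below; the driver settings belong in settings.py as per the package's README, the selector is reused from the question, and the spider name is just an example:

import scrapy
from scrapy_selenium import SeleniumRequest

# settings.py needs, per the scrapy-selenium README, something like:
#   SELENIUM_DRIVER_NAME = 'firefox'
#   SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
#   SELENIUM_DRIVER_ARGUMENTS = ['-headless']
#   DOWNLOADER_MIDDLEWARES = {'scrapy_selenium.SeleniumMiddleware': 800}


class IcedTeaSeleniumSpider(scrapy.Spider):
    name = 'iced_tea_selenium'

    def start_requests(self):
        url = 'https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas'
        # wait_time gives the AngularJS front end a moment to render the product tiles.
        yield SeleniumRequest(url=url, callback=self.parse, wait_time=5)

    def parse(self, response):
        for name in response.css('a.shelfProductTile-descriptionLink::text').getall():
            yield {'name': name}
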
Answered By: AaronS

Revisiting this question in 2023. As mentioned above, using requests, Beautiful Soup or Scrapy will be challenging for this particular website because of how it's structured.

However, one way of circumventing this is by using the playwright package. Here’s an example of how you could lift the product names, prices and unit costs from the page.

import csv

from playwright.sync_api import sync_playwright


with sync_playwright() as playwright:
    browser = playwright.webkit.launch(headless=False)  # or 'chromium' or 'firefox'
    page = browser.new_page()
    page.goto(
        "https://www.woolworths.com.au/shop/browse/drinks/cordials-juices-iced-teas/iced-teas"
    )

    # Wait for page to load Javascript
    page.wait_for_selector("span.cartControls-addCart")

    # Fetch products on page and total count
    item = page.locator("section.product-tile-v2")
    count = item.count()

    result = []
    out_of_stock = "div.product-tile-unavailable-tag.ng-star-inserted"
    for i in range(count):
        # Fetch single product
        product = item.nth(i)

        # Will skip if product is out of stock
        if product.locator(out_of_stock).count() == 0:
            name = product.locator("a.product-title-link").inner_text()
            price = product.locator("div.primary").inner_text()
            unit = product.locator("span.price-per-cup").inner_text()
            result.append((name, price, unit))
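
To finish what the question asked for and save the results to a CSV file (which is presumably why csv is imported at the top), the script could be extended along these lines; the output filename is just an example:

    # Still inside the `with sync_playwright() as playwright:` block,
    # after the for loop above has filled `result`.
    with open("products.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price", "unit"])  # header row
        writer.writerows(result)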

The breadcrumbs could be extracted using the same approach: grab the container holding the breadcrumb links and iterate over it, as in the sketch below.
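
For instance, something along these lines would do it while the page is still open; note that the breadcrumb selector here is an illustrative guess and needs to be checked against the live page's markup:

    # Hypothetical selector -- inspect the page to confirm the real one.
    crumbs = page.locator("nav.breadcrumbs a")
    breadcrumb = [crumbs.nth(i).inner_text() for i in range(crumbs.count())]
    # e.g. something like ['Drinks', 'Cordials, Juices & Iced Teas', 'Iced Teas']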

Removing the headless=False option (Playwright defaults to headless mode) will prevent the browser window from opening on your desktop.

Answered By: cool d