Requests-html: can I get status codes of all requests (or a Selenium alternative)?

Question:

I have the following code:

from requests_html import HTMLSession

ses = HTMLSession()
r = ses.get(MYURL)  # start a headless chrome browser and load MYURL
r.render(keep_page=True)  # This will now 'render' the html page,
                          # which means that, like a real browser,
                          # it triggers all requests for dependent
                          # links: .css, .js, .jpg, .gif, etc.

Calling render() triggers a load of requests for JavaScript, bitmaps, etc.
Is there any way I can get a trace of the status codes of each of these requests?
I'm mostly interested in 404, but 403 and 5xx errors might be interesting as well.

One use case would, for example, be:

• Go to a page or a sequence of pages

• Then report how many requests failed and which URLs were accessed.

If this is not possible with requests-html but reasonably simple with Selenium, I can switch to Selenium.

Addendum: ugly workaround 1:

I can set up logging to log into a file and set the log level to DEBUG.
Then I can try to parse the logs of websockets.protocol, which contain strings like
{\"url\":\"https://my.server/example.gif\",\"status\":404,\"statusText\":\"...

Issues:

Activating log level DEBUG for the file seems to activate something else, because suddenly a lot of debug info is also logged to stdout.

For example:

[I:pyppeteer.launcher] Browser listening on: ws://127.0.0.1:45945/devtools/browser/bc5ce097-e67d-455e-8a59-9a4c213263c1
[D:pyppeteer.connection.Connection] SEND: {"id": 1, "method": "Target.setDiscoverTargets", "params": {"discover": true}}

Also, it's not really fun to parse this in real time and to correlate it with the URL I used in my code.

Addendum: ugly workaround 2:

Even worse for correlating, but nicer for parsing and for just identifying 404s; it only works if I am in control of the HTTP server.

By parsing the logs of the HTTP server (nginx), I can even set up a custom log format in CSV with just the data I'm interested in.
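
As a rough illustration of the parsing side, here is a sketch that assumes the server writes a CSV access log whose first two columns are status and request URL; the file name and column layout are assumptions for illustration only, not part of any standard nginx setup:

import csv

# Sketch: collect 404s from a hypothetical CSV access log with columns status,url.
# The file name and column order are assumptions for illustration only.
failed = []
with open("access_custom.csv", newline="") as f:
    for row in csv.reader(f):
        status, url = row[0], row[1]
        if status == "404":
            failed.append(url)

print("{0} requests returned 404".format(len(failed)))
for url in failed:
    print(url)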

Addendum: ugly workaround 3:

Using Python logging (a dedicated handler and filter for pyppeteer), I can intercept a JSON string describing the responses from the pyppeteer.connection.CDPSession logger without having stderr become polluted.

The filter allows me to retrieve the data in real time.
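
A rough sketch of what such a filter might look like, assuming the pyppeteer.connection.CDPSession logger emits messages of the form 'RECV: {json}' (as suggested by the SEND line quoted above) and that Network.responseReceived events carry the response status and URL; both assumptions may vary between pyppeteer versions:

import json
import logging

class ResponseFilter(logging.Filter):
    # Workaround 3 sketch: pull response status/URL out of the CDP debug
    # messages in real time. Assumes messages look like 'RECV: {...}';
    # the exact format may differ between pyppeteer versions.
    def filter(self, record):
        msg = record.getMessage()
        if msg.startswith("RECV: "):
            try:
                data = json.loads(msg[len("RECV: "):])
            except ValueError:
                return False
            if data.get("method") == "Network.responseReceived":
                resp = data["params"]["response"]
                print(resp["status"], resp["url"])
        return False  # drop the record so handlers/stderr stay clean

cdp_logger = logging.getLogger("pyppeteer.connection.CDPSession")
cdp_logger.setLevel(logging.DEBUG)
cdp_logger.addFilter(ResponseFilter())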

This is still quite hackish. So looking for a better solution.

Asked By: gelonida


Answers:

Give the following a try and see if it's what you're after. It's strictly a pyppeteer version (rather than requests_html) and relies on unexposed private variables, so it's fairly susceptible to breakage with version updates.

import asyncio
from pyppeteer import launch
from pyppeteer.network_manager import NetworkManager

def logit(event):
    # Print the URL and HTTP status of every response the browser receives.
    req = event._request
    print("{0} - {1}".format(req.url, event._status))

async def main():
    browser = await launch({"headless": False})
    page = await browser.newPage()
    page._networkManager.on(NetworkManager.Events.Response, logit)
    await page.goto('https://www.google.com')
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())

Checking the source of requests_html, the browser page object seems to be buried pretty deep, so getting at the NetworkManager isn't exactly straightforward. If you really want it working from within requests_html, it's probably easiest to monkeypatch. Here's an example:

import asyncio
from requests_html import HTMLSession, TimeoutError, HTML
from pyppeteer.network_manager import NetworkManager
from typing import Optional, Union

def logit(event):
    req = event._request
    print("{0} - {1}".format(req.url, event._status))

async def _async_render(self, *, url: str, script: str = None, scrolldown, sleep: int, wait: float, reload, content: Optional[str], timeout: Union[float, int], keep_page: bool, cookies: list = [{}]):
    """ Handle page creation and js rendering. Internal use for render/arender methods. """
    try:
        page = await self.browser.newPage()
        # The only change vs. the stock requests_html implementation:
        # log every response seen by the page's network manager.
        page._networkManager.on(NetworkManager.Events.Response, logit)

        # Wait before rendering the page, to prevent timeouts.
        await asyncio.sleep(wait)

        if cookies:
            for cookie in cookies:
                if cookie:
                    await page.setCookie(cookie)

        # Load the given page (GET request, obviously.)
        if reload:
            await page.goto(url, options={'timeout': int(timeout * 1000)})
        else:
            await page.goto(f'data:text/html,{self.html}', options={'timeout': int(timeout * 1000)})

        result = None
        if script:
            result = await page.evaluate(script)

        if scrolldown:
            for _ in range(scrolldown):
                await page._keyboard.down('PageDown')
                await asyncio.sleep(sleep)
        else:
            await asyncio.sleep(sleep)

        if scrolldown:
            await page._keyboard.up('PageDown')

        # Return the content of the page, JavaScript evaluated.
        content = await page.content()
        if not keep_page:
            await page.close()
            page = None
        return content, result, page
    except TimeoutError:
        await page.close()
        page = None
        return None


ses = HTMLSession()
r = ses.get('https://www.google.com')  # fetch the page; the monkeypatched render() below does the headless-chrome part
html = r.html
html._async_render = _async_render.__get__(html, HTML)
html.render()
Answered By: clockwatcher

This is also possible using Selenium with Chrome.
There's a Python package for that called Selenium-interceptor.
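
For reference (and independent of that package), a hedged sketch of a plain Selenium/ChromeDriver route: enable the performance log via the goog:loggingPrefs capability and read Network.responseReceived DevTools events from it. Capability handling and the headless flag can differ between Selenium and Chrome versions:

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # flag name depends on the Chrome version
# Ask ChromeDriver to record DevTools protocol events in the "performance" log.
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://www.example.com")

for entry in driver.get_log("performance"):
    message = json.loads(entry["message"])["message"]
    if message.get("method") == "Network.responseReceived":
        resp = message["params"]["response"]
        if resp["status"] >= 400:
            print(resp["status"], resp["url"])

driver.quit()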

Answered By: kaliiiiiiiii