How to scrape Single Page Application websites in Python using bs4

Question:

I am scraping player names from the NBA website. The player index page is built as a single-page application, and the players are distributed across several pages in alphabetical order. I am unable to extract the names of all the players.
Here is the link: https://in.global.nba.com/playerindex/

from selenium import webdriver
from bs4 import BeautifulSoup

class make():
    def __init__(self):
        self.first=""
        self.last=""

driver = webdriver.PhantomJS(executable_path=r'E:\Downloads\Compressed\phantomjs-2.1.1-windows\bin\phantomjs.exe')

driver.get('https://in.global.nba.com/playerindex/')

html_doc = driver.page_source


soup = BeautifulSoup(html_doc,'lxml')

names = []

# this only sees the players rendered on the currently loaded page of the SPA
layer = soup.find_all("a", class_="player-name ng-isolate-scope")
for a in layer:
    span = a.find("span", class_="ng-binding")
    thing = make()
    thing.first = span.text
    spans = a.find("span", class_="ng-binding").find_next_sibling()
    thing.last = spans.text
    names.append(thing)
Asked By: Saurabh Rawat


Answers:

When dealing with SPAs, you shouldn’t try to extract info from the DOM directly, because the DOM is incomplete until a JS-capable browser runs the scripts that populate it with data. Open the page source, and you’ll see the page HTML doesn’t contain the data you need.
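
You can verify this quickly: fetch the raw HTML with requests and search it for a player you know is on the list. This is just a sanity check, not part of the final scraper:

import requests

# the raw HTML the server sends, before any JavaScript has run
html = requests.get('https://in.global.nba.com/playerindex/').text

# the names are rendered client-side, so this should print False
print('Steven Adams' in html)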

But most SPAs load their data using XHR requests. You can monitor network requests in the Developer Console (F12) to see the requests being made during page load.

Here, the player index page https://in.global.nba.com/playerindex/ loads its player list from https://in.global.nba.com/stats2/league/playerlist.json?locale=en

Simulate that request yourself, then pick out whatever you need from the response. Inspect the request headers to figure out what you need to send along with it.

import requests

if __name__ == '__main__':
    page_url = 'https://in.global.nba.com/playerindex/'
    s = requests.Session()
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0'}

    # visit the homepage to populate session with necessary cookies
    res = s.get(page_url)
    res.raise_for_status()

    json_url = 'https://in.global.nba.com/stats2/league/playerlist.json?locale=en'
    res = s.get(json_url)
    res.raise_for_status()
    data = res.json()

    player_names = [p['playerProfile']['displayName'] for p in data['payload']['players']]
    print(player_names)

output:

['Steven Adams', 'Bam Adebayo', 'Deng Adel', 'LaMarcus Aldridge', 'Kyle Alexander', 'Nickeil Alexander-Walker', ...
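
If you also want first and last names separately (which the make class in the question was building towards), the same payload can supply them. A minimal sketch, continuing from the data object above; the firstName and lastName keys are an assumption about the payload structure, so verify them against the actual JSON:

# continuing from `data` above; 'firstName'/'lastName' are assumed keys --
# check the actual JSON before relying on them
pairs = [
    (p['playerProfile'].get('firstName'), p['playerProfile'].get('lastName'))
    for p in data['payload']['players']
]
print(pairs[:5])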

Dealing with auth

One thing to watch out for: some websites require an authentication token to be sent with API requests. If one is used, you’ll see it attached to the requests in the network tab.

If you’re building a scraper that needs to be functional in the long(er) term, you might want to make the script more robust by extracting the token from the page and including it in requests.

This token (often a JWT, starting with ey...) usually sits somewhere in the HTML, encoded as JSON. Or it is sent to the client as a cookie that the browser then attaches to requests, or it arrives in a header. In short, it could be anywhere. Scan the requests and responses to figure out where the token comes from and how you can retrieve it yourself. For example, it might be embedded in the page like this:

...
<script>
const state = {"token": "ey......", ...};
</script>
import json
import re
import requests

res = requests.get('url/to/page')

# extract the token from the page. Here `state` is an object serialized as JSON;
# we grab everything between the `=` sign and the semicolon and deserialize it
state = json.loads(re.search(r'const state = (.*);', res.text).group(1))
token = state['token']

res = requests.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})
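
If the token is delivered as a cookie instead of being embedded in the HTML, a requests.Session stores it automatically after the first request, and you only need to copy it into whatever header the API expects. A minimal sketch; the cookie name access_token is a hypothetical placeholder, so check the network tab for the real one:

import requests

s = requests.Session()
s.get('url/to/page')  # the response sets the auth cookie on the session

# 'access_token' is a hypothetical cookie name; find the real one in DevTools
token = s.cookies.get('access_token')

res = s.get('url/to/api/with/auth', headers={'authorization': f'Bearer {token}'})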
Answered By: abdusco