Can't locate the correct beautifulsoup class and id combo
Question:
I have the following code
from bs4 import BeautifulSoup
import requests
URL = 'https://www.youtube.com/gaming/games'
response = requests.get(URL).text
soup = BeautifulSoup(response, 'html.parser')
elem = soup.find_all('a', class_ = 'yt-simple-endpoint focus-on-expand style-scope ytd-game-details-renderer')
print(elem)
I am trying to isolate all the individual games on https://www.youtube.com/gaming/games.
I would like to get just the game name and how many people are watching. My issue is that I can’t find the right tag and class_ combination to pass to find_all.
I’ve tried soup.find_all with each of the following:
('a', class_ = 'yt-simple-endpoint focus-on-expand style-scope ytd-game-details-renderer')
('game', class_ = 'style-scope ytd-game-card-renderer')
(class_ = 'style-scope ytd-grid-renderer')
(id = 'items')
And many different variations.
If I just use find_all('div') I get random data. I really think (id = 'items') is my solution, but aside from 'div' I get the same response every time, a pair of brackets []. I’ve also tried searching the individual div class objects I get in the results, but so far I’m getting the same [] results or random data that I don’t need.
If I use find instead of find_all (elem = soup.find(id='items')) I get "None" as a response.
I’m looking at the viewer count, which has an id of 'live-viewers-count', and it still prints [].
Answers:
You can’t do this with BeautifulSoup alone, because the page content is loaded dynamically with JavaScript and BeautifulSoup doesn’t run JavaScript.
If you right-click the page and select "show page source", you’ll see that it is mostly compiled JavaScript.
To scrape YouTube, I’d either use Selenium to run a headless web browser, or Js2Py if you need performance.
Or simply use the YouTube APIs: https://developers.google.com/youtube/v3/docs ^_^’
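To see why every find_all call comes back empty, here is a minimal sketch using only the standard library's html.parser, which is the parser BeautifulSoup uses when you pass 'html.parser'. The HTML string is a made-up stand-in for what requests receives, not YouTube's real markup, and IdCollector is a hypothetical helper name. The element with id 'items' is only created when a browser runs the script, so a parser that reads the raw source should never see it as a tag.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the raw response: the visible markup is created
# by the script at runtime, so it is not present as real tags in the source.
STATIC_HTML = """
<html><body>
<div id="app"></div>
<script>
document.getElementById("app").innerHTML = "<div id='items'>Valorant</div>";
</script>
</body></html>
"""


class IdCollector(HTMLParser):
    """Collect the id attribute of every tag the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.append(value)


collector = IdCollector()
collector.feed(STATIC_HTML)
found_ids = collector.ids
print(found_ids)  # 'items' is absent: it only exists as text inside <script>
```

The same thing happens with soup.find(id='items') on the real page: the id appears only inside script text, so there is no matching tag to return.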
Update
Here’s how to traverse the game data JSON elements.
First, narrow down to game_data, which is a list of JSON elements.
# `main` is the <script> text extracted in the original answer below
game_data = (
    json.loads(main[20:-1])
    ['contents']
    ['twoColumnBrowseResultsRenderer']
    ['tabs'][0]
    ['tabRenderer']
    ['content']
    ['sectionListRenderer']
    ['contents'][0]
    ['itemSectionRenderer']
    ['contents'][0]
    ['shelfRenderer']
    ['content']
    ['gridRenderer']
    ['items']
)
Now iterate over the list. For each element, there’s a section of the data packet we’ll call details, which contains the game name and views.
Then use the paths I showed in my original answer to capture name and view count for each game.
for game in game_data:
    # Each item wraps the fields we want in three nested renderers
    details = (
        game
        ['gameCardRenderer']
        ['game']
        ['gameDetailsRenderer']
    )
    game_name = details['title']['simpleText']
    view_ct = details['liveViewersText']['runs'][0]['text']
    print(f"Game: {game_name} / Views: {view_ct}")
Output
Game: Valorant / Views: 100K
Game: Grand Theft Auto V / Views: 61K
Game: Dota 2 / Views: 57K
Game: Minecraft / Views: 50K
# ...
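The loop above raises KeyError as soon as one item lacks one of the expected keys. A minimal sketch of a more forgiving lookup follows, assuming that some grid items may have a different shape (that is a guess, not something confirmed for this page). The helper name dig and the sample data are hypothetical; the sample only mimics the structure shown in this answer.

```python
def dig(obj, *path, default=None):
    """Follow keys/indexes in `path`; return `default` if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Made-up sample in the same shape as the items in game_data.
sample_games = [
    {"gameCardRenderer": {"game": {"gameDetailsRenderer": {
        "title": {"simpleText": "Valorant"},
        "liveViewersText": {"runs": [{"text": "100K"}]},
    }}}},
    {"someOtherRenderer": {}},  # hypothetical entry with a different shape
]

rows = []
for game in sample_games:
    details = dig(game, "gameCardRenderer", "game", "gameDetailsRenderer")
    if details is None:
        continue  # skip entries that are not game cards
    rows.append((dig(details, "title", "simpleText"),
                 dig(details, "liveViewersText", "runs", 0, "text")))

print(rows)
```

Replacing sample_games with the real game_data list should give one (name, views) tuple per game card, with anything unexpected skipped instead of crashing the loop.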
Original answer
All of the data you need is stored as JSON in one of the <script> tags; it’s just a pain to follow the nested object down to the fields you need. You can see it’s all there if you look at soup.body.
I had a few spare minutes just now, so this should get you started. It shows how to get the game name and live viewer count for the first game currently listed (‘Valorant’):
import json
# buried as JSON in a <script> inside <body>
main = soup.body.find_all('script')[13].contents[0]
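The hard-coded script index [13] and the slice [20:-1] used below depend on the exact page layout at the time of writing, so they may stop working when the page changes. A minimal sketch of a less position-dependent approach is to search for a marker and let json.JSONDecoder.raw_decode find where the object ends. The marker "ytInitialData" is an assumption about the variable name in the script (it is not stated in this answer), and extract_json_after is a hypothetical helper name.

```python
import json


def extract_json_after(text, marker):
    """Return the first JSON object that follows `marker` in `text`, or None."""
    start = text.find(marker)
    if start == -1:
        return None
    # Skip past the marker to the first opening brace of the object.
    brace = text.find("{", start + len(marker))
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores whatever trails it (e.g. ';').
    value, _end = json.JSONDecoder().raw_decode(text, brace)
    return value


# Hypothetical script text; the variable name is an assumption.
sample_script = 'var ytInitialData = {"contents": {"items": [1, 2, 3]}};'
data = extract_json_after(sample_script, "ytInitialData")
print(data["contents"]["items"])
```

With the real page you would pass in the text of whichever <script> tag contains the marker, for example by looping over soup.find_all('script') and checking each tag's .string, instead of relying on position 13.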
This is how you get to game name (you can iterate instead of indexing [0] to get all the games):
# Game name
print('Game:', json.loads(main[20:-1])
    ['contents']
    ['twoColumnBrowseResultsRenderer']
    ['tabs'][0]
    ['tabRenderer']
    ['content']
    ['sectionListRenderer']
    ['contents'][0]
    ['itemSectionRenderer']
    ['contents'][0]
    ['shelfRenderer']
    ['content']
    ['gridRenderer']
    ['items'][0]
    ['gameCardRenderer']
    ['game']
    ['gameDetailsRenderer']
    ['title']
    ['simpleText']
)
Output
Game: Valorant
And this is Viewer Count:
# Live viewer count
print('Live Viewers:', json.loads(main[20:-1])
    ['contents']
    ['twoColumnBrowseResultsRenderer']
    ['tabs'][0]
    ['tabRenderer']
    ['content']
    ['sectionListRenderer']
    ['contents'][0]
    ['itemSectionRenderer']
    ['contents'][0]
    ['shelfRenderer']
    ['content']
    ['gridRenderer']
    ['items'][0]
    ['gameCardRenderer']
    ['game']
    ['gameDetailsRenderer']
    ['liveViewersText']
    ['runs'][0]
    ['text']
)
Output
Live Viewers: 100K