What is the fastest/most lightweight way of getting HTML after JavaScript has executed?

Question

The problem is that the YouTube API for searching is very limiting, so I've resorted to web scraping the search result page. So far I've tried using Selenium to load the page and get the HTML, but it has quite a bit of delay when starting up.

Without JavaScript, the YouTube search result page does not get generated properly, so I can't just run a GET request on the URL.

Are there any other ways to get the rendered search result page?

My code right now

    def search(self, query):
        try:

            self.driver.get('https://www.youtube.com/results?search_query={}'.format(str(query)))

            self.wait.until(self.visible((By.ID, "video-title")))
            elements = self.driver.find_elements(By.XPATH, '//*[@id="video-title"]')
            results = []
            for element in elements:
                results.append([element.text, element.get_attribute('href')])
            return results
        except Exception:
            return []

This is part of a class that reuses the same Selenium instance until the program shuts down.

SOLUTION

    import requests
    from urllib.parse import quote_plus

    def search(self, query):
        # A plain GET is enough: the result data is embedded in the returned HTML.
        response = requests.get('https://www.youtube.com/results?search_query={}'.format(quote_plus(str(query))))
        html = response.text
        index = 1
        j = 0
        result = []
        while j <= 40:  # results are located at every 4th "videoId" tag
            newindex = html.find('"videoId":"', index)
            if newindex == -1:
                # no further "videoId" tags; stop instead of wrapping around
                break
            videonameindex = html.find('{"text"', newindex)
            index = newindex + 1
            if j % 4 == 0:
                # the title text follows the next {"text" key after the video ID
                videoname = html[videonameindex+8:videonameindex+100]
                name = videoname.split('}],')[0].replace('"', '')
                videoId = html[newindex:newindex+30].split(':')[1].split(',')[0].replace('"', '')
                # make sure the video ID is valid
                if len(videoId) != 11:
                    continue
                url = f'https://www.youtube.com/watch?v={videoId}'
                result.append([name, url])
            j += 1
        # self.conn is defined elsewhere in the class
        self.conn.commit()
        return result

The code is a bit longer, but now there is no long wait for Selenium to start up, and no need to wait for JavaScript to finish executing.

Thanks to @Benjamin Loison

Asked By: eroc1234

Answer 1

The fastest way with Selenium is to use the "eager" page load strategy and wait for the selector.

In my experience, though, you can probably get around 2x faster by switching to Playwright (async).
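
A minimal sketch of the eager strategy, assuming Selenium 4 with Chrome. The helper names (build_search_url, make_eager_driver, search) and the 10 second timeout are made up for illustration and are not from the question. With "eager", driver.get() returns once the DOM is ready instead of waiting for images and stylesheets, and the explicit wait then blocks only until the title elements exist.

```python
from urllib.parse import quote_plus


def build_search_url(query):
    """Return the YouTube results URL with the query URL-encoded."""
    return "https://www.youtube.com/results?search_query=" + quote_plus(str(query))


def make_eager_driver():
    """Create a Chrome driver whose get() returns at DOMContentLoaded."""
    # Imported here so the URL helper works even without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # "eager": do not wait for images, stylesheets and subframes to finish loading.
    options.page_load_strategy = "eager"
    return webdriver.Chrome(options=options)


def search(driver, query, timeout=10):
    """Return [title, href] pairs for the video titles on the results page."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(build_search_url(query))
    # Block only until the title elements are present in the DOM.
    elements = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.ID, "video-title"))
    )
    return [[element.text, element.get_attribute("href")] for element in elements]
```

The driver would be created once with make_eager_driver() and reused across calls to search(), as in the question's class.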

Answered By: pguardiario

Answer 2

If you run curl https://www.youtube.com/results?search_query=test, you will see that the result data you are looking for is part of the JavaScript variable ytInitialData. I would recommend just fetching this HTML file and parsing its ytInitialData variable. That way you don't need anything that executes JavaScript, such as Selenium, which is particularly slow and isn't required here.
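
A rough sketch of that idea using only the standard library. It assumes the page assigns a JSON object literal to ytInitialData and that each result sits under a "videoRenderer" key with "videoId" and "title" -> "runs" -> "text" fields; that layout is an assumption about YouTube's current markup and may change. The function names are hypothetical. The sample string stands in for the HTML that a plain GET request would return.

```python
import json


def extract_initial_data(html):
    """Return the object assigned to ytInitialData, or None if it is not found."""
    pos = html.find("ytInitialData")
    if pos == -1:
        return None
    brace = html.find("{", pos)
    if brace == -1:
        return None
    # raw_decode parses one JSON value and ignores the trailing ";</script>...".
    data, _ = json.JSONDecoder().raw_decode(html, brace)
    return data


def iter_video_renderers(node):
    """Yield every dict stored under a "videoRenderer" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoRenderer" and isinstance(value, dict):
                yield value
            else:
                yield from iter_video_renderers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_video_renderers(item)


def parse_results(html):
    """Return [title, url] pairs for the videos found in the embedded data."""
    data = extract_initial_data(html)
    if data is None:
        return []
    results = []
    for renderer in iter_video_renderers(data):
        video_id = renderer.get("videoId")
        runs = renderer.get("title", {}).get("runs", [])
        title = "".join(run.get("text", "") for run in runs)
        if video_id:
            results.append([title, f"https://www.youtube.com/watch?v={video_id}"])
    return results


# Tiny stand-in for a real results page (assumed structure, see above).
sample = (
    '<script>var ytInitialData = {"contents": [{"videoRenderer": '
    '{"videoId": "abcdefghijk", "title": {"runs": [{"text": "Demo"}]}}}]};</script>'
)
print(parse_results(sample))
```

Walking the parsed object by key avoids depending on character offsets or on how many times "videoId" repeats per result.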

Note: I am developing an open-source alternative to the YouTube Data API v3 using this method. I have an endpoint similar to what you are looking for by the way.

Answered By: Benjamin Loison

Answer 3

To extract the href attributes of the video titles from a fully rendered YouTube search result page using Selenium, a big improvement is to induce WebDriverWait for visibility_of_all_elements_located() and collect the values with a list comprehension:

    def search(self, query):
        try:
            self.driver.get('https://www.youtube.com/results?search_query={}'.format(str(query)))
            # wait up to 20 seconds until all matching title elements are visible
            elements = WebDriverWait(self.driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[@id='video-title']")))
            # return a flat list of hrefs
            return [my_elem.get_attribute("href") for my_elem in elements]
        except Exception:
            return []

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Answered By: undetected Selenium
