Get YouTube's most replayed data through web scraping

Question:

I want to get the data behind YouTube’s "heat-map" feature, which is present only on videos with certain characteristics. This is an example. I want to retrieve this data somehow, yet the YouTube APIs don’t provide it, and this API doesn’t always work. I’m aware it probably uses the same approach, but I want a reliable source of information. For the web-scraping approach, I have tried Selenium with the XPath of the element (you can find it in the video page’s HTML by searching for the class ytp-heat-map-path), like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Absolute XPath of the heat-map <path> element, copied from the page's HTML
HEAT_MAP_XPATH = "/html/body/ytd-app/div[1]/ytd-page-manager/ytd-watch-flexy/div[3]/div/ytd-player/div/div/div[31]/div[1]/div[1]/div[2]/svg/defs/clipPath/path"

driver = webdriver.Firefox()
driver.get("https://www.youtube.com/watch?v=09wcDevb1q4")

# Poll until the element appears
while len(driver.find_elements(By.XPATH, HEAT_MAP_XPATH)) == 0:
    pass

a = driver.find_element(By.XPATH, HEAT_MAP_XPATH)

I have also tried BeautifulSoup, searching by class:

mydivs = soup.find_all("path", {"class": "ytp-heat-map-path"})

Neither approach finds the data. I’m happy with a solution based on web scraping or any other method. Thanks.

Asked By: Kevin M.

||

Answers:

The data you want is the value of the d attribute of the path tag, so you can try the following example.

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Swap this portion for webdriver.Firefox() if you use Firefox
webdriver_service = Service("./chromedriver")
driver = webdriver.Chrome(service=webdriver_service)

driver.get('https://www.youtube.com/watch?v=09wcDevb1q4')
driver.maximize_window()
time.sleep(8)  # give the player time to render the heat map

# Parse the rendered page, not the raw HTTP response
soup = BeautifulSoup(driver.page_source, "html.parser")

# The curve is stored in the "d" attribute of the heat-map <path>
heat_map_path = soup.find("path", {"class": "ytp-heat-map-path"}).get('d')
print(heat_map_path)

Output:

M 0.0,100.0 C 1.0,87.5 2.0,42.2 5.0,37.6 C 8.0,33.1 11.0,69.1 15.0,77.3 C 19.0,85.6 21.0,79.1 25.0,78.8 C 29.0,78.5 
31.0,74.3 35.0,75.7 C 39.0,77.1 41.0,82.9 45.0,85.8 C 49.0,88.6 51.0,89.2 55.0,90.0 C 59.0,90.8 61.0,90.0 65.0,90.0 
C 69.0,90.0 71.0,90.0 75.0,90.0 C 79.0,90.0 81.0,90.0 85.0,90.0 C 89.0,90.0 91.0,90.0 95.0,90.0 C 99.0,90.0 101.0,90.0 105.0,90.0 C 109.0,90.0 111.0,90.0 115.0,90.0 C 119.0,90.0 121.0,90.0 125.0,90.0 C 129.0,90.0 131.0,90.0 135.0,90.0 C 139.0,90.0 141.0,90.0 145.0,90.0 C 149.0,90.0 151.0,90.0 155.0,90.0 C 159.0,90.0 161.0,90.0 165.0,90.0 C 169.0,90.0 171.0,90.7 175.0,90.0 C 179.0,89.3 181.0,88.3 185.0,86.6 C 189.0,84.9 191.0,82.1 195.0,81.7 C 199.0,81.3 201.0,84.4 205.0,84.6 C 209.0,84.8 211.0,83.9 215.0,82.6 C 219.0,81.4 221.0,79.8 225.0,78.3 C 229.0,76.7 231.0,73.5 235.0,74.9 C 239.0,76.3 241.0,82.4 245.0,85.1 C 249.0,87.9 251.0,87.9 255.0,88.8 C 259.0,89.8 261.0,89.5 265.0,89.7 C 269.0,89.9 271.0,90.1 275.0,90.0 C 279.0,89.9 281.0,89.3 285.0,89.1 C 289.0,89.0 291.0,89.1 295.0,89.3 C 299.0,89.5 301.0,89.9 305.0,90.0 C 309.0,90.1 311.0,90.9 315.0,89.8 C 319.0,88.6 321.0,84.5 325.0,84.3 C 329.0,84.1 331.0,87.5 335.0,88.6 C 339.0,89.8 341.0,91.0 345.0,90.0 C 349.0,89.0 351.0,85.9 355.0,83.8 C 359.0,81.7 361.0,78.8 365.0,79.5 C 369.0,80.2 371.0,85.2 375.0,87.3 C 379.0,89.4 381.0,89.5 385.0,90.0 C 389.0,90.5 391.0,90.0 395.0,90.0 C 399.0,90.0 401.0,90.2 405.0,89.9 C 409.0,89.5 411.0,88.5 415.0,88.4 C 419.0,88.3 421.0,89.1 425.0,89.5 C 429.0,89.8 431.0,89.9 435.0,90.0 C 439.0,90.1 441.0,90.0 445.0,90.0 C 449.0,90.0 451.0,91.3 455.0,90.0 C 459.0,88.7 461.0,87.1 465.0,83.4 C 
469.0,79.7 471.0,73.7 475.0,71.6 C 479.0,69.6 481.0,71.7 485.0,73.0 C 489.0,74.4 491.0,76.7 495.0,78.3 C 499.0,79.9 
501.0,80.9 505.0,80.9 C 509.0,80.9 511.0,77.9 515.0,78.3 C 519.0,78.8 521.0,81.3 525.0,83.2 C 529.0,85.2 531.0,86.7 
535.0,88.1 C 539.0,89.5 541.0,89.6 545.0,90.0 C 549.0,90.4 551.0,90.5 555.0,90.0 C 559.0,89.5 561.0,87.5 565.0,87.4 
C 569.0,87.2 571.0,88.7 575.0,89.2 C 579.0,89.8 581.0,89.8 585.0,90.0 C 589.0,90.2 591.0,90.1 595.0,90.0 C 599.0,89.9 601.0,89.5 605.0,89.5 C 609.0,89.5 611.0,89.9 615.0,90.0 C 619.0,90.1 621.0,90.0 625.0,90.0 C 629.0,90.0 631.0,90.0 635.0,90.0 C 639.0,90.0 641.0,90.6 645.0,90.0 C 649.0,89.4 651.0,87.7 655.0,87.2 C 659.0,86.7 661.0,86.8 665.0,87.3 C 669.0,87.9 671.0,89.5 675.0,90.0 C 679.0,90.5 681.0,90.4 685.0,90.0 C 689.0,89.6 691.0,89.1 695.0,88.1 C 699.0,87.1 701.0,86.5 705.0,85.0 C 709.0,83.5 711.0,81.4 715.0,80.5 C 719.0,79.6 721.0,80.6 725.0,80.5 C 729.0,80.4 731.0,80.5 735.0,80.0 C 739.0,79.5 741.0,78.3 745.0,78.2 C 749.0,78.1 751.0,78.8 755.0,79.5 C 759.0,80.2 761.0,79.7 765.0,81.8 C 769.0,83.9 771.0,88.4 775.0,90.0 C 779.0,91.6 781.0,90.0 785.0,90.0 C 789.0,90.0 791.0,90.0 795.0,90.0 C 799.0,90.0 801.0,90.0 805.0,90.0 C 809.0,90.0 811.0,90.0 815.0,90.0 C 819.0,90.0 821.0,90.3 825.0,90.0 C 829.0,89.7 831.0,90.8 835.0,88.7 C 839.0,86.6 841.0,82.7 845.0,79.5 C 849.0,76.4 851.0,74.5 855.0,73.0 C 859.0,71.5 861.0,72.3 865.0,72.0 C 869.0,71.6 871.0,70.4 875.0,71.1 C 879.0,71.8 881.0,77.0 885.0,75.6 C 889.0,74.2 891.0,74.0 895.0,64.3 C 899.0,54.6 901.0,39.9 905.0,27.1 C 909.0,14.2 911.0,-0.4 915.0,0.0 C 919.0,0.4 921.0,15.3 925.0,29.2 C 929.0,43.1 931.0,60.0 935.0,69.6 C 939.0,79.3 941.0,75.1 945.0,77.5 C 949.0,79.8 951.0,79.9 955.0,81.3 C 959.0,82.6 961.0,82.4 965.0,84.1 C 969.0,85.8 971.0,88.8 975.0,90.0 C 979.0,91.2 981.0,90.4 985.0,90.0 C 989.0,89.6 992.0,88.2 995.0,87.8 C 998.0,87.3 999.0,85.3 1000.0,87.8 C 1001.0,90.2 1000.0,97.6 1000.0,100.0
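
If you need numbers rather than the raw path string, the d value can be split into points. The following is only a minimal sketch. It assumes the path contains nothing but M and C commands, as in the output above, and that y = 100 is the baseline and y = 0 the highest peak. That scale is inferred from the sample output, not from any YouTube documentation. The function names parse_heatmap_path and to_intensity are made up for this example.

```python
import re

# One "x,y" coordinate pair, e.g. "915.0,0.0" or "911.0,-0.4"
PAIR = re.compile(r"(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")


def parse_heatmap_path(d):
    """Return the (x, y) end point of every M/C command in an SVG path string."""
    points = []
    # A command letter is followed by one pair (M) or three pairs (C)
    for _command, args in re.findall(r"([MC])([^MC]*)", d):
        pairs = PAIR.findall(args)
        if not pairs:
            continue
        x, y = pairs[-1]  # the last pair is the point the segment ends on
        points.append((float(x), float(y)))
    return points


def to_intensity(points, height=100.0):
    """Flip the y axis: SVG y grows downward, so a small y is a tall peak."""
    return [(x, height - y) for x, y in points]


# First few segments of the output shown above
sample = "M 0.0,100.0 C 1.0,87.5 2.0,42.2 5.0,37.6 C 8.0,33.1 11.0,69.1 15.0,77.3"
for x, intensity in to_intensity(parse_heatmap_path(sample)):
    print(x, intensity)
```

Only the end point of each curve segment is kept; the two control points before it shape the curve but are not samples of the graph themselves.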
Answered By: Fazlul

I want to be able to have a reliable source of information

Note that web scraping on your own won’t be more stable than my open-source API that you are referring to. I guess the stability issue you mention is that, when web scraping is abused, YouTube’s servers temporarily suspend your ability to retrieve the most replayed data.

As far as I know, nobody running their own instance of my API for private use has faced this issue. So I guess you only used the official instance of my API, which, because of its many users, puts an abusive load on YouTube’s UI servers and so is regularly suspended.

So the solutions are:

  • Try your own private instance of my API.
  • Otherwise, parse the ytInitialData JavaScript variable in the HTML directly, as I do in my API; that way you don’t need a JavaScript interpreter such as Selenium.
Answered By: Benjamin Loison

The way to get it through the ytInitialData JavaScript variable in the HTML:

import json
import re
import requests
from bs4 import BeautifulSoup as BS
url = "https://www.youtube.com/watch?v=09wcDevb1q4"  # video from the question
soup = BS(requests.get(url).text, "html.parser")
data = json.loads(re.search(r"var ytInitialData = ({.*?});", soup.prettify()).group(1))
print(data['playerOverlays']['playerOverlayRenderer']['decoratedPlayerBarRenderer']['decoratedPlayerBarRenderer']['playerBar']['multiMarkersPlayerBarRenderer']['markersMap'])
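
A slightly more defensive variant of the snippet above, as a sketch only. The non-greedy regex stops at the first "};" it meets, which could in principle fall inside a JSON string, and the long chain of keys raises KeyError on videos without a heat map. Here json.JSONDecoder().raw_decode reads exactly one JSON value starting at a given index, and the key path (copied from the snippet above) is walked one step at a time. The helper names extract_initial_data and get_markers_map and the sample HTML are invented for illustration; the real page layout may differ or change.

```python
import json

MARKER = "var ytInitialData = "

# Key path taken from the snippet above
KEY_PATH = [
    "playerOverlays",
    "playerOverlayRenderer",
    "decoratedPlayerBarRenderer",
    "decoratedPlayerBarRenderer",
    "playerBar",
    "multiMarkersPlayerBarRenderer",
    "markersMap",
]


def extract_initial_data(html):
    """Return the ytInitialData object found in the HTML, or None if absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # raw_decode parses one complete JSON value and ignores what follows it
    data, _end = json.JSONDecoder().raw_decode(html, start + len(MARKER))
    return data


def get_markers_map(data):
    """Walk KEY_PATH and return markersMap, or None if any key is missing."""
    node = data
    for key in KEY_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up HTML: the "};" inside the string would cut a non-greedy regex short
sample_html = '<script>var ytInitialData = {"playerOverlays": {"note": "a }; here"}};</script>'
data = extract_initial_data(sample_html)
print(data["playerOverlays"]["note"])
print(get_markers_map(data))  # no heat map in this made-up page
```

With a real page you would pass requests.get(url).text to extract_initial_data and treat a None result from get_markers_map as "this video has no most replayed data".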
Answered By: Kevin M.