Scraping High Charts Data – Trouble Scraping All Data Points in Chart

Question:

I’m trying to scrape data from a High Charts table from the following URL:

https://www.pricecharting.com/game/pal-nes/legend-of-zelda

The following code will grab the data from the default chart data labelled (Loose), but I need to extract data from the other items listed on the chart. (CIB, New, Graded, Boxed, Manual). I’m really stuck and not sure how to get the information from the other charts.

Here’s my code which works for ‘Loose’ only (The default option).

import time

import pandas as pd
from selenium.webdriver.chrome.service import Service
from selenium import webdriver

service = Service(executable_path="/driver_selenium/geckodriver.exe")
driver = webdriver.Firefox(service=service)

website = "https://www.pricecharting.com/game/pal-nes/legend-of-zelda#completed-auctions-graded"


driver.get(website)
time.sleep(5)

temp = driver.execute_script('return window.Highcharts.charts[0]'
                             '.series[0].options.data')

data = [item[1] for item in temp]
print(data)

Bonus points if there’s a better way to extract the data which does not use Selenium, something similar to the answer here: Scrape highchart into python

Asked By: Lee Roy

||

Answers:

EDIT: We can try the following:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import json

url = 'https://www.pricecharting.com/game/pal-nes/legend-of-zelda'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
data_script = soup.find('script', string=re.compile("VGPC.chart_data = {"))
# print(data_script.text.split('VGPC.chart_data = ')[1].split(' VGPC.product = {')[0].split(';')[0].strip())
data = json.loads(data_script.text.split('VGPC.chart_data = ')[1].split(' VGPC.product = {')[0].split(';')[0].strip())
df = pd.DataFrame(data)
print(df)

Which returns:

boxonly cib graded  manualonly  new used
0   [1498888800000, 0]  [1498888800000, 17004]  [1498888800000, 0]  [1498888800000, 0]  [1498888800000, 0]  [1498888800000, 6510]
1   [1501567200000, 0]  [1501567200000, 7878]   [1501567200000, 0]  [1501567200000, 0]  [1501567200000, 23100]  [1501567200000, 5409]
2   [1504245600000, 0]  [1504245600000, 8879]   [1504245600000, 0]  [1504245600000, 0]  [1504245600000, 23100]  [1504245600000, 5169]
3   [1506837600000, 0]  [1506837600000, 8432]   [1506837600000, 0]  [1506837600000, 0]  [1506837600000, 37665]  [1506837600000, 4499]
4   [1509516000000, 0]  [1509516000000, 9286]   [1509516000000, 0]  [1509516000000, 0]  [1509516000000, 37665]  [1509516000000, 4513]
... ... ... ... ... ... ...

This dataframe contains 6 columns with historical prices, which are being used in that Highchart graph (prices appears to be in cents). This method avoids the overheads of selenium/chromedriver.

We can now look at individual line charts, for example at ‘new’:

df_new = pd.DataFrame(data['new'], columns = ['Date_time', 'Price'])
df_new['Date_time'] = pd.to_datetime(df_new['Date_time'], unit="ms")
print(df_new)

And the result is:

Date_time   Price
0   2017-07-01 06:00:00 0
1   2017-08-01 06:00:00 23100
2   2017-09-01 06:00:00 23100
3   2017-10-01 06:00:00 37665
4   2017-11-01 06:00:00 37665
... ... ...
Answered By: platipus_on_fire
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.