Cannot scrape dataid from Morningstar – How can I access the Network inspection tool from Python?

Question:

I’m trying to scrape Morningstar.com to get financial data and prices of each fund available on the website. Fortunately I have no problem at scraping financial data (holdings, asset allocation, portfolio, risk, etc.), but when it comes to find the URL that hosts the daily prices in JSON format for each fund, there is a "dataid" value that is not available in the HTML code and without it there is no way to know the exact URL that hosts all the prices.

enter image description here

I have tried to print the whole page as text for many funds, and none of them show in the HTML code the "dataid" value that I need in order to get the prices. The URL that hosts the prices also includes the "secid", which is scrapeable very easily but has no relationship at all with the "dataid" that I need to scrape.

enter image description here

import requests
from lxml import html
import re
import json

quote_page = "https://www.morningstar.com/etfs/arcx/aadr/quote.html"
prices1 = "https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids="
prices2 = "&dataid="
prices3 = "&startdate="
prices4 = "&enddate="
starting_date = "2018-01-01"
ending_date = "2018-12-28"

quote_html = requests.get(quote_page, timeout=10)
quote_tree = html.fromstring(quote_html.text)
security_id = re.findall('''meta name=['"]secId['"]s*content=['"](.*?)['"]''', quote_html.text)[0]
security_type = re.findall('''meta name=['"]securityType['"]s*content=['"](.*?)['"]''', quote_html.text)[0]

data_id = "8225"

daily_prices_url = prices1 + security_id + ";" + security_type + prices2 + data_id + prices3 + starting_date + prices4 + ending_date
daily_prices_html = requests.get(daily_prices_url, timeout=10)
json_prices = daily_prices_html.json()
for json_price in json_prices["data"]["r"]:
    j_prices = json_price["t"]
    for j_price in j_prices:
        daily_prices = j_price["d"]
        for daily_price in daily_prices:
            print(daily_price["i"] + " || " + daily_price["v"])

The code above works for the "AADR" ETF only because I copied and pasted the "dataid" value manually in the "data_id" variable, and without this piece of information there is no way to access the daily prices. I would not like to use Selenium as alternative to find the "dataid" because it is a very slow tool and my intention is to scrape data for more than 28k funds, so I have tried only robot web-scraping methods.
Do you have any suggestion on how to access the Network inspection tool, which is the only source I have found so far that shows the "dataid"?
Thanks in advance

Asked By: Stefano Leone

||

Answers:

The data id may not be that important. I varied the code F00000412E that is associated with AADR whilst keeping the data id constant.

I got a list of all those codes from here:

https://www.firstrade.com/scripts/free_etfs/io.php

Then add the code of choice into your url e.g.

[
    "AIA",
    "iShares Asia 50 ETF",
    "FOUSA06MPQ"
  ]

Use FOUSA06MPQ

https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=FOUSA06MPQ;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30

You can verify the values by adding the other fund as a benchmark to your chart e.g. XNAS:AIA

enter image description here

28th december has value of 55.32. Compare this with JSON retrieved:

I repeated this with

[
    "ALD",
    "WisdomTree Asia Local Debt ETF",
    "F00000M8TW"
  ]

https://mschart.morningstar.com/chartweb/defaultChart?type=getcc&secids=F00000M8TW;FE&dataid=8225&startdate=2017-01-01&enddate=2018-12-30
Answered By: QHarr

dataId 8217 works well for me, irrespective of the security.

Answered By: siamond
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.