Scraping data off Morningstar – Portfolio Screen

Question:

I am trying to scrape data from this link: https://www.morningstar.com/funds/xnas/gibix/portfolio. Basically I want all the data I can get, but in particular the Fixed Income Style table and the Exposure, Bond Breakdown table.

Here is my code:

import requests
link = 'https://api-global.morningstar.com/sal-service/v1/fund/portfolio/holding/v2/F00000MUR2/data'

headers = {
    'apikey': 'lstzFDEOhfFNMLikKa0am9mgEKLBl49T',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36'
}

payload = {
    'premiumNum': '1000',
    'freeNum': '1000',
    'languageId': 'en',
    'locale': 'en',
    'clientId': 'MDC',
    #'benchmarkId': 'mstarorcat',
    'benchmarkId': 'category',
    'component': 'sal-components-mip-holdings',
    'version': '3.59.1'
}

with requests.Session() as s:
    s.headers.update(headers)
    resp = s.get(link, params=payload)
    container = resp.json()

The above code is what I have for scraping the holdings data at the bottom of the page, but I am having trouble figuring out what the 'component' field in my payload should be. I have even tried 'sal-components-fixed-income-exposure-analysis', but to no avail.
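For reference, here is roughly how I have been probing candidate component values, reusing the link, headers, and payload defined above. The names in the list are guesses patterned after 'sal-components-mip-holdings'; I have not confirmed that any fixed-income component actually exists on this endpoint.

# Probe guessed component values against the same endpoint and print the
# status code, so it is easy to see whether any of them returns data.
candidates = [
    'sal-components-mip-holdings',                    # works for holdings
    'sal-components-fixed-income-exposure-analysis',  # guess; did not work for me
]

with requests.Session() as s:
    s.headers.update(headers)
    for component in candidates:
        params = dict(payload, component=component)
        resp = s.get(link, params=params)
        print(component, resp.status_code)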

Asked By: research51711


Answers:

What you are doing is not web scraping, but an API request. There is probably a way to get the data you want through the API, but you may have to discover it from their docs: https://developer.morningstar.com/developer-resources/api-visualization-library/about

But I can provide you with a code snippet for actually scraping the data from this page:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from time import sleep
import pandas as pd


url = 'https://www.morningstar.com/funds/xnas/gibix/portfolio'

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

with webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                      options=options) as driver:
    driver.get(url)
    sleep(10)  # give the page's JavaScript time to render the tables
    html = driver.page_source

tables = pd.read_html(html)  # parsing the tables requires the lxml module

"tables" here is a list of dataframes from every table found in the page when fully loaded.

To install the lxml module, just run pip install lxml.

PS: I tried getting the HTML with a plain requests response, but it returns a different page; it looks like you have to open the page in a browser and wait until it is fully loaded to get the correct source HTML.
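If you want something more robust than a fixed sleep, here is a sketch using Selenium's explicit waits to block until at least one table element has rendered. The 30-second timeout and the tag-based wait condition are my assumptions, not something the page guarantees.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd

url = 'https://www.morningstar.com/funds/xnas/gibix/portfolio'

options = webdriver.ChromeOptions()
options.add_argument('--headless=new')

with webdriver.Chrome(service=Service(ChromeDriverManager().install()),
                      options=options) as driver:
    driver.get(url)
    # Block until at least one <table> is present instead of sleeping for a
    # fixed 10 seconds; raises TimeoutException if nothing appears in 30 s.
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table'))
    )
    html = driver.page_source

tables = pd.read_html(html)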

Answered By: Arthur Querido