Scrapy Unable to scrape API

Question:

I am trying to crawl an API with Scrapy from this link.

The API request I am making would give me everything I need, but I cannot load the response as JSON, so I cannot proceed any further.

The code looks long, but only because of the headers and cookies. Please suggest how I can improve it and find a solution.

Here is my Scrapy code:

from datetime import datetime
import json
from urllib.parse import urlencode

import scrapy
from bs4 import BeautifulSoup
from liveshare.items import AGMSpiderItems


class SubIndexSpider(scrapy.Spider):
    name = "subindexes"

    def start_requests(self):
        headers = {
            'authority': 'merolagani.com',
            'accept': 'application/json, text/javascript, */*; q=0.01',
            'accept-language': 'en-GB,en;q=0.9,en-US;q=0.8,ne;q=0.7,ru;q=0.6',
            'cache-control': 'no-cache',
            # 'cookie': 'ASP.NET_SessionId=bbjd1loebaad4ha2qwwxdcfp; _ga=GA1.2.810096005.1667463342; _gid=GA1.2.1263273763.1673850832; _gat=1; __atuvc=4%7C3; __atuvs=63c4efd0a14c6c9b003',
            'pragma': 'no-cache',
            'referer': 'https://merolagani.com/MarketSummary.aspx',
            'sec-ch-ua': '"Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"',
            'sec-ch-ua-mobile': '?0',
            'sec-ch-ua-platform': '"Linux"',
            'sec-fetch-dest': 'empty',
            'sec-fetch-mode': 'cors',
            'sec-fetch-site': 'same-origin',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
            'x-requested-with': 'XMLHttpRequest',
        }
        params = {
            'type': 'market_summary',
        }

        cookies = {
            'ASP.NET_SessionId': 'bbjd1loebaad4ha2qwwxdcfp',
            '_ga': 'GA1.2.810096005.1667463342',
            '_gid': 'GA1.2.1263273763.1673850832',
            '_gat': '1',
            '__atuvc': '4%7C3',
            '__atuvs': '63c4efd0a14c6c9b003',
        }
        api_url = f'https://merolagani.com/handlers/webrequesthandler.ashx{urlencode(params)}'
        yield scrapy.Request(
            url=api_url,
            method='GET',
            headers=headers,
            cookies=cookies,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        print(response.headers)
        print(response.body)
        json_response = json.loads(response.body)
        print(json_response)

But I am getting a JSON decode error and I can’t figure out the cause.

Error traceback:

  File "C:\Users\Navar\AppData\Local\Programs\Python\Python39\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 3 column 1 (char 4)
Asked By: astro geek


Answers:

I ran a simplified version of your code and got no errors; the JSON data is returned successfully.

Code:

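import json

import requests

# Note the "?" separating the handler path from the query string.
url_api = "https://merolagani.com/handlers/webrequesthandler.ashx?type=market_summary"

# The only headers sent in this simplified version.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
    "Referer": "https://merolagani.com"
}

page = requests.get(url_api, headers=headers)
js_data = json.loads(page.text)  # parse the response text as JSON
print(js_data)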

Check the JSON result here at anotepad.com

The error most likely comes from the response your spider receives, i.e. the response body is NOT valid JSON.
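One difference between the two versions that may explain this: the f-string in the question appends the encoded parameters directly to the handler path, with no `?` separator. The sketch below rebuilds both URLs using only the values from the question. What the server actually returns for the malformed URL is not something I have checked, so treat this as a likely cause rather than a confirmed one.

```python
from urllib.parse import urlencode

base_url = "https://merolagani.com/handlers/webrequesthandler.ashx"
params = {"type": "market_summary"}

# What the question's f-string produces: no "?" between path and query.
broken_url = f"{base_url}{urlencode(params)}"

# Adding the separator gives the URL used in the simplified code above.
fixed_url = f"{base_url}?{urlencode(params)}"

print(broken_url)
print(fixed_url)
```

Printing both makes the missing separator easy to spot; in the spider, the same `?` would go into the `api_url` f-string.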

It looks like you’re making a GET request to the API endpoint, including the necessary headers and cookies, and trying to parse the response as JSON. There are a few things you can try to improve your code:

  1. Check the response status code to ensure that the request is
    successful. If the status code is not 200, then something may be
    wrong with your request and you should check the URL, headers and
    cookies again to make sure they are correct.
  2. Make sure that the API is actually returning a JSON response. You
    can check the Content-Type header in the response to see what type
    of data the API is returning.
  3. If the API is returning a JSON response, but you are still unable to
    parse it, print the raw body and inspect it before passing it to
    json.loads() from the json library.
  4. Check the API documentation to see if it requires any other
    parameters in the request.
  5. Check the API documentation to see if there are any usage limits or
    restrictions that may be preventing you from getting the desired
    data.
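The status, Content-Type and parsing checks from the list above can be folded into one small helper. This is only a sketch: `decode_json_body` is a hypothetical name, not part of Scrapy or requests, and it takes plain values so it can be tried without a network call. In a Scrapy callback you would presumably pass `response.status`, `response.headers.get("Content-Type")` and `response.body`.

```python
import json


def decode_json_body(status, content_type, body):
    """Parse body as JSON; on failure, raise ValueError with diagnostics."""
    # Header values and bodies may arrive as bytes; normalise to str.
    if isinstance(content_type, bytes):
        content_type = content_type.decode("latin-1")
    text = body.decode("utf-8", errors="replace") if isinstance(body, bytes) else body
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        # Report what the server actually sent instead of a bare decode error.
        raise ValueError(
            f"not JSON: status={status}, Content-Type={content_type!r}, "
            f"body starts with {text[:80]!r}"
        ) from exc


# Example with made-up sample data (not a real server response):
print(decode_json_body(200, b"application/json", b'{"ok": true}'))
```

The Content-Type is reported rather than enforced, because some servers send JSON under a generic type; the error message then shows the status and the first characters of the body, which is usually enough to tell an HTML error page from JSON.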

If the API returns JSON as a string, you can use json.loads(response.text) instead of json.loads(response.body). In Scrapy 2.2 and later, response.json() does the same in one call.

If none of the above steps work, then you can try to contact the API provider for support.

Keep in mind that scraping an API without permission can violate its terms of service, so check the API’s terms of service before scraping.

Answered By: Hussam