Why does my program that scrapes the NSE website get blocked on servers but work locally?

Question:

This Python code runs on my local computer but fails on

  1. Digital Ocean
  2. Amazon AWS
  3. Google Collab
  4. Heroku

and many other VPS providers. It shows different errors at different times.

import requests

headers = {
    'authority': 'beta.nseindia.com',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'sec-fetch-user': '?1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

params = (
    ('symbol', 'BANKNIFTY'),
)

response = requests.get('https://nseindia.com/api/quote-derivative', headers=headers, params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY', headers=headers)

Is there any mistake in the above code? What am I missing? I copied the header data from Chrome Developer Tools > Network in incognito mode and used https://curl.trillworks.com/ to generate the Python code from the curl command.

But the curl command itself works fine and gives the expected output:

curl "https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8"  --compressed

How come the curl command works, but the Python code generated from that same curl command does not?

Asked By: user12713281


Answers:

I stumbled into the same problem. I do not know a proper solution using the python-requests module; there is a high chance NSE simply blocks it.

So here is a workaround that does the job. It looks lame, but I'm using it without digging deeper:

import subprocess
import os

# Run from the script's own directory so the output file lands next to it.
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# subprocess.run (unlike Popen) waits for curl to finish before we read the file back.
subprocess.run('curl "https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8" --compressed -o maxpain.txt', shell=True)

with open("maxpain.txt", "r") as f:
    var = f.read()
print(var)

It simply runs the curl command, sends the output to a file, and reads the file back. That's it.
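A variation on the same idea, assuming curl is available on the PATH: capture curl's stdout directly with subprocess.run instead of going through a temporary file. The helpers `build_curl_cmd` and `fetch_via_curl` are illustrative names, not from the original answer:

```python
import subprocess

def build_curl_cmd(url, headers):
    """Turn a URL and a headers dict into a curl argument list."""
    cmd = ["curl", "--silent", "--compressed", url]
    for name, value in headers.items():
        cmd += ["-H", f"{name}: {value}"]
    return cmd

def fetch_via_curl(url, headers):
    """Run curl and return the response body captured from stdout."""
    result = subprocess.run(build_curl_cmd(url, headers),
                            capture_output=True, text=True, check=True)
    return result.stdout

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/79.0.3945.117 Safari/537.36",
    "accept": "*/*",
}

# Usage (requires network access):
# body = fetch_via_curl("https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY", headers)
```

Passing the arguments as a list also avoids `shell=True` and any quoting headaches in the header values.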

Answered By: Amit Ghosh

There are two things to be noted.

  1. The request headers need to include 'Host' and 'User-Agent':
__request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
  2. The following cookies are set dynamically and need to be fetched and sent along with the request:
'nsit',
'nseappid',
'ak_bmsc'

These are set by NSE based on the functionality being used.
The example below fetches the top gainers/losers list; without these cookies, the request is blocked.

import logging
import requests

logger = logging.getLogger(__name__)

try:
    nse_url = 'https://www.nseindia.com/market-data/top-gainers-loosers'
    url = 'https://www.nseindia.com/api/live-analysis-variations?index=gainers'
    # First request: load the page so NSE sets the dynamic cookies.
    resp = requests.get(url=nse_url, headers=__request_headers)
    if resp.ok:
        req_cookies = dict(nsit=resp.cookies['nsit'],
                           nseappid=resp.cookies['nseappid'],
                           ak_bmsc=resp.cookies['ak_bmsc'])
        # Second request: call the API with those cookies attached.
        tresp = requests.get(url=url, headers=__request_headers, cookies=req_cookies)
        result = tresp.json()
        res_data = result["NIFTY"]["data"] if "NIFTY" in result and "data" in result["NIFTY"] else []
        if res_data:
            __top_list = res_data
except OSError as err:
    logger.error('Unable to fetch data')
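The two-step cookie handshake above can also be handled with a requests.Session, which stores the cookies from the first response and resends them automatically. A minimal sketch, assuming the same URLs and a trimmed-down version of the header dict; `fetch_gainers` is an illustrative name:

```python
import requests

# Trimmed-down version of the header dict from this answer.
request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) '
                  'Gecko/20100101 Firefox/82.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
}

def fetch_gainers():
    """Visit the page URL first so NSE sets nsit/nseappid/ak_bmsc,
    then call the API; the Session resends those cookies automatically."""
    with requests.Session() as s:
        s.headers.update(request_headers)
        # Warm-up request: the response cookies are stored in s.cookies.
        s.get('https://www.nseindia.com/market-data/top-gainers-loosers')
        # API request: the stored cookies ride along automatically.
        resp = s.get('https://www.nseindia.com/api/live-analysis-variations?index=gainers')
        resp.raise_for_status()
        return resp.json()

# Usage (requires network access, and NSE may still block cloud IPs):
# data = fetch_gainers()
```

This avoids copying individual cookies by hand, and any new cookie NSE adds later is picked up without code changes.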

Another thing to note is that NSE blocks these requests from most cloud VMs, such as AWS and GCP. I was able to fetch the data from a personal Windows machine, but not from AWS or GCP.

Answered By: mohu