Why does my program that scrapes the NSE website get blocked on servers but work locally?

Question:

This Python code runs on my local computer but fails on

  1. Digital Ocean
  2. Amazon AWS
  3. Google Collab
  4. Heroku

and many other VPS providers. It shows different errors at different times.

import requests

headers = {
    'authority': 'beta.nseindia.com',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'sec-fetch-user': '?1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

params = (
    ('symbol', 'BANKNIFTY'),
)

response = requests.get('https://nseindia.com/api/quote-derivative', headers=headers, params=params)

#NB. Original query string below. It seems impossible to parse and
#reproduce query strings 100% accurately so the one below is given
#in case the reproduced version is not "correct".
# response = requests.get('https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY', headers=headers)

Is there any mistake in the above code? What am I missing? I copied the header data from Chrome Developer Tools > Network in incognito mode and used https://curl.trillworks.com/ to generate the Python code from the curl command.

But the curl command itself works fine and gives the expected output:

curl "https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8"  --compressed

How come the curl command works, but the Python code generated from that same curl command does not?

Asked By: user12713281


Answers:

I stumbled into the same problem. I do not know a proper solution using the python-requests module; there is a high chance NSE simply blocks it.

So here is a workaround that does the job. It looks lame, but I'm using it without digging deeper:

import subprocess
import os

# Run from the script's own directory so the output file lands next to it.
os.chdir(os.path.dirname(os.path.abspath(__file__)))

# subprocess.run (unlike Popen) waits for curl to finish before we read the file back.
subprocess.run('curl "https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8" --compressed -o maxpain.txt', shell=True)

with open("maxpain.txt", "r") as f:
    var = f.read()
print(var)

It simply runs the curl command, sends the output to a file, and reads the file back. That's it.
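A variation on the same idea, assuming curl is available on the PATH: capture curl's stdout directly with subprocess.run instead of going through a temporary file. The helpers `build_curl_cmd` and `fetch_via_curl` are illustrative names, not from the original answer:

```python
import subprocess

def build_curl_cmd(url, headers):
    """Turn a URL and a headers dict into a curl argument list."""
    cmd = ["curl", "--silent", "--compressed", url]
    for name, value in headers.items():
        cmd += ["-H", f"{name}: {value}"]
    return cmd

def fetch_via_curl(url, headers):
    """Run curl and return the response body captured from stdout."""
    result = subprocess.run(build_curl_cmd(url, headers),
                            capture_output=True, text=True, check=True)
    return result.stdout

headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/79.0.3945.117 Safari/537.36",
    "accept": "*/*",
}

# Usage (requires network access):
# body = fetch_via_curl("https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY", headers)
```

Passing the arguments as a list also avoids `shell=True` and any quoting headaches in the header values.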

Answered By: Amit Ghosh

There are two things to be noted.

  1. The request headers need to include 'Host' and 'User-Agent':
__request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
  2. The following cookies are set dynamically and need to be fetched and sent along with the request:
'nsit',
'nseappid',
'ak_bmsc'

These are set by NSE based on the functionality being used.
The example below fetches the top gainers/losers list; without these cookies, the request is blocked.

import logging
import requests

logger = logging.getLogger(__name__)

try:
    nse_url = 'https://www.nseindia.com/market-data/top-gainers-loosers'
    url = 'https://www.nseindia.com/api/live-analysis-variations?index=gainers'
    # First request: load the page so NSE sets the dynamic cookies.
    resp = requests.get(url=nse_url, headers=__request_headers)
    if resp.ok:
        req_cookies = dict(nsit=resp.cookies['nsit'],
                           nseappid=resp.cookies['nseappid'],
                           ak_bmsc=resp.cookies['ak_bmsc'])
        # Second request: call the API with those cookies attached.
        tresp = requests.get(url=url, headers=__request_headers, cookies=req_cookies)
        result = tresp.json()
        res_data = result["NIFTY"]["data"] if "NIFTY" in result and "data" in result["NIFTY"] else []
        if res_data:
            __top_list = res_data
except OSError as err:
    logger.error('Unable to fetch data')
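The two-step cookie handshake above can also be handled with a requests.Session, which stores the cookies from the first response and resends them automatically. A minimal sketch, assuming the same URLs and a trimmed-down version of the header dict; `fetch_gainers` is an illustrative name:

```python
import requests

# Trimmed-down version of the header dict from this answer.
request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) '
                  'Gecko/20100101 Firefox/82.0',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.5',
}

def fetch_gainers():
    """Visit the page URL first so NSE sets nsit/nseappid/ak_bmsc,
    then call the API; the Session resends those cookies automatically."""
    with requests.Session() as s:
        s.headers.update(request_headers)
        # Warm-up request: the response cookies are stored in s.cookies.
        s.get('https://www.nseindia.com/market-data/top-gainers-loosers')
        # API request: the stored cookies ride along automatically.
        resp = s.get('https://www.nseindia.com/api/live-analysis-variations?index=gainers')
        resp.raise_for_status()
        return resp.json()

# Usage (requires network access, and NSE may still block cloud IPs):
# data = fetch_gainers()
```

This avoids copying individual cookies by hand, and any new cookie NSE adds later is picked up without code changes.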

Another thing to note is that NSE blocks these requests from most cloud VMs, such as AWS and GCP. I was able to fetch the data from a personal Windows machine, but not from AWS or GCP.

Answered By: mohu