Why does my program to scrape the NSE website get blocked on servers but work locally?
Question:
This Python code runs on my local computer but does not run on
- DigitalOcean
- Amazon AWS
- Google Colab
- Heroku
or many other VPS providers. It fails with different errors at different times.
import requests

headers = {
    'authority': 'beta.nseindia.com',
    'cache-control': 'max-age=0',
    'dnt': '1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36',
    'sec-fetch-user': '?1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'none',
    'sec-fetch-mode': 'navigate',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,hi;q=0.8',
}

params = (
    ('symbol', 'BANKNIFTY'),
)

response = requests.get('https://nseindia.com/api/quote-derivative', headers=headers, params=params)

# NB. Original query string below. It seems impossible to parse and
# reproduce query strings 100% accurately so the one below is given
# in case the reproduced version is not "correct".
# response = requests.get('https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY', headers=headers)
Is there any mistake in the above code? What am I missing? I copied the header data from Chrome Developer Tools > Network in incognito mode and used https://curl.trillworks.com/ to generate the Python code from the curl command.
But the curl command itself works fine and gives the expected output:
curl "https://nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8" --compressed
Why does the curl command work when the Python code generated from it does not?
Answers:
I ran into the same problem. I do not know the proper solution using the python-requests module; there is a high chance NSE just blocks it.
Here is a workaround in Python that works. It looks crude, but I am using it without having dug any deeper:
import os
import subprocess

# Write the output file next to this script.
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# subprocess.run waits for curl to finish; Popen returns immediately,
# so the file could be read before curl has written it.
subprocess.run('curl "https://www.nseindia.com/api/quote-derivative?symbol=BANKNIFTY" -H "authority: beta.nseindia.com" -H "cache-control: max-age=0" -H "dnt: 1" -H "upgrade-insecure-requests: 1" -H "user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36" -H "sec-fetch-user: ?1" -H "accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9" -H "sec-fetch-site: none" -H "sec-fetch-mode: navigate" -H "accept-encoding: gzip, deflate, br" -H "accept-language: en-US,en;q=0.9,hi;q=0.8" --compressed -o maxpain.txt', shell=True, check=True)
with open("maxpain.txt", "r") as f:
    var = f.read()
print(var)
It simply runs the curl command, writes the output to a file, and reads the file back. That's it.
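A variant of the same workaround, sketched under the assumption that curl is available on the PATH: passing the arguments as a list avoids shell quoting, and capturing stdout avoids the temporary file. The names build_curl_command and fetch_with_curl are hypothetical helpers, not part of the original answer.

```python
import subprocess


def build_curl_command(url, headers):
    """Return the curl argument list for url with the given headers."""
    cmd = ["curl", "--silent", "--compressed", url]
    for name, value in headers.items():
        cmd += ["-H", f"{name}: {value}"]
    return cmd


def fetch_with_curl(url, headers, timeout=30):
    """Run curl, wait for it to finish, and return the response body as text."""
    result = subprocess.run(
        build_curl_command(url, headers),
        capture_output=True,
        text=True,
        check=True,       # raise CalledProcessError if curl exits non-zero
        timeout=timeout,  # give up if curl hangs
    )
    return result.stdout
```

Calling fetch_with_curl with the quote-derivative URL and the headers dict from the question should give the same text the file-based version prints, provided curl itself is not blocked from that machine.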
There are two things to note.
- The request headers need to include 'Host' and 'User-Agent':
__request_headers = {
    'Host': 'www.nseindia.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}
- The following cookies are set dynamically, so they have to be fetched first and then sent with the API request:
    'nsit',
    'nseappid',
    'ak_bmsc'
NSE sets these depending on the functionality being used.
Example: top gainers / losers.
I tried to fetch the top gainers and losers list; the request is blocked without these cookies.
import logging
import requests

logger = logging.getLogger(__name__)

try:
    nse_url = 'https://www.nseindia.com/market-data/top-gainers-loosers'
    url = 'https://www.nseindia.com/api/live-analysis-variations?index=gainers'
    # Load the regular page first so that NSE sets the cookies.
    resp = requests.get(url=nse_url, headers=__request_headers)
    if resp.ok:
        req_cookies = dict(nsit=resp.cookies['nsit'], nseappid=resp.cookies['nseappid'], ak_bmsc=resp.cookies['ak_bmsc'])
        # Send those cookies along with the API request.
        tresp = requests.get(url=url, headers=__request_headers, cookies=req_cookies)
        result = tresp.json()
        res_data = result["NIFTY"]["data"] if "NIFTY" in result and "data" in result["NIFTY"] else []
        if res_data is not None and len(res_data) > 0:
            __top_list = res_data
# KeyError: an expected cookie was not set; ValueError: the body was not JSON.
except (OSError, KeyError, ValueError) as err:
    logger.error('Unable to fetch data: %s', err)
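The cookie handling above can also be left to requests.Session, which stores the cookies from each response and sends them on later requests. This is a minimal sketch, assuming the landing page sets the cookies that the API checks. make_session and fetch_json are hypothetical names, and this does nothing about IP-based blocking.

```python
import requests


def make_session(headers):
    """Create a session that sends the same browser-like headers on every request."""
    session = requests.Session()
    session.headers.update(headers)
    return session


def fetch_json(session, landing_url, api_url, timeout=10):
    """Visit landing_url to collect cookies, then request api_url with them."""
    # The session keeps whatever cookies this response sets.
    landing = session.get(landing_url, timeout=timeout)
    landing.raise_for_status()
    # Cookies collected above are attached automatically here.
    api = session.get(api_url, timeout=timeout)
    api.raise_for_status()
    return api.json()
```

With the names from the answer, the call would look like fetch_json(make_session(__request_headers), nse_url, url). Because the session forwards every cookie it receives, the three cookie names no longer need to be listed by hand.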
Another thing to note is that NSE blocks these requests from most cloud VMs, such as AWS and GCP. I was able to fetch the data from a personal Windows machine, but not from AWS or GCP.
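To tell a block apart from a parsing problem when the same script runs locally and on a cloud VM, one option is to record how each response failed. This is a small sketch, assuming a blocked request shows up either as a non-2xx status or as a body that is not JSON. The thread does not say which of these NSE returns, and describe_response is a hypothetical helper.

```python
import json


def describe_response(status_code, body):
    """Return 'ok', 'http-error', or 'not-json' for a response."""
    if not 200 <= status_code < 300:
        return "http-error"
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return "not-json"
    return "ok"
```

Logging describe_response(resp.status_code, resp.text) on both machines shows whether the server refused the request outright or answered with something other than the expected JSON.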
I need help with the same kind of project…