Scraping data from CME
Question:
I am trying to scrape data from the CME exchange:
https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021
I have the following code snippet:
import json  # needed for json.loads below
import requests as r

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
header = {'User-Agent': user_agent}
link = 'https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021'
page = r.get(link, headers=header)
raw_json = json.loads(page.text)
While it works perfectly well on my local computer, it hangs completely on remote hosting servers (DigitalOcean, Hetzner). I have also tried to curl the URL, but it times out without additional details.
Do I need to use Selenium for this? I wonder what could be different between scraping from a local machine and from a hosting server.
I don't know how to resolve this; I hope you can give me some clues.
Answers:
You can get the JSON directly from the response object; there is no need to go through page.text and json.loads. Just use:
data=page.json()
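A minimal sketch of that suggestion, using the User-Agent and URL from the question, with a timeout and a status check added so the request fails fast instead of hanging silently:

```python
import requests

# Browser-like User-Agent from the question; CME appears to reject
# requests that use a default client UA.
HEADERS = {"User-Agent": (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36")}
URL = ("https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/"
       "Settlements/425/FUT?tradeDate=11/05/2021")

def fetch_settlements():
    # timeout= caps the wait instead of blocking forever;
    # raise_for_status() surfaces HTTP errors before JSON parsing.
    page = requests.get(URL, headers=HEADERS, timeout=10)
    page.raise_for_status()
    return page.json()  # parses the JSON body directly; no json.loads needed

if __name__ == "__main__":
    print(fetch_settlements())
```

The `fetch_settlements` helper name is my own; the call chain (`get` → `raise_for_status` → `json`) is standard `requests` usage.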
Apparently, some hosting providers are blocked by CME. You could look for one that is not blocked and use it as a proxy server; that is the solution that worked for me. However, I now suspect this may be related to the IPv6 settings on the server. Try disabling IPv6 so that connections automatically fall back to IPv4.
On Ubuntu:
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1
sudo sysctl -w net.ipv6.conf.lo.disable_ipv6=1
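If you would rather not change system-wide settings, a per-process alternative is to force requests down to IPv4. This sketch relies on a urllib3 internal hook (`allowed_gai_family`), so treat it as a fragile workaround rather than a supported API:

```python
import socket
import urllib3.util.connection as urllib3_cn

def _ipv4_only_family():
    # Restrict address resolution to IPv4 sockets only.
    return socket.AF_INET

# allowed_gai_family is an internal urllib3 hook, not a public API;
# overriding it makes requests (which uses urllib3) resolve IPv4 only.
urllib3_cn.allowed_gai_family = _ipv4_only_family
```

After this patch, any `requests.get(...)` call in the same process will connect over IPv4, without touching the server's sysctl settings.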
Just found the solution for this problem.
The reason for this behaviour is the HTTP/2 protocol.
One way to test this is to upgrade curl: since 7.47.0, the curl tool enables HTTP/2 by default for HTTPS connections.
Hope it helps!
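To check which protocol your curl build uses, and to compare the two against the endpoint from the question, curl's `--http1.1` and `--http2` flags force a specific version (the comparison commands are left commented so you can run them manually):

```shell
# Show curl's version line and feature list; "HTTP2" under Features means
# HTTP/2 is compiled in and used by default for HTTPS (curl >= 7.47.0).
curl --version

# To compare protocols against the endpoint (-m caps the wait in seconds):
#   curl --http1.1 -m 10 -A "Mozilla/5.0" "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021"
#   curl --http2   -m 10 -A "Mozilla/5.0" "https://www.cmegroup.com/CmeWS/mvc/Settlements/Futures/Settlements/425/FUT?tradeDate=11/05/2021"
```

If the `--http1.1` variant succeeds where `--http2` hangs (or vice versa), that points at the protocol rather than the network.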