Why is read_html from Python pandas not working?
Question:
I would like to use the Python pandas read_html() function to scrape the information from the Yahoo Finance table seen in the screenshot, bordered in red.
However, I received an HTTPError: HTTP Error 404: Not Found
Here is my code:
!pip install pandas
!pip install requests
!pip install bs4
!pip install requests_html
!pip install pytest-astropy
!pip install nest_asyncio
!pip install plotly
import pandas as pd
from bs4 import BeautifulSoup
import requests
import requests_html
import nest_asyncio
import lxml
import html5lib
nest_asyncio.apply()
url_link = "https://finance.yahoo.com/quote/NFLX/history?p=NFLX"
read_html_pandas_data = pd.read_html(url_link)
Answers:
Because a User-Agent header is needed, and it can't be specified with read_html. You could grab the table first with requests, specifying the appropriate header, then hand it over to pandas:
from pandas import read_html as rh
import requests
from bs4 import BeautifulSoup as bs

# Fetch the page with a User-Agent header so Yahoo doesn't reject the request.
r = requests.get('https://finance.yahoo.com/quote/NFLX/history?p=NFLX',
                 headers={'User-Agent': 'Mozilla/5.0'})
soup = bs(r.content, 'lxml')
# Isolate the historical-prices table, then let pandas parse it.
table = rh(str(soup.select_one('[data-test="historical-prices"]')))[0]
print(table)
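The BeautifulSoup-to-pandas handoff above can be demonstrated offline with inline HTML standing in for the fetched page (the rows below are made-up sample data, not real Yahoo quotes):

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Inline HTML standing in for the fetched Yahoo page (hypothetical figures).
html = """
<html><body>
<table data-test="historical-prices">
  <tr><th>Date</th><th>Close</th></tr>
  <tr><td>Jan 03, 2022</td><td>597.37</td></tr>
  <tr><td>Jan 04, 2022</td><td>591.15</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Isolate the target table by its data-test attribute, as in the answer above.
table_html = str(soup.select_one('[data-test="historical-prices"]'))
# Wrap the string in StringIO: newer pandas versions deprecate passing
# literal HTML strings straight to read_html.
df = pd.read_html(StringIO(table_html))[0]
print(df)
```

Wrapping the markup in StringIO keeps the snippet compatible with pandas 2.x, which warns when read_html receives a raw HTML string.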
Try as follows:
import pandas as pd
import requests
url_link = 'https://finance.yahoo.com/quote/NFLX/history?p=NFLX'
# A browser-like User-Agent header avoids the 404 that read_html gets on its own.
r = requests.get(url_link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
read_html_pandas_data = pd.read_html(r.text)[0]
print(read_html_pandas_data)
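Yahoo's historical-prices table typically ends with a footnote row (e.g. "*Close price adjusted for splits.") and every column comes back as a string. A sketch of the cleanup, using a hypothetical stand-in DataFrame instead of a live download:

```python
import pandas as pd

# Hypothetical stand-in for pd.read_html(r.text)[0]; the last row mimics
# the footnote Yahoo appends under the table.
df = pd.DataFrame({
    "Date": ["Jan 04, 2022", "Jan 03, 2022",
             "*Close price adjusted for splits."],
    "Close*": ["591.15", "597.37",
               "*Close price adjusted for splits."],
})

# Drop the footnote row, then coerce columns to useful dtypes.
df = df.iloc[:-1].copy()
df["Date"] = pd.to_datetime(df["Date"], format="%b %d, %Y")
df["Close*"] = df["Close*"].astype(float)
print(df.dtypes)
```

After this, the frame can be sorted or plotted by date, since Date is a real datetime column rather than text.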