Web Scraping with BeautifulSoup or lxml.html
Question:
I have seen some webcasts and need help trying to do this:
I have been using lxml.html. Yahoo recently changed its page structure.
Target page:
http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true
In Chrome, using the inspector, I see the data at
//*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table
followed by more markup.
How do I get this data out into a list?
How do I change the stock from "LLY" to "MSFT"?
How do I switch between dates, and get all the months?
Answers:
I know you said you can't use lxml.html, but here is how to do it with that library anyway, for completeness, because it is a very good library. I don't use BeautifulSoup anymore; it's unmaintained, slow, and has an ugly API.
The code below parses the page and writes the results to a CSV file.
import csv

import lxml.html

# parse the page directly from the URL
doc = lxml.html.parse('http://finance.yahoo.com/q/os?s=lly&m=2011-04-15')

# find the first table containing any tr with a td of class yfnc_tabledata1
table = doc.xpath("//table[tr/td[@class='yfnc_tabledata1']]")[0]

with open('results.csv', 'w', newline='') as f:
    cf = csv.writer(f)
    # iterate over every tr inside that table
    for tr in table.xpath('./tr'):
        # collect the text of each td in this tr into a list
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        # write the list as one CSV row
        cf.writerow(row)
That's it! lxml.html is simple and nice. Too bad you can't use it.
Here are some lines from the generated results.csv file:
LLY110416C00017500,N/A,0.00,17.05,18.45,0,0,17.50,LLY110416P00017500,0.01,0.00,N/A,0.03,0,182
LLY110416C00020000,15.70,0.00,14.55,15.85,0,0,20.00,LLY110416P00020000,0.06,0.00,N/A,0.03,0,439
LLY110416C00022500,N/A,0.00,12.15,12.80,0,0,22.50,LLY110416P00022500,0.01,0.00,N/A,0.03,2,50
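The first column of each row is an OCC-style option symbol: the underlying root, a yymmdd expiration, C for call or P for put, and the strike price times 1000 padded to eight digits. If you want those pieces separately, a small illustrative parser (the helper name is my own, not from the answer above) could look like this:

```python
import re
from datetime import datetime

def parse_option_symbol(symbol):
    # root + yymmdd expiration + C/P + strike * 1000 (8 digits)
    m = re.match(r'^([A-Z]+)(\d{6})([CP])(\d{8})$', symbol)
    if m is None:
        raise ValueError('unrecognised option symbol: %r' % symbol)
    root, expiry, kind, strike = m.groups()
    return {
        'root': root,
        'expiry': datetime.strptime(expiry, '%y%m%d').date(),
        'type': 'call' if kind == 'C' else 'put',
        'strike': int(strike) / 1000.0,
    }

print(parse_option_symbol('LLY110416C00017500'))
```

For the first sample row this gives root LLY, expiration 2011-04-16, a call, and a 17.50 strike, which matches the strike column in the CSV output.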
Here is a simple example to extract all data from the stock tables:
import urllib.request

import lxml.html

html = urllib.request.urlopen('http://finance.yahoo.com/q/op?s=lly&m=2014-11-15').read()
doc = lxml.html.fromstring(html)

# scrape figures from each stock table
for table in doc.xpath('//table[@class="details-table quote-table Fz-m"]'):
    rows = []
    for tr in table.xpath('./tbody/tr'):
        row = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(row)
    print(rows)
Then, to extract data for different stocks and dates, you just change the URL. Here is MSFT for the previous day:
http://finance.yahoo.com/q/op?s=msft&m=2014-11-14
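Building those URLs by hand gets tedious; a tiny hypothetical helper (the s and m query parameters are taken from the URLs above) keeps the symbol and date substitution in one place:

```python
def options_url(symbol, date):
    # build the options-page URL from a ticker and a YYYY-MM-DD date string
    return 'http://finance.yahoo.com/q/op?s=%s&m=%s' % (symbol.lower(), date)

print(options_url('MSFT', '2014-11-14'))
```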
If you'd like raw JSON, try MSN:
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/
You can also specify an expiration date with ?date=11/14/2014:
http://www.msn.com/en-us/finance/stocks/optionsajax/126.1.UNH.NYS/?date=11/14/2014
If you prefer Yahoo's JSON:
http://finance.yahoo.com/q/op?s=LLY
But you have to extract it from the HTML:
import json
import re

import requests

resp = requests.get('http://finance.yahoo.com/q/op?s=LLY')

# the JSON payload is embedded in a <script> tag in the page source
m = re.search(r'<script>.+({"applet_type":"td-applet-options-table".+);</script>', resp.text)
data = json.loads(m.group(1))
as_dicts = data['models']['applet_model']['data']['optionData']['_options'][0]['straddles']
The expiration dates are here:
data['models']['applet_model']['data']['optionData']['expirationDates']
Convert each ISO date to a Unix timestamp, then re-request the other expirations with that timestamp:
http://finance.yahoo.com/q/op?s=LLY&date=1414713600
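As a sketch, that ISO-to-Unix-timestamp conversion (1414713600 corresponds to 2014-10-31 00:00 UTC, matching the date= parameter in the URL above):

```python
import calendar
from datetime import datetime

def iso_to_unix(iso_date):
    # interpret the ISO expiration date as midnight UTC
    dt = datetime.strptime(iso_date, '%Y-%m-%d')
    # timegm treats the struct_time as UTC (unlike time.mktime, which is local)
    return calendar.timegm(dt.utctimetuple())

print(iso_to_unix('2014-10-31'))  # → 1414713600
```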
Building on @hoju's answer:
import calendar
from datetime import datetime

import lxml.html

exDate = '2014-11-22'
symbol = 'LLY'

# convert the ISO expiration date to the Unix timestamp the URL expects
dt = datetime.strptime(exDate, '%Y-%m-%d')
ts = calendar.timegm(dt.utctimetuple())

url = 'http://finance.yahoo.com/q/op?s=%s&date=%s' % (symbol, ts)
doc = lxml.html.parse(url)

# grab every row of the options table
rows = []
for tr in doc.xpath('//table[@class="details-table quote-table Fz-m"]/tbody/tr'):
    # strip thousands separators so the figures parse cleanly as numbers
    row = [td.text_content().strip().replace(',', '') for td in tr.xpath('./td')]
    rows.append(row)
print(rows)