Get the S&P 500 tickers list

Question:

So I am following this series on Python for Finance and it keeps giving me this error:

line 22, in <module>
    save_sp500_tickers()
line 8, in save_sp500_tickers
    soup = bs.BeautifulSoup(resp.text, 'lxml')
line 165, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml.
Do you need to install a parser library?

I have been at it for a whole day and I honestly refuse to give up, so any help with this would be greatly appreciated. Also, if anyone has suggestions for something other than pickle and can help write something that lets me get the S&P 500 list without pickle, that would be great.

import bs4 as bs
import pickle
import requests
import lxml

def save_sp500_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})

    tickers = []

    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)

    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    print(tickers)

    return tickers    

save_sp500_tickers()
Asked By: alex


Answers:

Running your code as-is works on my system. Probably, as Eric suggests, you should install lxml.

Unfortunately, if you are on Windows, pip install lxml does not work unless you have a whole compiler infrastructure set up, which you probably don't.

Luckily you can get a precompiled binary installer from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml – make sure you pick the one that matches your version of Python and whether it is 32-bit or 64-bit.
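
To check that the install actually landed in the interpreter you run the script with, a quick sanity check (just a sketch) is:

try:
    from lxml import etree
    print('lxml is available, version:', etree.LXML_VERSION)
except ImportError:
    print('lxml is not installed for this interpreter')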

Edit: just for interest, try changing the parser line to use Python's built-in parser instead:

soup = bs.BeautifulSoup(resp.text, 'html.parser')   # built-in parser, no lxml needed

See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for a list of available parsers.
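
If installing lxml stays painful, here is a minimal sketch of the full function using the built-in parser and writing to a plain text file instead of pickle (the filename sp500tickers.txt is just an example):

import bs4 as bs
import requests

def save_sp500_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'html.parser')   # built-in parser, no lxml needed
    table = soup.find('table', {'class': 'wikitable sortable'})

    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text.strip()
        tickers.append(ticker)

    # One ticker per line in a plain text file, so no pickle is needed.
    with open('sp500tickers.txt', 'w') as f:
        f.write('\n'.join(tickers))

    return tickers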

Answered By: Hugh Bothwell

Using SPY ETF

To obtain an official list of S&P 500 symbols as constituents of the SPY ETF, pandas.read_excel can be used. A package such as openpyxl is also required as it is used internally by pandas.

import pandas as pd

def list_spy_holdings() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    # Source: https://www.ssga.com/us/en/intermediary/etfs/funds/spdr-sp-500-etf-trust-spy
    url = 'https://www.ssga.com/us/en/intermediary/etfs/library-content/products/fund-data/etfs/us/holdings-daily-us-en-spy.xlsx'
    return pd.read_excel(url, engine='openpyxl', skiprows=4).dropna()
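
A quick usage sketch; the exact column names come from the downloaded spreadsheet, so 'Ticker' below is an assumption to verify against the actual columns:

holdings = list_spy_holdings()
print(holdings.columns.to_list())        # inspect the real column names first
tickers = holdings['Ticker'].to_list()   # 'Ticker' is assumed; adjust if it differs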

Using Wikipedia

To obtain an unofficial list of S&P 500 symbols, pandas.read_html can be used. A parser such as lxml or bs4+html5lib is also required as it is used internally by pandas.

import pandas as pd

def list_wikipedia_sp500() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    url = 'https://en.m.wikipedia.org/wiki/List_of_S%26P_500_companies'
    return pd.read_html(url, attrs={'id': 'constituents'}, index_col='Symbol')[0]

>>> df = list_wikipedia_sp500()
>>> df.head()
           Security             GICS Sector  ...      CIK      Founded
Symbol                                       ...                      
MMM              3M             Industrials  ...    66740         1902
AOS     A. O. Smith             Industrials  ...    91142         1916
ABT          Abbott             Health Care  ...     1800         1888
ABBV         AbbVie             Health Care  ...  1551152  2013 (1888)
ACN       Accenture  Information Technology  ...  1467373         1989
[5 rows x 7 columns]

>>> symbols = df.index.to_list()
>>> symbols[:5]
['MMM', 'AOS', 'ABT', 'ABBV', 'ACN']

Using Slickcharts

To obtain an unofficial list of S&P 500 symbols from Slickcharts, pandas.read_html can again be used; a custom User-Agent header is needed because the site rejects the default one sent by requests.

import pandas as pd
import requests

def list_slickcharts_sp500() -> pd.DataFrame:
    # Ref: https://stackoverflow.com/a/75845569/
    url = 'https://www.slickcharts.com/sp500'
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0'  # Default user-agent fails.
    response = requests.get(url, headers={'User-Agent': user_agent})
    return pd.read_html(response.text, match='Symbol', index_col='Symbol')[0]
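
As with the Wikipedia version, the table is indexed by Symbol, so the tickers can be read off the index (a quick usage sketch):

symbols = list_slickcharts_sp500().index.to_list()   # tickers in the order listed on the page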

These were tested with Pandas 1.5.3.

The results can be cached for a certain period of time, e.g. 12 hours, in memory and/or on disk, to avoid excessive repeated calls to the source.
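
For example, a minimal on-disk cache along those lines might look like this; the file name, the CSV format, and the 12-hour window are illustrative choices, not part of the loaders above:

import time
from pathlib import Path

import pandas as pd

CACHE_FILE = Path('sp500_constituents.csv')   # illustrative cache location
CACHE_MAX_AGE = 12 * 60 * 60                  # 12 hours, in seconds

def cached_sp500() -> pd.DataFrame:
    # Reuse the cached table while it is fresh; otherwise refetch and rewrite it.
    if CACHE_FILE.exists() and time.time() - CACHE_FILE.stat().st_mtime < CACHE_MAX_AGE:
        return pd.read_csv(CACHE_FILE, index_col='Symbol')
    df = list_wikipedia_sp500()   # or any of the other loaders above
    df.to_csv(CACHE_FILE)
    return df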

A similar answer for the Nasdaq 100 is here.

Answered By: Asclepius