Parse SEC EDGAR XML Form Data with child nodes using BeautifulSoup
Question:
I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using BeautifulSoup and XML. A typical submission, [linked here][1], uses `<edgarSubmission>` as its root element.

[1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml
Answers:
Here is, in my opinion, the best way to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even ones from the same filer.

Since this is an XML file, the easiest approach is an XML parser and XPath. Given that you're looking to create a dataframe, the most appropriate tool is the pandas read_xml() function. Because the XML is nested, you will need to create two dataframes and concatenate them (others may have a better idea of how to approach it). Finally, although read_xml() can read directly from a URL, EDGAR requires a user-agent header, so you also need the requests library.
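As an aside on the user-agent: SEC's fair-access guidance asks automated clients to identify themselves, and a descriptive string with a contact address is the convention. A minimal sketch (the name and email below are placeholders, not real values):

```python
import requests

# EDGAR rejects anonymous requests; identify yourself in the
# User-Agent. The name and email here are placeholders.
HEADERS = {"User-Agent": "Example Research admin@example.com"}

def fetch_filing(url: str) -> str:
    """Fetch an EDGAR document, failing fast on a non-200 response."""
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()  # e.g. 403 when the User-Agent is missing
    return resp.text
```

Checking the status before parsing avoids feeding an HTML error page to read_xml().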
So, all together:
```python
# import required libraries
from io import StringIO

import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'

# set headers with a user-agent (EDGAR rejects requests without one)
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)

# define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd', 'valUSD', 'isRestrictedSec', 'fairValLevel', 'debtSec', 'securityLending']

# the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}

# create the first df, for the securities which are debt instruments
# (wrapped in StringIO because newer pandas deprecates passing literal XML strings)
invest = pd.read_xml(StringIO(req.text), xpath="//nport:invstOrSec[.//nport:debtSec]",
                     namespaces=namespaces).drop(to_drop, axis=1)

# create the 2nd df, for the debt details
debt = pd.read_xml(StringIO(req.text), xpath="//nport:debtSec",
                   namespaces=namespaces).iloc[:, 0:3]

# finally, concatenate the two into one df
pd.concat([invest, debt], axis=1)
```
This should output your 126 debt securities (pardon the formatting):

```text
                    name                   lei                  title      cusip    balance units    pctVal payoffProfile assetCat issuerCat invCountry  maturityDt couponKind  annualizedRt
0        ARROW BIDCO LLC  549300YHZN08M0H3O128        Arrow Bidco LLC  042728AA3  115000.00    PA  0.396755          Long      DBT      CORP         US  2024-03-15      Fixed       9.50000
1  CD&R SMOKEY BUYER INC                   NaN  CD&R Smokey Buyer Inc  12510CAA9  165000.00    PA  0.505585          Long      DBT      CORP         US  2025-07-15      Fixed       6.75000
```
You can then work with the final dataframe: add or drop columns, convert dtypes, etc.
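To see why the two-read-and-concat pattern is safe, here is a self-contained miniature with made-up tags and a made-up namespace URI (not the real N-PORT schema): both XPath queries walk the document in order, so the parent rows and the nested child rows line up positionally.

```python
from io import StringIO

import pandas as pd

# Tiny nested XML in the same shape as the filing: namespaced parent
# rows, each with a nested child element carrying extra fields.
# Tags and the namespace URI are invented for illustration.
XML = """<root xmlns="http://example.com/demo">
  <sec><title>Bond A</title><debt><rt>9.5</rt></debt></sec>
  <sec><title>Bond B</title><debt><rt>6.75</rt></debt></sec>
</root>"""

ns = {"d": "http://example.com/demo"}

# Parent rows: one per <sec> that contains a <debt> child. The nested
# <debt> shows up as an empty column, so drop it (like debtSec above).
parent = pd.read_xml(StringIO(XML), xpath="//d:sec[.//d:debt]",
                     namespaces=ns).drop(columns=["debt"])

# Child rows: one per <debt>, emitted in the same document order.
child = pd.read_xml(StringIO(XML), xpath="//d:debt", namespaces=ns)

# Positional concat works because both frames share document order.
combined = pd.concat([parent, child], axis=1)
```

The same alignment holds in the filing above because every `invstOrSec` matched by the first query contains exactly one `debtSec` matched by the second.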