BeautifulSoup: Scrape Table with Keyword Search

Question:

I'm trying to scrape tables from multiple websites using keywords. I want to extract the values from the table cell that has "Cash and cash equivalent" as its row header and "2020" as its column header, so that I can later write them to an Excel file. However, I cannot get the code to work. I hope you can help me with this. Thank you!

from bs4 import BeautifulSoup
import requests
import time
from pandas import DataFrame
import pandas as pd


#headers={"Content-Type":"text"}
headers = {'User-Agent': '[email protected]'}

urls={'https://www.sec.gov/Archives/edgar/data/1127993/0001091818-21-000003.txt',
      'https://www.sec.gov/Archives/edgar/data/1058307/0001493152-21-003451.txt'}

Cash=[]

for url in urls:
  response = requests.get(url, headers = headers)
  response.raise_for_status()
  time.sleep(0.1)
  soup = BeautifulSoup(response.text,'lxml')

  for table in soup.find_all('table'):
    for tr in table.find_all('tr'):
      row = [td.get_text(strip=True) for td in tr.find_all('td')]
      headers = [header.get_text(strip=True).encode("utf-8") for header in tr[0].find_all("th")]
      try:
        if '2020' in headers[0]:
          if row[0] == 'Cash and cash equivalent':
            Cash_and_cash_equivalent = f'{url}'+ ' ' + headers+ str(row)
            Cash.append(Cash_and_cash_equivalent)
          if row[0] == 'Cash':
            Cash_ = f'{url}'+ ' ' + headers+ str(row)
            Cash.append(Cash_)
      except IndexError:
        continue
print(Cash)


Asked By: Candice LE

Answers:

You could do something along these lines:

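import io

import requests
import pandas as pd

# Show every column and the full cell contents when printing.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

headers = {'User-Agent': '[email protected]'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1127993/0001091818-21-000003.txt', headers=headers)
r.raise_for_status()
# Parse every <table> in the filing into its own DataFrame.
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in recent pandas.
dfs = pd.read_html(io.StringIO(r.text))

for df in dfs:
    # Keep only the tables in which some cell contains the search phrase.
    if df.apply(lambda row: row.astype(str).str.contains('Cash and Cash Equivalents').any(), axis=1).any():
        # In this filing's layout, row 2 holds the year labels; promote it to the header.
        new_header = df.iloc[2]
        df = df[3:]
        df.columns = new_header
        print(df)  # or display(df) if you're in a Jupyter notebook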

This will print two DataFrames, for tables #37 and #71. You may need to improve the table header detection, since only table #71 comes out with proper headers (the years).
I also tried the second URL, but it hung for me (the page is huge).
The printout in a terminal will look something like this:

NaN NaN 2020    NaN 2019
3   Cash Flows from Operating Activities    NaN NaN NaN NaN
4   Net loss    NaN $(13,134,778)   NaN $ (2,017,347)
5   Adjustments to reconcile net loss to net cash used in operating activities: NaN NaN NaN NaN
6   Depreciation and amortization   NaN 84940   NaN 7832
7   Amortization of convertible debt discounts  NaN 74775   NaN 60268
8   Accretion and settlement of financing instruments   NaN NaN NaN NaN
9   and change in fair value of derivative liability    NaN 1381363 NaN (1,346,797)
10  Stock compensation and stock issued for services    NaN 2870472 NaN -
11  Stock issued under Put Purchase Agreement   NaN 7865077 NaN -
12  NaN NaN NaN NaN NaN
13  Changes in assets and liabilities:  NaN NaN NaN NaN
14  Accounts receivable NaN (696,710)   NaN 82359
15  Inventories NaN (78,919)    NaN 304970
16  Accounts payable    NaN (1,462,072) NaN (22,995)
17  Accrued expenses    NaN (158,601)   NaN (346,095)
18  Deferred revenue    NaN 431147  NaN (91,453)
19  Net cash used in operating activities   NaN (2,823,306) NaN (3,369,258)
20  NaN NaN NaN NaN NaN
21  Cash Flows from Investing Activities    NaN NaN NaN NaN
22  Acquisition of business, net of cash    NaN -   NaN 2967918
23  Purchases of property and equipment NaN -   NaN (17,636)
24  Net cash provided by investing activities   NaN -   NaN 2950282
25  NaN NaN NaN NaN NaN
26  Cash Flows from Financing Activities    NaN NaN NaN NaN
27  Principal payments on financing lease obligations   NaN -   NaN (1,649)
28  Principal payments on notes payable NaN (774)   NaN -
29  Payments on advances from stockholder, net  NaN (33,110)    NaN -
30  Proceeds from convertible notes payable NaN 840000  NaN 667000
31  Payments on line of credit, net NaN (300,000)   NaN -
32  Proceeds from sale of common stock under Purchase Agreement NaN 2316520 NaN -
33  Net cash provided by financing activities   NaN 2822636 NaN 665351
34  NaN NaN NaN NaN NaN
35  Net Increase (Decrease) in Cash and Cash Equivalents    NaN (670)   NaN 246375
36  NaN NaN NaN NaN NaN
37  Cash, Beginning of Period   NaN 412391  NaN 169430
38  NaN NaN NaN NaN NaN
39  Cash, End of Period NaN $ 411,721   NaN $ 415,805
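
One possible way to improve the header detection mentioned above, and to pull out a single value by row label and year, is sketched below. This is not part of the original answer: the helper names `find_year_row`, `parse_amount`, and `lookup` are made up for illustration, and the first two rows of the sample frame are invented stand-ins (the remaining values are copied from the printout above). It assumes the year labels sit together in one row and the row label is in the first column, which may not hold for every filing.

```python
import re

import pandas as pd

YEAR = re.compile(r"(19|20)\d{2}")


def clean(cell):
    """Return a stripped string for a cell; missing values become ''."""
    if pd.isna(cell):
        return ""
    # Years parsed as floats look like '2020.0'; drop the trailing '.0'.
    return re.sub(r"\.0$", "", str(cell).strip())


def find_year_row(df):
    """Return (position, cleaned cells) of the first row holding a 4-digit year."""
    for pos, row in enumerate(df.values.tolist()):
        cells = [clean(c) for c in row]
        if any(YEAR.fullmatch(c) for c in cells):
            return pos, cells
    return None


def parse_amount(cell):
    """Convert '$ (2,017,347)' to -2017347.0; return None for blanks and dashes."""
    text = re.sub(r"[$,\s]", "", str(cell))
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]  # accounting style: parentheses mean a negative amount
    if not re.fullmatch(r"-?\d+(\.\d+)?", text):
        return None
    value = float(text)
    return -value if negative else value


def lookup(df, label, year):
    """Return the amount under `year` for the first row whose label starts with `label`."""
    found = find_year_row(df)
    if found is None:
        return None
    pos, header = found
    if year not in header:
        return None
    col = header.index(year)
    for row in df.iloc[pos + 1:].values.tolist():
        if clean(row[0]).lower().startswith(label.lower()):
            return parse_amount(row[col])
    return None


# Small stand-in for one DataFrame returned by pd.read_html.
# Rows 0 and 1 are invented; the rest mirrors values from the printout above.
sample = pd.DataFrame([
    ["Statement of Cash Flows", None, None, None, None],
    [None, None, None, None, None],
    [None, None, "2020", None, "2019"],
    ["Net loss", None, "$(13,134,778)", None, "$ (2,017,347)"],
    ["Cash, Beginning of Period", None, "412391", None, "169430"],
    ["Cash, End of Period", None, "$ 411,721", None, "$ 415,805"],
])

print(lookup(sample, "Cash, End of Period", "2020"))  # 411721.0
print(lookup(sample, "Net loss", "2019"))             # -2017347.0
```

Matching on the start of the label, case-insensitively, is a deliberate choice here so that "Cash" also matches "Cash and cash equivalents"; use an exact comparison instead if that is too loose for your filings.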
Answered By: platipus_on_fire