How can I extract a table element into a CSV or DataFrame with Selenium/BeautifulSoup?

Question:

So, my code looks like this:

from bs4 import BeautifulSoup 
import requests
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import pyautogui
from selenium.webdriver.support.ui import WebDriverWait
keyword='milk'
browser = webdriver.Edge(r"C:\Users\solan\Documents\edgedriver_win32\msedgedriver.exe")
browser.get('https://fdc.nal.usda.gov/fdc-app.html#/?query='+keyword+'')
#element1= browser.find_element_by_xpath('/html/body/div/main/app-root/app-food-search/div/div/div[1]/div[4]/table')
element1= browser.find_elements_by_xpath('//a[@class="result-description"]')
#data=element1.text
for item in element1:
     print(item.text)

Food= input("")
time.sleep(10)
z=browser.find_element_by_link_text(Food).click() 

It prints a list of results, from which I type "Yogurt, plain, whole milk" and press Enter. The page that opens contains a table of food contents. I would like to extract that table directly into a pandas DataFrame or a CSV file.

I am trying this to get the table contents:

for table in browser.find_elements_by_xpath('/html/body/div/main/app-root/app-food-details/div/div/div[2]/div/div/div/app-food-nutrients/div/div[2]/table'):
    print(table.text)

Which outputs the table contents as plain text (shown as a screenshot in the original post).

table.text is a str, and I am not sure how to fit it into a CSV or DataFrame. When I try, everything ends up in a single row; the table structure is not detected. Does anyone have any suggestions?

Asked By: Rahul Solanki

Answers:

That table is hydrated via an XHR network call (see Dev Tools, Network tab). You can request that endpoint directly, avoiding the overhead of Selenium and whatever other heavy artillery you are using:

import requests
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url = 'https://fdc.nal.usda.gov/portal-data/external/2259793'
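r = requests.get(url, headers=headers, timeout=30)
r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
#print(r.json())
# Flatten the nested 'foodNutrients' records into one row per nutrient
df = pd.json_normalize(r.json()['foodNutrients'])
print(df)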

This will return a dataframe (which you can further save to csv, if you want):

[...] (too big to post it here)

You can inspect that JSON response further and flatten (normalize) it as needed, or select only specific columns from the dataframe. Pandas docs relevant to reading JSON: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
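
As a rough illustration of the flatten-then-select step, here is a minimal sketch using only the standard library. It mimics the dotted column names that pd.json_normalize produces by default. Only the top-level "foodNutrients" key comes from the code above; the nested field names in `payload`, the `flatten` helper and the `wanted` list are assumptions for illustration, so check them against the real response before relying on them.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical stand-in for r.json(); the nested field names are assumed.
payload = {
    "foodNutrients": [
        {"nutrient": {"name": "Water", "unitName": "g"}, "amount": 85.3},
        {"nutrient": {"name": "Protein", "unitName": "g"}, "amount": 3.82},
    ]
}

rows = [flatten(item) for item in payload["foodNutrients"]]

# Keep only the columns of interest, in a fixed order.
wanted = ["nutrient.name", "amount", "nutrient.unitName"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=wanted, extrasaction="ignore", lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

With pandas, the same selection would be a column subset of the normalized dataframe followed by to_csv(); the sketch only shows what the flattening does to the nested keys.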

EDIT: here is a solution based on selenium & pandas, returning only the visible values in that table:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-notifications")
chrome_options.add_argument("--window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  # path to where you saved the chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

url = 'https://fdc.nal.usda.gov/fdc-app.html#/?query=milk'

df_list = []
browser.get(url)
# Open the food's detail page once its link is clickable
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Yogurt, plain, whole milk'))).click()
# Wait for the nutrients table to render
table_w_data = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@id="myTabContent"]//table[@id="nutrients-table"]')))
# Collect the non-empty header cells as column names
t_header = WebDriverWait(table_w_data, 20).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'th')))
columns = [x.text.strip() for x in t_header if len(x.text.strip()) > 0]
print(columns)
columns.remove('Footnote')  # drop this header so the column names match the values collected below
print(columns)
rows = WebDriverWait(table_w_data, 20).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'tr')))
for row in rows:
    tds = row.find_elements(By.TAG_NAME, 'td')
    if len(tds) > 1 and len(row.text) > 5:
        # Note: empty cells are dropped here, so rows with blank cells shift left
        values = [x.text.strip() for x in tds if len(x.text) > 0]
        print(values)
        df_list.append(values)
        print('_______________________')

df = pd.DataFrame(df_list, columns=columns)
print(df)  # in a Jupyter notebook, display(df) renders the table instead
df.to_csv('milk_stuffs_nutritional_vals.csv')

This ran quite slowly on my machine. It returns a dataframe with those values and also saves them to CSV. The result looks like this:

    Name                                      Average Amount  Unit  Deriv. By   n     Samples  Min    Max    Median  Initial Year Acquired
0   Proximates:                               None            None  None        None  None     None   None   None    None
1   Water                                     85.3            g     Analytical  8     Samples  81.7   87.4   86.4    2021
2   Energy (Atwater General Factors)          78              kcal  Calculated  None  None     None   None   None    None
3   Energy (Atwater Specific Factors)         77              kcal  Calculated  None  None     None   None   None    None
4   Nitrogen                                  0.6             g     Analytical  8     Samples  0.49   0.79   0.56    2021
5   Protein                                   3.82            g     Calculated  3.13  5.04     3.59   None   None    None
6   Total lipid (fat)                         4.48            g     Analytical  8     Samples  3.66   5.66   4.52    2021
7   Ash                                       0.85            g     Analytical  8     Samples  0.67   1.07   0.77    2021
8   Carbohydrates:                            None            None  None        None  None     None   None   None    None
9   Carbohydrate, by difference               5.57            g     Calculated  None  None     None   None   None    None
10  Sugars, Total NLEA                        4.09            g     Summed      None  None     None   None   None    None
11  Sucrose                                   <0.25           g     Analytical  8     Samples  2021   None   None    None
12  Glucose                                   <0.25           g     Analytical  8     Samples  2021   None   None    None
13  Fructose                                  <0.25           g     Analytical  8     Samples  2021   None   None    None
14  Lactose                                   3.35            g     Analytical  8     Samples  2.51   4.3    3.21    2021
15  Maltose                                   <0.25           g     Analytical  8     Samples  2021   None   None    None
16  Galactose                                 0.75            g     Analytical  8     Samples  0.56   0.84   0.78    2021
17  Minerals:                                 None            None  None        None  None     None   None   None    None
18  Calcium, Ca                               127             mg    Analytical  8     Samples  101    163    121     2021
19  Iron, Fe                                  <0.1            mg    Analytical  8     Samples  2021   None   None    None
20  Magnesium, Mg                             11.4            mg    Analytical  8     Samples  8.7    15.2   10.9    2021
21  Phosphorus, P                             101             mg    Analytical  8     Samples  78     137    95      2021
22  Potassium, K                              164             mg    Analytical  8     Samples  127    212    160     2021
23  Sodium, Na                                42              mg    Analytical  8     Samples  36     55     40      2021
24  Zinc, Zn                                  0.43            mg    Analytical  8     Samples  0.32   0.58   0.4     2021
25  Copper, Cu                                0.003           mg    Analytical  8     Samples  0      0.014  0       2021
26  Manganese, Mn                             0.002           mg    Analytical  8     Samples  0      0.007  0       2021
27  Iodine, I                                 32.3            µg    Analytical  8     Samples  22     57.1   26.7    2021
28  Vitamins and Other Components:            None            None  None        None  None     None   None   None    None
29  Thiamin                                   0.055           mg    Analytical  8     Samples  0.045  0.07   0.052   2021
30  Riboflavin                                0.243           mg    Analytical  8     Samples  0.19   0.29   0.242   2021
31  Niacin                                    0.135           mg    Analytical  8     Samples  0.09   0.18   0.14    2021
32  Vitamin B-6                               0.045           mg    Analytical  8     Samples  0.032  0.07   0.044   2021
33  Biotin                                    <3.7            µg    Analytical  8     Samples  2021   None   None    None
34  Vitamin A                                 None            None  None        None  None     None   None   None    None
35  Retinol                                   48              µg    Analytical  8     Samples  38     78     43      2021
36  Vitamin D (D2 + D3), International Units  31.1            IU    Calculated  None  None     None   None   None    None
37  Vitamin D (D2 + D3)                       0.78            µg    Calculated  None  None     None   None   None    None
38  Vitamin D2 (ergocalciferol)               <0.1            µg    Analytical  8     Samples  2021   None   None    None
39  Vitamin D3 (cholecalciferol)              0.78            µg    Analytical  8     Samples  0      1.39   1.1     2021
40  Lipids:                                   None            None  None        None  None     None   None   None    None
41  Fatty acids, total saturated              2.32            g     Analytical  8     Samples  1.83   3.05   2.18    2021
42  SFA 12:0                                  0.115           g     Analytical  8     Samples  0.088  0.156  0.108   2021
43  SFA 14:0                                  0.382           g     Analytical  8     Samples  0.306  0.482  0.369   2021
44  SFA 16:0                                  1.07            g     Analytical  8     Samples  0.787  1.51   0.999   2021
45  SFA 18:0                                  0.372           g     Analytical  8     Samples  0.268  0.445  0.39    2021
46  Fatty acids, total monounsaturated        0.874           g     Analytical  8     Samples  0.71   1.14   0.84    2021
47  MUFA 18:1 c                               0.766           g     Analytical  8     Samples  0.615  1      0.739   2021
48  PUFA 18:2 n-6 c,c                         0.082           g     Analytical  8     Samples  0.054  0.127  0.074   2021
49  Cholesterol                               14              mg    Analytical  8     Samples  12     18     14      2021
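
Note that in the output above, rows with blank cells are shifted left (for Sucrose, 2021 lands under Min rather than Initial Year Acquired), because the list comprehension skips empty cells. A minimal sketch of one way to keep each value under its own column is to turn blanks into None instead of dropping them. `align_row` is a hypothetical helper, and the sample cell texts below only assume where the blanks sit; the real td layout of the page (including the Footnote cell) would need checking.

```python
def align_row(cell_texts, n_columns):
    """Map raw cell texts onto exactly n_columns positions, keeping blanks as None."""
    cleaned = [text.strip() or None for text in cell_texts]
    cleaned = cleaned[:n_columns]                    # drop surplus trailing cells
    cleaned += [None] * (n_columns - len(cleaned))   # pad rows that are too short
    return cleaned


columns = ["Name", "Average Amount", "Unit", "Deriv. By", "n",
           "Samples", "Min", "Max", "Median", "Initial Year Acquired"]

# Assumed raw texts for one row, with blanks where Min/Max/Median would be.
raw_cells = ["Sucrose", "<0.25", "g", "Analytical", "8", "Samples", "", "", "", " 2021 "]

row = align_row(raw_cells, len(columns))
print(dict(zip(columns, row)))
```

In the loop above, this would replace the filtered list comprehension, for example by passing [x.text for x in tds] and len(columns) to the helper before appending to df_list.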
Answered By: platipus_on_fire

To extract the <table> data, use WebDriverWait with visibility_of_element_located() on the <table> element, take the element's outerHTML, and parse that HTML with pandas' read_html(). You can use the following locator strategy:

  • Code Block:

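    from io import StringIO  # recent pandas versions deprecate passing raw HTML strings to read_html

    driver.get('https://fdc.nal.usda.gov/fdc-app.html#/?query=milk')
    time.sleep(5)
    data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.usa-table-results.usa-table-borderless.header-alignment"))).get_attribute("outerHTML")
    df = pd.read_html(StringIO(data))  # returns a list of DataFrames, one per <table> found
    print(df)
    driver.quit()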
    
  • Console Output:

    [    NDB Number  ...             SR Food Category
    0         1036  ...       Dairy and Egg Products
    1         1116  ...       Dairy and Egg Products
    2       100276  ...                    Beverages
    3         1019  ...       Dairy and Egg Products
    4       100277  ...                    Beverages
    5       100275  ...  Legumes and Legume Products
    6         1293  ...       Dairy and Egg Products
    7        14091  ...                    Beverages
    8        16222  ...  Legumes and Legume Products
    9         1077  ...       Dairy and Egg Products
    10        1082  ...       Dairy and Egg Products
    11        1085  ...       Dairy and Egg Products
    12        1079  ...       Dairy and Egg Products
    
    [13 rows x 4 columns]]
    

Since read_html() returns a list of DataFrames, you can write the first one to a CSV file as follows:

df[0].to_csv("my_data.csv", index=False)
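
To make the outerHTML-to-CSV idea concrete without a browser, here is a minimal sketch that parses a table's HTML with the standard library's html.parser and writes the rows as CSV. It is not what read_html() does internally, just the same idea in miniature. The `TableRows` class and the `outer_html` snippet are hypothetical; the snippet is a hand-written stand-in for what get_attribute("outerHTML") might return, reusing two of the columns from the console output above.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical stand-in for the table's outerHTML.
outer_html = """
<table>
  <thead><tr><th>NDB Number</th><th>SR Food Category</th></tr></thead>
  <tbody>
    <tr><td>1036</td><td>Dairy and Egg Products</td></tr>
    <tr><td>100276</td><td>Beverages</td></tr>
  </tbody>
</table>
"""

parser = TableRows()
parser.feed(outer_html)
parser.close()

buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The key point is the same as in the answer: the element's text loses the cell boundaries, while its HTML keeps them, so the rows and columns can be recovered.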

Answered By: undetected Selenium