How can I extract a table element into a CSV or DataFrame with Selenium/BeautifulSoup?
Question:
So, my code looks like this:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver import ActionChains
import pyautogui
from selenium.webdriver.support.ui import WebDriverWait
keyword='milk'
browser = webdriver.Edge(r"C:\Users\solan\Documents\edgedriver_win32\msedgedriver.exe")
browser.get('https://fdc.nal.usda.gov/fdc-app.html#/?query='+keyword+'')
#element1= browser.find_element_by_xpath('/html/body/div/main/app-root/app-food-search/div/div/div[1]/div[4]/table')
element1= browser.find_elements_by_xpath('//a[@class="result-description"]')
#data=element1.text
for item in element1:
    print(item.text)
Food= input("")
time.sleep(10)
z=browser.find_element_by_link_text(Food).click()
It prints a list of results, from which I type "Yogurt, plain, whole milk" and press Enter. The page that opens has a table of the food's contents. I would like to extract that table directly into a pandas DataFrame or a CSV file.
I am trying this to get the table contents:
for table in browser.find_elements_by_xpath('/html/body/div/main/app-root/app-food-details/div/div/div[2]/div/div/div/app-food-nutrients/div/div[2]/table'):
    print(table.text)
This prints the table's contents as plain text.
table.text is a str, and I am not sure how to fit it into a CSV or a DataFrame. When I try, everything ends up in a single row; the table structure is not detected. Does anyone have any suggestions?
Answers:
That table is being hydrated via an XHR network call (see Dev Tools – network tab). You can do something like this, avoiding the overheads of selenium and whatever heavy artillery you are using:
import requests
import pandas as pd
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
url = 'https://fdc.nal.usda.gov/portal-data/external/2259793'
r = requests.get(url, headers=headers)
#print(r.json())
df = pd.json_normalize(r.json()['foodNutrients'])
print(df)
This will return a dataframe (which you can further save to csv, if you want):
[...] (too big to post it here)
You can inspect that JSON response further and flatten (normalize) it, or select only specific columns from the DataFrame. Pandas docs relevant to reading JSON: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
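As a rough sketch of the "select only specific columns" idea: json_normalize flattens nested objects into dotted column names, which you can then pick by name. The payload below is made up so the snippet runs offline, and the key names (nutrient.name, nutrient.unitName, amount) are assumptions about the response shape; check them against the actual JSON before relying on them.

```python
import pandas as pd

# Hypothetical stand-in for r.json(); the real response has many more fields,
# and these key names are assumptions to verify against the actual payload.
payload = {
    "foodNutrients": [
        {"nutrient": {"name": "Water", "unitName": "g"}, "amount": 85.3},
        {"nutrient": {"name": "Protein", "unitName": "g"}, "amount": 3.82},
    ]
}

# Nested dicts become dotted column names such as "nutrient.name".
df = pd.json_normalize(payload["foodNutrients"])

# Keep only the columns of interest, in a chosen order.
wanted = ["nutrient.name", "amount", "nutrient.unitName"]
subset = df[wanted]

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = subset.to_csv(index=False)
print(csv_text)
```

Passing a filename to to_csv instead (for example subset.to_csv("nutrients.csv", index=False)) writes the same content to disk.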
EDIT: here is a solution based on selenium & pandas, returning only the visible values in that table:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import json
import pandas as pd
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = 'https://fdc.nal.usda.gov/fdc-app.html#/?query=milk'
df_list = []
browser.get(url)
WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, 'Yogurt, plain, whole milk'))).click()
table_w_data = WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.XPATH, '//div[@id="myTabContent"]//table[@id="nutrients-table"]')))
t_header = WebDriverWait(table_w_data, 20).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'th')))
columns = [x.text.strip() for x in t_header if len(x.text.strip())> 0]
print(columns)
columns.remove('Footnote')
print(columns)
rows = WebDriverWait(table_w_data, 20).until(EC.presence_of_all_elements_located((By.TAG_NAME, 'tr')))
for row in rows:
    tds = row.find_elements(By.TAG_NAME, 'td')
    if len(tds) > 1 and len(row.text) > 5:
        print([x.text.strip() for x in tds if len(x.text) > 0])
        df_list.append(([x.text.strip() for x in tds if len(x.text) > 0]))
        print('_______________________')
df = pd.DataFrame(df_list, columns = columns)
print(df)  # display(df) also works, but only inside a Jupyter notebook
df.to_csv('milk_stuffs_nutritional_vals.csv')
This ran quite slowly on my machine. It returns a DataFrame with those values and also saves them to CSV; the result looks like this:
Name Average Amount Unit Deriv. By n Samples Min Max Median Initial Year Acquired
0 Proximates: None None None None None None None None None
1 Water 85.3 g Analytical 8 Samples 81.7 87.4 86.4 2021
2 Energy (Atwater General Factors) 78 kcal Calculated None None None None None None
3 Energy (Atwater Specific Factors) 77 kcal Calculated None None None None None None
4 Nitrogen 0.6 g Analytical 8 Samples 0.49 0.79 0.56 2021
5 Protein 3.82 g Calculated 3.13 5.04 3.59 None None None
6 Total lipid (fat) 4.48 g Analytical 8 Samples 3.66 5.66 4.52 2021
7 Ash 0.85 g Analytical 8 Samples 0.67 1.07 0.77 2021
8 Carbohydrates: None None None None None None None None None
9 Carbohydrate, by difference 5.57 g Calculated None None None None None None
10 Sugars, Total NLEA 4.09 g Summed None None None None None None
11 Sucrose <0.25 g Analytical 8 Samples 2021 None None None
12 Glucose <0.25 g Analytical 8 Samples 2021 None None None
13 Fructose <0.25 g Analytical 8 Samples 2021 None None None
14 Lactose 3.35 g Analytical 8 Samples 2.51 4.3 3.21 2021
15 Maltose <0.25 g Analytical 8 Samples 2021 None None None
16 Galactose 0.75 g Analytical 8 Samples 0.56 0.84 0.78 2021
17 Minerals: None None None None None None None None None
18 Calcium, Ca 127 mg Analytical 8 Samples 101 163 121 2021
19 Iron, Fe <0.1 mg Analytical 8 Samples 2021 None None None
20 Magnesium, Mg 11.4 mg Analytical 8 Samples 8.7 15.2 10.9 2021
21 Phosphorus, P 101 mg Analytical 8 Samples 78 137 95 2021
22 Potassium, K 164 mg Analytical 8 Samples 127 212 160 2021
23 Sodium, Na 42 mg Analytical 8 Samples 36 55 40 2021
24 Zinc, Zn 0.43 mg Analytical 8 Samples 0.32 0.58 0.4 2021
25 Copper, Cu 0.003 mg Analytical 8 Samples 0 0.014 0 2021
26 Manganese, Mn 0.002 mg Analytical 8 Samples 0 0.007 0 2021
27 Iodine, I 32.3 µg Analytical 8 Samples 22 57.1 26.7 2021
28 Vitamins and Other Components: None None None None None None None None None
29 Thiamin 0.055 mg Analytical 8 Samples 0.045 0.07 0.052 2021
30 Riboflavin 0.243 mg Analytical 8 Samples 0.19 0.29 0.242 2021
31 Niacin 0.135 mg Analytical 8 Samples 0.09 0.18 0.14 2021
32 Vitamin B-6 0.045 mg Analytical 8 Samples 0.032 0.07 0.044 2021
33 Biotin <3.7 µg Analytical 8 Samples 2021 None None None
34 Vitamin A None None None None None None None None None
35 Retinol 48 µg Analytical 8 Samples 38 78 43 2021
36 Vitamin D (D2 + D3), International Units 31.1 IU Calculated None None None None None None
37 Vitamin D (D2 + D3) 0.78 µg Calculated None None None None None None
38 Vitamin D2 (ergocalciferol) <0.1 µg Analytical 8 Samples 2021 None None None
39 Vitamin D3 (cholecalciferol) 0.78 µg Analytical 8 Samples 0 1.39 1.1 2021
40 Lipids: None None None None None None None None None
41 Fatty acids, total saturated 2.32 g Analytical 8 Samples 1.83 3.05 2.18 2021
42 SFA 12:0 0.115 g Analytical 8 Samples 0.088 0.156 0.108 2021
43 SFA 14:0 0.382 g Analytical 8 Samples 0.306 0.482 0.369 2021
44 SFA 16:0 1.07 g Analytical 8 Samples 0.787 1.51 0.999 2021
45 SFA 18:0 0.372 g Analytical 8 Samples 0.268 0.445 0.39 2021
46 Fatty acids, total monounsaturated 0.874 g Analytical 8 Samples 0.71 1.14 0.84 2021
47 MUFA 18:1 c 0.766 g Analytical 8 Samples 0.615 1 0.739 2021
48 PUFA 18:2 n-6 c,c 0.082 g Analytical 8 Samples 0.054 0.127 0.074 2021
49 Cholesterol 14 mg Analytical 8 Samples 12 18 14 2021
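Note that in the output above some rows (Protein and Sucrose, for example) look shifted left, with values sitting under the wrong headers and None padding at the end. A likely cause is that the list comprehension drops empty cells, so later values slide into earlier columns. Here is a minimal sketch of one way around that, keeping empty cells as placeholders; the cell texts and the shortened column list are made up for illustration.

```python
import pandas as pd

# Shortened, illustrative column list (not the full header from the page).
columns = ["Name", "Average Amount", "Unit", "Min", "Max"]

# Hypothetical cell texts, as [td.text for td in tds] might produce them.
raw_rows = [
    ["Water", "85.3", "g", "81.7", "87.4"],
    ["Sucrose", "<0.25", "g", "", ""],  # empty Min/Max cells
]

# Map empty strings to None instead of filtering them out,
# so every row keeps exactly one value per column.
rows = [[cell.strip() or None for cell in row] for row in raw_rows]

df = pd.DataFrame(rows, columns=columns)
print(df)
```

If a row carries an extra cell that has no header (the code above removes a 'Footnote' header, which suggests there may be one), it is probably safer to drop that cell by position than by emptiness.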
To extract the <table> data, induce WebDriverWait for visibility_of_element_located() on the <table> element, extract its outerHTML, and read that outerHTML with pandas read_html(). You can use the following locator strategy:
Code Block:
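import time
from io import StringIO
import pandas as pd
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
driver.get('https://fdc.nal.usda.gov/fdc-app.html#/?query=milk')  # driver: an existing WebDriver instance
time.sleep(5)
data = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table.usa-table-results.usa-table-borderless.header-alignment"))).get_attribute("outerHTML")
df = pd.read_html(StringIO(data))  # returns a list of DataFrames, one per <table>
print(df)
driver.quit()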
Console Output:
[ NDB Number ... SR Food Category
0 1036 ... Dairy and Egg Products
1 1116 ... Dairy and Egg Products
2 100276 ... Beverages
3 1019 ... Dairy and Egg Products
4 100277 ... Beverages
5 100275 ... Legumes and Legume Products
6 1293 ... Dairy and Egg Products
7 14091 ... Beverages
8 16222 ... Legumes and Legume Products
9 1077 ... Dairy and Egg Products
10 1082 ... Dairy and Egg Products
11 1085 ... Dairy and Egg Products
12 1079 ... Dairy and Egg Products
[13 rows x 4 columns]]
Now you can easily copy the data into a csv file as follows:
df[0].to_csv("my_data.csv", index=False)
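One caveat: read_html needs an HTML parser library installed (lxml, or BeautifulSoup with html5lib). If neither is available, the same outerHTML string can be turned into CSV with the standard library alone. The following is a minimal sketch, using a small hand-written HTML snippet in place of the real outerHTML; the TableParser class and its attribute names are hypothetical helpers, not part of Selenium or pandas.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written stand-in for element.get_attribute("outerHTML").
html = """
<table>
  <tr><th>NDB Number</th><th>SR Food Category</th></tr>
  <tr><td>1036</td><td>Dairy and Egg Products</td></tr>
  <tr><td>100276</td><td>Beverages</td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

# Write the rows as CSV text; pass a real file object to save to disk instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(parser.rows)
print(buffer.getvalue())
```

Because the page is fetched once and parsed locally, this also avoids a separate WebDriver call for every cell.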