How to get more detail when scraping from a web-based table?

Question:

I’m playing around with a few options to screen scrape a corporate web site. It’s behind a firewall so I can’t share it here. Anyway, I’ve got everything working pretty well, with one exception. I can’t seem to get the level of detail that I would like to see.

I’m using Selenium and this one call to grab data from each table on each URL (By is imported from selenium.webdriver.common.by):

element = wd.find_element(By.ID, 'dags')

So, I reference the table, dump the data, and write everything to a CSV file. I can get analytics for a bunch of tasks, and the analytics could look like this:

1 0 2 1 4 2

The problem is, when I open the CSV file, I will see this:

102142

So, everything is bunched up together. Is there a better way to get data from a web table? I was thinking of pulling TR elements and TD elements, but I’m not sure how that would work.

Asked By: ASH


Answers:

Sharing the page source would make the question clearer, but here is an example that scrapes a table from Wikipedia:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

driver.get(url)

# XPath of the table body; it depends on the page layout and may need adjusting
tbody = '//*[@id="mw-content-text"]/div/table[2]/tbody/tr[2]/td[1]/table/tbody'

# Wait up to 20 seconds for the table to be present in the DOM
WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, tbody)))

# Collect one column at a time: td[1], td[2] and td[3] of every row
rankings = driver.find_elements(By.XPATH, tbody + "/tr/td[1]")
rankings = [x.text for x in rankings]

countries = driver.find_elements(By.XPATH, tbody + "/tr/td[2]")
countries = [x.text for x in countries]

gdps = driver.find_elements(By.XPATH, tbody + "/tr/td[3]")
gdps = [x.text for x in gdps]

# zip() is lazy in Python 3, so build a list before slicing it
data = list(zip(rankings, countries, gdps))

print(data[:10])

The output would look like this:

[(' ', ' World[19]', '79,865,481'), ('1', ' United States', '19,390,600'), ('2', ' China[n 1]', '12,014,610'), ('3', ' Japan', '4,872,135'), ('4', ' Germany', '3,684,816'), ('5', ' United Kingdom', '2,624,529'), ('6', ' India', '2,611,012'), ('7', ' France', '2,583,560'), ('8', ' Brazil', '2,054,969'), ('9', ' Italy', '1,937,894')]
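
The question's idea of pulling TR and TD elements can also be sketched row by row. This is only a minimal sketch under stated assumptions: table_to_rows is a hypothetical helper name, and it assumes the table element (for example the one with id 'dags' from the question) offers Selenium 4's find_elements(by, value). The locator strings are written out literally; they are the values behind By.TAG_NAME and By.CSS_SELECTOR.

```python
# Locator strategy strings: the values of selenium's By.TAG_NAME and
# By.CSS_SELECTOR, spelled out so this helper needs no import of its own.
TAG_NAME = "tag name"
CSS_SELECTOR = "css selector"


def table_to_rows(table):
    """Return the table as a list of rows, each row a list of cell texts.

    `table` is assumed to be a Selenium WebElement for a <table> (or <tbody>).
    Hypothetical helper, not part of Selenium itself.
    """
    rows = []
    for tr in table.find_elements(TAG_NAME, "tr"):
        # Header cells (th) and data cells (td) are both kept.
        # Note: cells of any nested table inside this row are matched too.
        cells = tr.find_elements(CSS_SELECTOR, "th, td")
        if cells:  # skip rows that contain no cells
            rows.append([cell.text.strip() for cell in cells])
    return rows
```

With a live driver this might be called as rows = table_to_rows(wd.find_element(By.ID, 'dags')). Keeping every cell as its own list item, rather than reading the text of the whole table at once, should keep values such as 1 0 2 1 4 2 from collapsing into 102142.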

Once you have a data structure like this, it is easy to write it to a CSV file or persist it in some other form.
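
As a hedged illustration of that last step, the sketch below writes a list of row tuples with Python's standard csv module. The helper name write_rows and the header labels are assumptions made for the example; the two sample rows are taken from the output shown above. It writes to an in-memory buffer so it runs anywhere; for a real file, open it with open(path, "w", newline="", encoding="utf-8") and pass that instead.

```python
import csv
import io


def write_rows(fileobj, rows, header=None):
    """Write rows (iterables of cell values) to fileobj as CSV.

    Hypothetical helper: csv.writer puts a delimiter between cells and
    quotes any cell that itself contains a comma, such as '19,390,600'.
    """
    writer = csv.writer(fileobj)
    if header:
        writer.writerow(header)
    writer.writerows(rows)


# Two rows from the sample output, with the leading spaces stripped.
sample = [
    ("1", "United States", "19,390,600"),
    ("3", "Japan", "4,872,135"),
]

buffer = io.StringIO()
write_rows(buffer, sample, header=["rank", "country", "gdp"])
csv_text = buffer.getvalue()
print(csv_text)
```

Letting csv.writer join the cells, instead of concatenating strings by hand, is what keeps each value in its own column when the file is opened in a spreadsheet.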

Let me know if this helps.

Answered By: Lafa