Extract table data from an interactive webpage with Python
Question:
I would like to get the table data from a website so I can analyze it statistically, but I’m stuck on the interactive button that selects each sector of the linked race. How can I iterate through the button list and store each table in a list or a resulting df? An explanation would be appreciated so I can learn how this works. So far I can only extract the text from the first page:
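from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

site = "http://live.fis-ski.com/cc-4023/results-pda.htm"
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Selenium 4.6+ finds chromedriver itself; older versions took the driver path as the first argument
driver = webdriver.Chrome(options=options)
driver.get(site)
try:
    # Wait up to 10 seconds for the results container to appear
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'tab1'))
    )
    print(main.text)
    result = main.text
except:
    driver.quit()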
This gives me just a list of the main page with each sector.
Thanks!
Answers:
With Select you can select the value of the dropdown and change the race type. With .get_attribute('innerText') you can get the values of the hidden rows too (.text doesn’t work for them). With pandas you can store data in a dataframe, eventually saving it to CSV.
import time

import pandas as pd
from selenium.webdriver.support.ui import Select

# Reuses driver, WebDriverWait, EC and By from the question's code
dropdown = WebDriverWait(driver, 9).until(EC.element_to_be_clickable((By.ID, 'int1')))
# The dropdown text has one option per line; drop the 'Auto follow' entry
races = dropdown.text.replace('Auto follow\n','').strip().split('\n')
data = dict.fromkeys(races)
for race in races:
    print(race)
    Select(dropdown).select_by_visible_text(race)
    time.sleep(1)  # give the page time to redraw the table
    rows = driver.find_elements(By.XPATH, "//div[@id='resultpoint1']/ul/li")
    table = []
    for row in rows:
        columns = row.find_elements(By.XPATH, "./div")
        # values = [c.get_attribute('innerText') for c in columns] <-- very slow
        # Read innerText of every cell in the row with a single JavaScript call
        values = driver.execute_script("var result = [];" +
            "var all = arguments[0];" +
            "for (var i=0, max=all.length; i < max; i++) {" +
            " result.push(all[i].innerText);" +
            "} " +
            " return result;", columns)
        table.append(values)
    # Map the cell classes of the last row to the matching header cells
    columns_class = [c.get_attribute('class').split()[0].replace('_order','_rank') for c in columns]
    column_names = [driver.find_element(By.CSS_SELECTOR, '#result-1 .tableheader .'+ c).text for c in columns_class]
    data[race] = pd.DataFrame(table, columns=column_names)
    data[race].to_csv(race+'.csv', index=False)
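The race list is built by splitting the dropdown's text on line breaks ('\n') and dropping the 'Auto follow' entry. As a rough, browser-free sketch of that step: the helper name parse_race_names and the sample text below are made up for illustration, and the real option labels on the page may differ.

```python
def parse_race_names(dropdown_text, skip=("Auto follow",)):
    """Split a dropdown's text into option labels, dropping non-race entries."""
    # splitlines() handles '\n' and '\r\n'; strip() removes stray whitespace
    labels = [line.strip() for line in dropdown_text.splitlines()]
    # Skip empty lines and any label listed in `skip`
    return [label for label in labels if label and label not in skip]


# Hypothetical dropdown text, one option per line (not copied from the site)
sample_text = "Auto follow\n2500 m\n5000 m\nFinish"
races = parse_race_names(sample_text)
print(races)
```

Filtering by label rather than using replace() also keeps working if 'Auto follow' is not the first option.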
Each table is saved in the dict data under its race name; for example, data['2500 m'] is the DataFrame for the 2500 m table.
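Since the question also asks for a single resulting df, one possible way to get there is to stack the per-race tables with pd.concat and keep the race name as a column. This is only a sketch: the helper combine_tables, the column names and the values below are invented placeholders, not data from the site.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped tables; columns and values are
# placeholders for illustration only.
data = {
    "2500 m": pd.DataFrame({"Rank": ["1", "2"], "Name": ["Skier A", "Skier B"]}),
    "5000 m": pd.DataFrame({"Rank": ["1", "2"], "Name": ["Skier B", "Skier A"]}),
}


def combine_tables(tables):
    """Stack per-race DataFrames into one, tagging each row with its race."""
    # Passing a dict makes the keys the outer index level, named "race" here
    combined = pd.concat(tables, names=["race", "row"])
    # Move "race" from the index into a regular column and renumber the rows
    return combined.reset_index(level="race").reset_index(drop=True)


all_results = combine_tables(data)
print(all_results)
```

If the tables have different columns, pd.concat takes the union of the columns and fills the missing cells with NaN.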