Extract table data from interactive webpage with Python

Question:

I would like to extract the table data from a website so I can analyze it statistically, but I am stuck on the interactive button that selects each sector of the linked race. How can I iterate through the button list and store each table in a list or in a resulting df? An explanation would be appreciated so I can learn how this works. So far I can only extract the text from the first page:

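from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

site = "http://live.fis-ski.com/cc-4023/results-pda.htm"

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome('chromedriver', options=options)
driver.get(site)

try:
    # Wait up to 10 seconds for the results container to appear
    main = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'tab1'))
    )
    print(main.text)
    result = main.text
except Exception:
    driver.quit()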

This gives me just a list of the main page with each sector.

Thanks!

Asked By: Hansson

||

Answers:

With Select you can choose a value in the dropdown and so switch to another sector of the race. With .get_attribute('innerText') you can read the values of the hidden rows too (.text only returns visible text, so it doesn’t work for them). With pandas you can store the data in a DataFrame and then save it to CSV.

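import time

import pandas as pd
from selenium.webdriver.support.ui import Select

# `driver`, `By`, `WebDriverWait` and `EC` come from the setup in the question
dropdown = WebDriverWait(driver, 9).until(EC.element_to_be_clickable((By.ID, 'int1')))
# The dropdown text has one option per line; drop the 'Auto follow' entry
races = dropdown.text.replace('Auto follow\n', '').strip().split('\n')
data = dict.fromkeys(races)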

for race in races:
    print(race)
    Select(dropdown).select_by_visible_text(race)
    time.sleep(1)
    rows = driver.find_elements(By.XPATH, "//div[@id='resultpoint1']/ul/li")
    table = []
    
    for row in rows:
        columns = row.find_elements(By.XPATH, "./div")
        # values = [c.get_attribute('innerText') for c in columns] <-- very slow
        values = driver.execute_script("var result = [];" +
        "var all = arguments[0];" +
        "for (var i=0, max=all.length; i < max; i++) {" +
        "    result.push(all[i].innerText);" +
        "} " +
        " return result;", columns)
        table.append(values)

    columns_class = [c.get_attribute('class').split()[0].replace('_order','_rank') for c in columns]
    column_names = [driver.find_element(By.CSS_SELECTOR, '#result-1 .tableheader .'+ c).text for c in columns_class]
    data[race] = pd.DataFrame(table, columns=column_names)
    data[race].to_csv(race+'.csv', index=False)
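
To see how the list of races is built before the loop starts, the text-splitting step can be tried without a browser. This is only a sketch: parse_races is a hypothetical helper, and the sample text is made up (only '2500 m' appears in this answer), so the real dropdown may contain different entries.

```python
def parse_races(dropdown_text, skip=("Auto follow",)):
    """Split a dropdown's visible text into option labels, dropping those in `skip`."""
    labels = [line.strip() for line in dropdown_text.split("\n")]
    return [label for label in labels if label and label not in skip]


# Made-up dropdown text: one option per line, as Selenium's .text would return it
sample_text = "Auto follow\n2500 m\n5000 m\nFinish"

races = parse_races(sample_text)
# One key per race, each value starting out as None until its table is stored
data = dict.fromkeys(races)
print(races)
```

Each label that survives the filter is then passed to select_by_visible_text in the loop, and the matching key in data is filled with that race's DataFrame.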

Each table is stored in the dict data, keyed by race name; for example, data['2500 m'] returns the DataFrame for that race.

[Image: the printed data['2500 m'] DataFrame]
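
Since the question also asks for a single resulting df, one possible way to get there is to stack the per-race tables with pd.concat. This is a minimal sketch, not part of the original answer: the small DataFrames below are made-up stand-ins for the scraped tables, and the column names Rank and Name are assumptions.

```python
import pandas as pd

# Made-up stand-in for the `data` dict built by the scraping loop;
# the columns and values are placeholders, not real results.
data = {
    "2500 m": pd.DataFrame({"Rank": [1, 2], "Name": ["Skier A", "Skier B"]}),
    "5000 m": pd.DataFrame({"Rank": [1, 2], "Name": ["Skier B", "Skier A"]}),
}

# Passing a dict to concat uses its keys as the outer index level,
# which is then turned into an ordinary 'race' column.
combined = (
    pd.concat(data, names=["race", "row"])
    .reset_index(level="race")
    .reset_index(drop=True)
)
print(combined)
```

This assumes every race table has the same columns. If they differ, concat still works but fills the missing cells with NaN.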

Answered By: sound wave