How to use Selenium with Python to get a field from each linked page

Question:

The context is SpringerLink. For example, take this series of books: GTM

I want to get the information located at the bottom of each book’s webpage (screenshot: book info).

All I want is the E-ISBN information on each page.

Is there a way (not limited to Selenium) to enumerate each book page and get this information?

Asked By: Kushinada


Answers:

You can open each book through its link on the website in a separate tab. After switching to the new tab, induce WebDriverWait for visibility_of_element_located() and then extract any of the desired info. As an example, to extract the Hardcover ISBN you can use the following locator strategies:

  • Code Block:

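    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome; any WebDriver should work
    driver.get('https://www.springer.com/series/136/books')
    # Accept the cookie banner so it does not cover the book links
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[data-cc-action='accept']"))).click()
    # Collect the link of every book listed on the series page
    hrefs = [my_elem.get_attribute("href") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-track='click'][data-track-label^='article'][href]")))]
    main_window = driver.current_window_handle
    for href in hrefs:
        # Open the book page in a new tab and switch to it
        driver.execute_script("window.open(arguments[0]);", href)
        WebDriverWait(driver, 5).until(EC.number_of_windows_to_be(2))
        new_window = [handle for handle in driver.window_handles if handle != main_window][0]
        driver.switch_to.window(new_window)
        # Print the value that follows the 'Hardcover ISBN' label
        print(WebDriverWait(driver, 5).until(EC.visibility_of_element_located((By.XPATH, "//span[text()='Hardcover ISBN']//following::span[@class='c-bibliographic-information__value']"))).text)
        # Close the tab and return to the series page
        driver.close()
        driver.switch_to.window(main_window)
    driver.quit()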
    
  • Console Output:

    978-3-031-25631-8
    978-3-031-19706-2
    978-3-031-13378-7
    978-3-031-00941-9
    978-3-031-14204-8
    978-3-030-56692-0
    978-3-030-73838-9
    978-3-030-71249-5
    978-3-030-35117-5
    978-3-030-59241-7
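
The question asks for the E-ISBN rather than the Hardcover ISBN, so the XPath above would need a different label. A minimal sketch of a helper that builds the same XPath for any label follows. The function name `isbn_value_xpath` is hypothetical, and "eBook ISBN" is only an assumed label text, so check the wording actually shown on the page before relying on it.

```python
def isbn_value_xpath(label: str) -> str:
    """Build the XPath from the answer above for a given field label."""
    # The label is embedded in a single-quoted XPath string literal,
    # so a single quote inside it would break the expression.
    if "'" in label:
        raise ValueError("label must not contain a single quote")
    return (
        f"//span[text()='{label}']"
        "//following::span[@class='c-bibliographic-information__value']"
    )


# "eBook ISBN" is an assumption about the label text on the book page.
print(isbn_value_xpath("eBook ISBN"))
```

The returned string could then be passed as `(By.XPATH, isbn_value_xpath(...))` to `visibility_of_element_located` in place of the hard-coded expression.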
    
Answered By: undetected Selenium

For this easy task you can use either Selenium or BeautifulSoup, but the latter is easier and faster, so let’s use it to get the titles and E-ISBN codes.

First install BeautifulSoup and requests with the command pip install beautifulsoup4 requests.

Method 1 (faster): get E-ISBN directly from books list

Notice that in the books list for each book there is an eBook link, which is something like https://www.springer.com/book/9783031256325 where 9783031256325 is the EISBN code without the - characters.


So we can get the EISBN codes directly from those urls, without the need to load a new page for each book:

import requests
from bs4 import BeautifulSoup

url = 'https://www.springer.com/series/136/books'
soup = BeautifulSoup(requests.get(url).text, "html.parser")
titles = [title.text.strip() for title in soup.select('.c-card__title')]
EISBN = []
for a in soup.select('ul:last-child .c-meta__item:last-child a'):
    # a['href'] is something like https://www.springer.com/book/9783031256325
    c = a['href'].split('/')[-1]
    # Insert four '-' into 9783031256325 to create the E-ISBN 978-3-031-25632-5.
    # This fixed split fits the codes listed here; ISBNs with a longer or shorter publisher prefix hyphenate differently.
    EISBN.append(f'{c[:3]}-{c[3]}-{c[4:7]}-{c[7:12]}-{c[-1]}')

for eisbn, title in zip(EISBN, titles):
    print(eisbn, title)

Output

978-3-031-25632-5 Random Walks on Infinite Groups
978-3-031-19707-9 Drinfeld Modules
978-3-031-13379-4 Partial Differential Equations
978-3-031-00943-3 Stationary Processes and Discrete Parameter Markov Processes
978-3-031-14205-5 Measure Theory, Probability, and Stochastic Processes
978-3-030-56694-4 Quaternion Algebras
978-3-030-73839-6 Mathematical Logic
978-3-030-71250-1 Lessons in Enumerative Combinatorics
978-3-030-35118-2 Basic Representation Theory of Algebras
978-3-030-59242-4 Ergodic Dynamics
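
Since Method 1 trusts that the last URL segment really is an ISBN, a small sanity check may help. The sketch below validates the ISBN-13 check digit (weights 1 and 3 alternating, sum divisible by 10) before formatting. The function names `isbn13_is_valid` and `eisbn_from_url` are hypothetical, and the hyphen positions are the same fixed split used above, which is an assumption that holds for the codes in this output but not for every publisher prefix.

```python
from urllib.parse import urlparse


def isbn13_is_valid(digits: str) -> bool:
    """Return True if digits is a 13-digit string with a correct check digit."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # ISBN-13 weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def eisbn_from_url(url: str) -> str:
    """Extract the ISBN from an eBook link and hyphenate it."""
    # Take the last path segment, ignoring a trailing slash or query string.
    digits = urlparse(url).path.rstrip("/").split("/")[-1]
    if not isbn13_is_valid(digits):
        raise ValueError(f"not a valid ISBN-13: {digits!r}")
    # Assumption: 3-digit prefix, 1-digit group, 3-digit publisher,
    # 5-digit title, 1 check digit, as in the codes listed above.
    return f"{digits[:3]}-{digits[3]}-{digits[4:7]}-{digits[7:12]}-{digits[12]}"


print(eisbn_from_url("https://www.springer.com/book/9783031256325"))
```

With this check, a link that does not end in an ISBN raises an error instead of silently producing a malformed code.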

Method 2 (slower): get E-ISBN by loading a page for each book

This method loads the details page for each book and extracts the E-ISBN code from there:

# Reuses the imports and the url variable from Method 1
soup = BeautifulSoup(requests.get(url).text, "html.parser")
books = soup.select('a[data-track-label^="article"]')
titles, EISBN = [], []

for book in books:
    titles.append(book.text.strip())
    soup_book = BeautifulSoup(requests.get(book['href']).text, "html.parser")
    EISBN.append( soup_book.select('p:has(span[data-test=electronic_isbn_publication_date]) .c-bibliographic-information__value')[0].text )

In case you are wondering, p:has(span[data-test=electronic_isbn_publication_date]) selects the p element that contains a span with the attribute data-test=electronic_isbn_publication_date.
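
To see how the :has() selector behaves, here is a small sketch that runs it against inline HTML instead of the live site. The markup is a hypothetical, simplified stand-in for the real page (only the data-test value and the class name are taken from the selector above; the print_isbn value and the label texts are made up for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real Springer page is more complex.
html = """
<div>
  <p><span data-test="print_isbn">Hardcover ISBN</span>
     <span class="c-bibliographic-information__value">978-3-031-25631-8</span></p>
  <p><span data-test="electronic_isbn_publication_date">eBook ISBN</span>
     <span class="c-bibliographic-information__value">978-3-031-25632-5</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
selector = (
    "p:has(span[data-test=electronic_isbn_publication_date]) "
    ".c-bibliographic-information__value"
)
# Only the value inside the <p> that contains the matching span is selected.
matches = soup.select(selector)
eisbn = matches[0].text.strip() if matches else None
print(eisbn)
```

Checking that `matches` is non-empty before indexing avoids an IndexError on pages that have no E-ISBN field.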

Answered By: sound wave