Selenium XPATH keeps grabbing the wrong tag element within loop

Question:

I’m currently web scraping my university webpage to download unit content. I’ve figured out how to collect the names/links for each unit, and am now trying to figure out how to collate the names/links for each individual module within a unit.

A rough description of the HTML on the modules page.

<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>

    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>
</ul>

So I am trying to grab the link from the href attribute of the <a> tag within li/div/h3, and the name of the module from the <span> inside that <a> tag. Here is the relevant code snippet.

    modules = []
   
    driver.get(unit_url)

    module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")    #Grab the ul list

    li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  #Grab each li item

    for item in li_items[1:]:              #Skips first li tag as that is the Overview, not a module

        module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href') #These are not moving on from the first module for some reason...
        module_name = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a/span").text

        module = {
            "name": module_name,
            "url": module_url
        }

        modules.append(module)

The issue/question:

Edit

I’ve tried @Sushil’s and @QHarr’s solutions with no luck unfortunately. I should point out that the lines grabbing module_name and module_url within the for loop are returning the same first module’s data every LOOP. I’ve tested it with a different unit where the first couple of <li> tags are non-modules (introduction), so those should be what comes back, but it is still only returning the same first module.

/edit

Edit 2

Here is a link to the HTML I am trying to scrape. This isn’t the entire page as that would be way too big.

<html><body><div></div><div></div><div></div><div> This is the DIV that is in the link </div><div></div><div></div></body></html>

I have verified that li_items definitely contains the <li> tags I need so the other HTML shouldn’t be important (I think).

If you scroll about a quarter of the way down, the <li> tags I need are bolded and the information I need to scrape is underlined.

/Edit 2

The lines that grab the module_name and module_url within the for loop are only grabbing the info for the first module.

I have verified through debugging that li_items does contain all the li items and is not just grabbing the first one. I’m new to Selenium, so my thinking is that there is something wrong with the xpath I have provided, but it should only be grabbing the tags within the item element being iterated over. So I am confused as to why it keeps grabbing the first li item’s info.

Answer Edit

Using @Sariq Shaikh’s answer I’ve solved the issue. Initially his technique of using indexing ([]) to iterate over the <li> tags wasn’t working, but altering the XPATH used for module_url and module_name to include the <ul> tag and then using indexing on the <li> tag solved my issue.

However I still do not understand why the original method was not working. Here is the altered code.

    from selenium.common import exceptions as SelException

    module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")

    ctr = 1

    for _ in module_ul.find_elements_by_tag_name('li'):

        try:

            module_url = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a').get_attribute('href')

            module_name = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a/span').text

        except SelException.NoSuchElementException:

            print("NoSuchElementException")
            ctr += 1
            continue

        #store the module and move on to the next <li>
        modules.append({"name": module_name, "url": module_url})
        ctr += 1
Asked By: Vehicular IT


Answers:

This is actually very easy with BeautifulSoup. Here is how you do it:

from bs4 import BeautifulSoup
html = """
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>

    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
             <h3>
                  <a href="Link To Module">
                       <span>Name of Module</span>
                  </a>
             </h3>
        </div>
    </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

lis = soup.find_all('li', class_='clearfix liItem read')

for li in lis:
    print(li.div.h3.a['href'])

Output:

Link To Module
Link To Module

Hope that this helps!

EDIT:

Since your website is dynamically loaded using JavaScript, you should first open the URL in Selenium, get the HTML code of the website and close the browser. Here is how you do it:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
html = driver.page_source

You can then parse this HTML using BeautifulSoup. Hope that this helps!
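
Putting the two steps together, here is a minimal sketch of how that could look for the page in the question (this assumes unit_url from your own snippet, and that the rendered page contains the <ul>/<li> markup you posted):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(unit_url)              # unit_url as defined in the question's code
html = driver.page_source         # the rendered HTML, after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')

modules = []
for li in soup.find_all('li', class_='clearfix liItem read'):
    a = li.div.h3.a               # the <a> nested under li > div > h3
    modules.append({"name": a.span.text, "url": a['href']})

print(modules)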

Answered By: Sushil

You should be able to use CSS selectors and avoid a loop.

import pandas as pd

results = pd.DataFrame(zip([i.text for i in driver.find_elements_by_css_selector('#content_listContainer span')]
                           , [i.get_attribute('href') for i in driver.find_elements_by_css_selector('#content_listContainer a')])
                           , columns = ['Name', 'Link'])

print(results)
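
If you would rather skip pandas, the same pair of selectors can feed a plain list of dicts instead; here is a sketch that mirrors the modules structure from the question:

names = [i.text for i in driver.find_elements_by_css_selector('#content_listContainer span')]
links = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('#content_listContainer a')]

# pair each module name with its link
modules = [{"name": name, "url": url} for name, url in zip(names, links)]
print(modules)
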
Answered By: QHarr

To grab all the list items iteratively, you can use xpath with an index, as shown below.

(//div[@class='item clearfix'])[1] #first li item index starts from 1 not 0
(//div[@class='item clearfix'])[2] #second li item
(//div[@class='item clearfix'])[3] #third li item
(//div[@class='item clearfix'])[4] #fourth li item

After getting each li item by index, you can access its child elements by extending the xpath, as shown below.

(//div[@class='item clearfix'])[1]/h3/a #first li's h3/a tag

Considering this, you can update your code as shown below, using a simple counter to get list elements based on index.

modules = []
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")    #Grab the ul list
li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  #Grab each li item

counter = 1 #use counter to iterate over all the li items based on index
for item in li_items:
    #append counter values as index for list items in xpath
    module_url = item.find_element_by_xpath("(//div[@class='item clearfix'])["+str(counter)+"]/h3/a").get_attribute('href') 
    module_name = item.find_element_by_xpath("(//div[@class='item clearfix'])["+str(counter)+"]/h3/a/span").text

    module = {
           "name": module_name,
           "url": module_url
    }

    modules.append(module)
    counter = counter + 1
    
#remove the first item from the list as its not required
modules.pop(0)
print(modules)
Answered By: Sariq Shaikh

I’ve just run into a very similar issue and whilst I’m not exactly sure as to why, I think I’ve found a solution:

If you replace

module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')

with

module_url = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a").get_attribute('href')

as in, replace the // with ./ at the start of your xpath (and make the same substitution in the module_name xpath), then I think it should work. I tried it against the html you provided and it seems to work. Again, really not sure why it works, I’ve tried looking into the XPath docs but it’s all Greek to me honestly.
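
The likely reason is that an XPath expression starting with // is evaluated from the document root even when find_element_by_xpath is called on an element, so every pass of the loop matches the first <a> on the page, whereas a path starting with ./ is evaluated relative to the context element. Applied to the loop from the question, a sketch with relative paths could look like this:

modules = []
for item in li_items[1:]:   # li_items from the question, still skipping the Overview entry
    # './' anchors the search to this <li> rather than the whole document
    link = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a")
    modules.append({
        "name": link.find_element_by_xpath("./span").text,
        "url": link.get_attribute('href'),
    })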

Answered By: electricschmidt