Selenium XPATH keeps grabbing the wrong tag element within loop
Question:
I’m currently web scraping my university webpage to download unit content. I’ve figured out how to collect the names/links to each unit, and am now trying to work out how to collect the names/links of each individual module within a unit.
A rough description of the HTML on the modules page.
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3>
                <a href="Link To Module">
                    <span>Name of Module</span>
                </a>
            </h3>
        </div>
    </li>
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3>
                <a href="Link To Module">
                    <span>Name of Module</span>
                </a>
            </h3>
        </div>
    </li>
</ul>
So I am trying to grab the link inside the href attribute of the <a> tag within li/div/h3, and the name of the module within the <span> inside that <a> tag. Here is the relevant code snippet.
modules = []
driver.get(unit_url)
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")  # Grab the ul list
li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  # Grab each li item
for item in li_items[1:]:  # Skips first li tag as that is the Overview, not a module
    module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')
    # These are not moving on from the first module for some reason...
    module_name = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a/span").text
    module = {
        "name": module_name,
        "url": module_url
    }
    modules.append(module)
The issue/question:
Edit
I’ve tried @sushii’s and @QHarr’s solutions with no luck, unfortunately. I should point out that the lines grabbing module_name and module_url within the for loop return the same first-module data on every iteration. I’ve tested it with a different unit where the first couple of <li> tags are non-modules (introduction), so those should be what is returned, but it still only returns the same module 1 data.
/edit
Edit 2
Here is a link to the HTML I am trying to scrape. This isn’t the entire page, as that would be way too big.
<html><body><div></div><div></div><div></div><div>
This is the DIV that is in the link </div><div></div><div></div></body></html>
I have verified that li_items definitely contains the <li> tags I need, so the other HTML shouldn’t be important (I think).
If you scroll about a quarter of the way down, the <li> tags I need are bolded and the information I need to scrape is underlined.
/Edit 2
The lines that grab the module_name and module_url within the for loop are only grabbing the info for the first module.
I have verified through debugging that li_items does contain all the li items and is not just grabbing the first one. I’m new to Selenium, so my thinking is that there is something wrong with the XPath I have provided, but it should only be grabbing the tags within the item object of the current iteration. So I am confused as to why it keeps grabbing the first li item’s info.
Answer Edit
Using @Sariq Shaikh’s answer I’ve solved the issue. Initially his technique of indexing the elements with [] to iterate over the <li> tags wasn’t working, but altering the XPath used for module_url and module_name to include the <ul> tag, and then indexing the <li> tag, solved my issue.
However, I still do not understand why the original method was not working. Here is the altered code.
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")
ctr = 1
for _ in module_ul.find_elements_by_tag_name('li'):
    try:
        # Look up the ctr-th <li> directly under the <ul> (XPath indexes start at 1)
        module_url = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a').get_attribute('href')
        module_name = driver.find_element_by_xpath('//ul[@id="content_listContainer"]/li[' + str(ctr) + ']/div/h3/a/span').text
    except SelException.NoSuchElementException:
        print("NoSuchElementException")
        ctr += 1
        continue
Answers:
This is actually very easy with BeautifulSoup. Here is how you do it:
from bs4 import BeautifulSoup

html = """
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3>
                <a href="Link To Module">
                    <span>Name of Module</span>
                </a>
            </h3>
        </div>
    </li>
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3>
                <a href="Link To Module">
                    <span>Name of Module</span>
                </a>
            </h3>
        </div>
    </li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')
lis = soup.find_all('li', class_='clearfix liItem read')
for li in lis:
    print(li.div.h3.a['href'])
Output:
Link To Module
Link To Module
Hope that this helps!
EDIT:
Since your website is dynamically loaded using JavaScript, you should first open the URL in Selenium, get the HTML of the page, and close the browser. Here is how you do it:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get(url)  # url is the page you want to scrape
html = driver.page_source
driver.quit()  # close the browser once the HTML has been captured
You can then parse this HTML using BeautifulSoup. Hope that this helps!
You should be able to use CSS selectors and avoid a loop.
import pandas as pd

# Module names come from the <span> elements, links from the <a> elements
names = [i.text for i in driver.find_elements_by_css_selector('#content_listContainer span')]
links = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('#content_listContainer a')]
results = pd.DataFrame(list(zip(names, links)), columns=['Name', 'Link'])
print(results)
To grab all the list items iteratively, you can use XPath with an index, as shown below.
(//div[@class='item clearfix'])[1] #first li item; the index starts from 1, not 0
(//div[@class='item clearfix'])[2] #second li item
(//div[@class='item clearfix'])[3] #third li item
(//div[@class='item clearfix'])[4] #fourth li item
After getting each li item by index, you can access its child elements by extending the XPath, as shown below.
(//div[@class='item clearfix'])[1]/h3/a #first li's h3/a tag
With this in mind, you can update your code as shown below to use a simple counter to get the list elements by index.
modules = []
module_ul = driver.find_element_by_xpath("//ul[@id='content_listContainer']")  # Grab the ul list
li_items = module_ul.find_elements_by_xpath("//li[@class='clearfix liItem read']")  # Grab each li item
counter = 1  # use counter to iterate over all the li items based on index
for item in li_items:
    # append counter value as index for list items in xpath
    module_url = item.find_element_by_xpath("(//div[@class='item clearfix'])[" + str(counter) + "]/h3/a").get_attribute('href')
    module_name = item.find_element_by_xpath("(//div[@class='item clearfix'])[" + str(counter) + "]/h3/a/span").text
    module = {
        "name": module_name,
        "url": module_url
    }
    modules.append(module)
    counter = counter + 1
# remove the first item from the list as it's not required
modules.pop(0)
print(modules)
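The index-based lookup can be tried without a browser. This is only a rough sketch: it assumes Python's standard-library xml.etree.ElementTree as a stand-in for Selenium, and the hrefs, module names, and the helper module_at are made up for illustration. ElementTree supports only a subset of XPath, so the path is written relative to the <ul> (closer to the li[n] form in the question's altered code) rather than with the parenthesised (//div[...])[n] form.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the first <li> is the Overview, the rest are modules.
INDEXED_SAMPLE = """
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <div class="item clearfix">
            <h3><a href="/overview"><span>Overview</span></a></h3>
        </div>
    </li>
    <li id="" class="clearfix liItem read">
        <div class="item clearfix">
            <h3><a href="/module-1"><span>Module 1</span></a></h3>
        </div>
    </li>
    <li id="" class="clearfix liItem read">
        <div class="item clearfix">
            <h3><a href="/module-2"><span>Module 2</span></a></h3>
        </div>
    </li>
</ul>
"""

root_ul = ET.fromstring(INDEXED_SAMPLE)


def module_at(ul, index):
    """Return name/url for the index-th <li> (1-based, as in XPath), or None."""
    link = ul.find(f"./li[{index}]/div/h3/a")
    if link is None:
        return None
    return {"name": link.findtext("span", default="").strip(), "url": link.get("href")}


li_count = len(root_ul.findall("./li"))
# Start at 2 to skip the Overview item instead of popping it afterwards.
indexed_modules = [module_at(root_ul, i) for i in range(2, li_count + 1)]
print(indexed_modules)
```

Each lookup names its own position explicitly, so it cannot fall back to the first match the way the original loop did.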
I’ve just run into a very similar issue, and whilst I’m not exactly sure why, I think I’ve found a solution:
If you replace
module_url = item.find_element_by_xpath("//div[@class='item clearfix']/h3/a").get_attribute('href')
with
module_url = item.find_element_by_xpath("./div[@class='item clearfix']/h3/a").get_attribute('href')
as in, replace the // with ./ at the start of your XPath (and make the same substitution in the module_name XPath), then I think it should work. I tried it against the HTML you provided and it seems to work. Again, really not sure why it works; I’ve tried looking into the XPath docs but it’s all Greek to me, honestly.
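For what it's worth, the likely reason is how XPath treats a leading //: an expression that starts with // is evaluated from the root of the document, not from the element the search was called on. So item.find_element_by_xpath("//div[...]") scans the whole page and returns the first match every time, whereas a path starting with ./ (or .//) is evaluated relative to item. Below is a minimal sketch of the relative lookup, assuming Python's standard-library xml.etree.ElementTree as a browser-free stand-in for Selenium; the hrefs, names, and the helper collect_modules are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data modelled on the HTML in the question.
SAMPLE = """
<ul id="content_listContainer" class="contentList">
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3><a href="/module-1"><span>Module 1</span></a></h3>
        </div>
    </li>
    <li id="" class="clearfix liItem read">
        <img></img>
        <div class="item clearfix">
            <h3><a href="/module-2"><span>Module 2</span></a></h3>
        </div>
    </li>
</ul>
"""


def collect_modules(markup):
    """Collect name/url pairs, searching relative to each <li>."""
    root = ET.fromstring(markup)
    found = []
    for item in root.findall("./li"):
        # The leading "./" anchors the search at the current <li>,
        # so each iteration sees only its own <div>.
        link = item.find("./div[@class='item clearfix']/h3/a")
        if link is None:
            continue
        found.append({
            "name": link.findtext("span", default="").strip(),
            "url": link.get("href"),
        })
    return found


modules = collect_modules(SAMPLE)
print(modules)
```

In the Selenium code from the question, the equivalent change is the one described above: start the XPath passed to item.find_element_by_xpath with ./ so it is resolved against item.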