Fetch all href links using Selenium in Python
Question:
I am practicing Selenium in Python and I want to fetch all the links on a web page using Selenium.
For example, I want the value of the href attribute of every <a> tag on http://psychoticelites.com/
I’ve written a script and it runs, but it prints the object addresses instead of the URLs. I’ve tried using id to get the value, but that doesn’t work.
My current script:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://psychoticelites.com/")
assert "Psychotic" in driver.title
continue_link = driver.find_element_by_tag_name('a')
elem = driver.find_elements_by_xpath("//*[@href]")
#x = str(continue_link)
#print(continue_link)
print(elem)
Answers:
You simply have to loop through the list:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
find_elements_by_* returns a list of elements (note the plural ‘elements’). Loop through the list, take each element, and fetch the attribute value you want from it (in this case href).
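To see the difference between printing the list and printing each attribute without launching a browser, here is a minimal sketch. `FakeElement` and `collect_hrefs` are hypothetical names made up for this illustration; `FakeElement` only mimics the `get_attribute` method of a Selenium element and is not part of Selenium.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, attributes):
        self._attributes = attributes

    def get_attribute(self, name):
        # Return None when the attribute is absent.
        return self._attributes.get(name)


def collect_hrefs(elements):
    """Return the href of every element that has one, in order, without duplicates."""
    seen = set()
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href is not None and href not in seen:
            seen.add(href)
            hrefs.append(href)
    return hrefs


elems = [
    FakeElement({"href": "http://example.com/a"}),
    FakeElement({"id": "no-link"}),
    FakeElement({"href": "http://example.com/b"}),
    FakeElement({"href": "http://example.com/a"}),
]

# Printing the list shows object representations, which is what the question observed.
print(elems)
# Looping and reading the attribute gives the actual URLs.
print(collect_hrefs(elems))
```

With a real driver, the list returned by the XPath lookup could be passed to `collect_hrefs` in place of the fake elements.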
You can load the page’s HTML DOM using the htmldom library for Python. It is available here and can be installed with pip:
https://pypi.python.org/pypi/htmldom/2.0
from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")
dom = dom.createDom()
The above code creates an HtmlDom object. HtmlDom takes a default parameter, the URL of the page. Once the object is created, you need to call its “createDom” method. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.
You can query the elements using the “find” method of the HtmlDom object:
p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))
The above code prints all the links/URLs present on the web page.
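The same parse-then-query idea can be sketched with Python's built-in `html.parser` module instead of htmldom, which avoids an extra install. This is a swapped-in technique, not htmldom's API; `LinkCollector`, `extract_links`, and `sample_html` are names assumed for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Parse an HTML string and return the href values of its <a> tags."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample document used in place of a downloaded page.
sample_html = """
<html><body>
  <a href="https://example.com/one">One</a>
  <a name="anchor-without-href">No link</a>
  <A HREF="/two">Two</A>
</body></html>
"""

for url in extract_links(sample_html):
    print("URL: " + url)
```

Note that this only sees the HTML text it is given, so links added later by JavaScript would not appear.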
You can try something like:
links = driver.find_elements_by_partial_link_text('')
import requests
from selenium import webdriver
import bs4
driver = webdriver.Chrome(r'C:chromedriverschromedriver') #enter the path
data = requests.get('https://google.co.in/')  # any website
s = bs4.BeautifulSoup(data.text, 'html.parser')
for link in s.find_all('a'):
    print(link.get('href'))  # print the href value rather than the whole tag
I have checked and tested that there is a function named find_elements_by_tag_name() you can use. This example works fine for me.
elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)
Unfortunately, the original link posted by OP is dead…
If you’re looking for a way to scrape links on a page, here’s how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:
from gazpacho import Soup
url = "https://stackoverflow.com/q/34759787/3731467"
soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")
[a.attrs["href"] for a in a_tags]
import time
driver.get(URL)
time.sleep(7)  # give the page time to load
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
driver.close()
Note: Adding a delay is very important. First run the script in debug mode and make sure the page is actually loading. If the page loads slowly, increase the delay (sleep time) before extracting the links.
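A fixed sleep either wastes time or is too short on a slow page. One alternative is to poll until links appear, up to a timeout. The sketch below shows that idea in plain Python; `wait_for_links` and `fake_find_links` are hypothetical helpers for this example, and the fake lookup stands in for a real driver call so the code runs without a browser.

```python
import time


def wait_for_links(find_links, timeout=10.0, poll_interval=0.5):
    """Call find_links() repeatedly until it returns a non-empty list or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        links = find_links()
        if links:
            return links
        if time.monotonic() >= deadline:
            raise TimeoutError("no links found within %.1f seconds" % timeout)
        time.sleep(poll_interval)


# Hypothetical page whose links only show up on the third lookup.
attempts = {"count": 0}


def fake_find_links():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return []
    return ["http://example.com/a", "http://example.com/b"]


print(wait_for_links(fake_find_links, timeout=2.0, poll_interval=0.01))
```

With a real driver, `find_links` could be a lambda that performs the XPath lookup. Selenium also ships its own `WebDriverWait` class for this kind of explicit wait, which is likely the better choice in a real script.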
You can do this easily and efficiently with BeautifulSoup. I have tested the code below and it works for this purpose.
After this line:
driver.get("http://psychoticelites.com/")
use the code below:
import requests
from bs4 import BeautifulSoup
response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'html.parser')
for link in soup.find_all('a'):
    if link.get('href'):
        print(link.get("href"))
        print('\n')
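One thing to watch for with this approach: BeautifulSoup returns each href exactly as it is written in the HTML, so relative links stay relative, whereas Selenium typically reports the resolved URL. A minimal sketch of resolving them with the standard library follows; `absolute_links` and the sample values are assumptions made for the example.

```python
from urllib.parse import urljoin


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping empty or missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Hypothetical raw href values as a parser might return them.
raw_hrefs = ["/about", "contact.html", "https://other.example/page", "", None]

for url in absolute_links("http://example.com/blog/", raw_hrefs):
    print(url)
```

In the script above, the base URL would be `driver.current_url` and the hrefs would come from `link.get('href')`.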
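Update to the existing answer:
In recent Selenium 4 releases the find_elements_by_* helpers are no longer available, so it needs to be:
from selenium.webdriver.common.by import By
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))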