getpath in lxml etree is showing different output for absolute xpath

Question:

I am trying to get the absolute XPath of an element, but getpath returns a different path than I expect. Specifically, I want the full XPath of the search button on Google. Here is the code I tried:

import time
import random
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from lxml import etree

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--log-level=3")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)

main_link = r"https://www.google.com"
driver.get(main_link)

time.sleep(5)

with open("dom.xml", "w", encoding="utf-8") as domfile:
    domfile.write(driver.page_source)
tree = etree.parse("dom.xml", parser=etree.XMLParser(recover=True))
print(tree)
element = tree.xpath("(//input[@class='gNO89b'])[2]")
print(element)
# Trying to print the absolute XPath
print(tree.getpath(element[0]))

Output should be: /html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]

But it is giving me: /html/head/meta/meta/meta/link/script[6]/br/body/div/div[2]/div[2]/form/div/div/div/div[2]/div[2]/div[7]/center/input

Asked By: fardV


Answers:

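This happens because you are parsing HTML with an XML parser. The two formats have different rules: HTML void elements such as meta, link, and br have no closing tags, so an XML parser in recover mode treats them as still open and nests the content that follows inside them. That is why the path you get runs through /html/head/meta/meta/meta/link/script[6]/br/body. To keep the HTML structure as it is, parse the page source string with lxml's HTML parser (lxml.html.fromstring) instead.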

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import lxml.html

options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_argument("--log-level=3")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s, options=options)
main_link = r"https://www.google.com"

driver.get(main_link)
time.sleep(5)

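# Parse the page source with lxml's HTML parser rather than an XML parser.
tree = lxml.html.fromstring(driver.page_source)
# getpath() is a method of the ElementTree, so fetch it from the root element.
root = tree.getroottree()
element = tree.xpath("(//input[@class='gNO89b'])[2]")
print(root.getpath(element[0]))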

Output:
/html/body/div[1]/div[3]/form/div[1]/div[1]/div[4]/center/input[1]
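To see the parser difference without launching a browser, here is a minimal sketch that runs both parsers on the same string. The SAMPLE_HTML page, the "btn" class, and the path_of_input helper are made up for illustration and are not taken from the Google page. The exact path produced by the XML parser depends on how libxml2 recovers from the unclosed tags, so only the HTML parser result is shown as a comment.

```python
from lxml import etree
import lxml.html

# Hypothetical minimal page: <meta> and <br> are void elements with no closing tag.
SAMPLE_HTML = (
    '<html><head><meta charset="utf-8"><title>demo</title></head>'
    '<body><div><br><input class="btn"></div></body></html>'
)


def path_of_input(root):
    """Return the absolute path of the first input with class 'btn', or None."""
    matches = root.xpath("//input[@class='btn']")
    return root.getroottree().getpath(matches[0]) if matches else None


# Same approach as in the question: XML parser in recover mode.
xml_root = etree.fromstring(SAMPLE_HTML, parser=etree.XMLParser(recover=True))
# Same approach as in the answer: lxml's HTML parser.
html_root = lxml.html.fromstring(SAMPLE_HTML)

xml_path = path_of_input(xml_root)
html_path = path_of_input(html_root)

print("XML parser: ", xml_path)   # differs from the HTML result
print("HTML parser:", html_path)  # /html/body/div/input
```

The HTML parser knows that meta and br cannot have children, so body stays a direct child of html. The XML parser has no such knowledge and has to guess where the unclosed elements end.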

If your goal is to serialise the HTML document as XML after parsing, you may need to apply some manual preprocessing first.
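One possible way to get XML out of an HTML page, sketched here under assumptions, is to parse with the HTML parser first and then serialise the tree with etree.tostring(..., method="xml"), which writes void elements as self-closing tags. The SAMPLE_HTML page and the "btn" class are hypothetical. Real pages may need more preprocessing than this, for example around inline scripts or named entities, so treat it as a starting point rather than a complete recipe.

```python
from lxml import etree
import lxml.html

# Hypothetical minimal page used only for illustration.
SAMPLE_HTML = (
    '<html><head><meta charset="utf-8"><title>demo</title></head>'
    '<body><div><br><input class="btn"></div></body></html>'
)

# Step 1: let the HTML parser build the correct tree.
html_root = lxml.html.fromstring(SAMPLE_HTML)

# Step 2: serialise that tree as XML; void elements become self-closing.
xml_bytes = etree.tostring(html_root, method="xml")

# Step 3: the result is now well-formed, so the default strict XML parser accepts it.
xml_root = etree.fromstring(xml_bytes)
element = xml_root.xpath("//input[@class='btn']")[0]
round_trip_path = xml_root.getroottree().getpath(element)

print(round_trip_path)  # /html/body/div/input
```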

Answered By: Frederickcjo