in Python 3.6 – getting text using an XPath expression

Question:

<div class="card-block cms">
<p>and then have a tea or coffee on the balcony of the cafeteria.</p>
<p>&nbsp;</p>
</div>

I am trying to check whether the text I crawl from a website contains &nbsp; (a non-breaking space).

texts = driver.find_element_by_xpath("//div[@class='card-block cms']")
textInDivTag = texts.text
print(textInDivTag)
if u"\xa0" in textInDivTag:
    print("yes")

My output is as follows:

and then have a tea or coffee on the balcony of the cafeteria.

As you can see, it doesn’t recognize the non-breaking space.

Asked By: Valdrin Shala

Answers:

The character is recognized, but it is being converted to a normal space (u"\x20").

According to the comment in the Java Selenium source code, .text / .getText() returns the visible text, and it references the W3C WebDriver specification, section "11.3.5 Get Element Text" (emphasis added by me):

The Get Element Text command intends to return an element’s text “as
rendered”. An element’s rendered text is also used for locating a
elements by their link text and partial link text.

One of the major inputs to this specification was the open source
Selenium project. This was in wide-spread use before this
specification written, and so had set user expectations of how the Get
Element Text command should work. As such, the approach presented here
is known to be flawed, but provides the best compatibility with
existing users.

So this behavior probably follows the specification, but I have not yet found the source code that specifically replaces non-breaking spaces with regular whitespace. I also could not find an issue about it in the Selenium repository; you could try opening one.
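To make the difference between the two characters concrete, here is a small sketch using only the standard library. The helper names describe and count_spaces are made up for illustration and are not part of Selenium.

```python
import unicodedata

NBSP = "\xa0"   # the character that &nbsp; decodes to
SPACE = "\x20"  # the ordinary space

def describe(ch):
    """Return a readable label such as 'U+00A0 NO-BREAK SPACE'."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch))

def count_spaces(text):
    """Count ordinary and non-breaking spaces separately."""
    return {"nbsp": text.count(NBSP), "space": text.count(SPACE)}

print(describe(NBSP))
print(describe(SPACE))
print(count_spaces("tea or coffee" + NBSP))
```

Running count_spaces on whatever string Selenium hands back is a quick way to see whether the non-breaking space survived or was turned into a regular one.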

Answered By: soerface

To match u"\xa0", use

textInDivTag = texts.get_attribute('innerText')

To match u"\x20", use

textInDivTag = texts.text
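
If you need this check in several places, it could be wrapped in a small helper. This is only a sketch: element_has_nbsp and its prop parameter are hypothetical names, and it assumes the browser's innerText keeps the non-breaking space, as described above. It works with anything that has a get_attribute method, such as a Selenium WebElement.

```python
NBSP = "\xa0"

def element_has_nbsp(element, prop="innerText"):
    """Return True if the given DOM property of the element contains U+00A0.

    `element` is assumed to expose get_attribute(name), like a Selenium
    WebElement; a missing property (None) counts as "no nbsp".
    """
    value = element.get_attribute(prop)
    return value is not None and NBSP in value
```

Usage would look like element_has_nbsp(texts), with texts being the element found in the question.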
Answered By: ewwink

Non-breaking Space (&nbsp;)

A non-breaking space (&nbsp;) is a space that will not break into a new line. Two words separated by a non-breaking space stick together and are never split across lines. This is handy when breaking between the words would be disruptive. Examples:

  • § 10
  • 10 km/h
  • 10 PM

Another common use of the non-breaking space is to prevent browsers from truncating spaces in HTML pages. If you write 10 spaces in your text, the browser will remove 9 of them. To add real spaces to your text, you can use the &nbsp; character entity.
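
The collapsing behavior can be imitated in Python. The following is a rough approximation, not the browser's actual whitespace algorithm: collapse_whitespace is a made-up helper that squeezes runs of ASCII whitespace into one space and, like a browser, leaves U+00A0 alone.

```python
import html
import re

# ASCII whitespace only; "\s" is avoided on purpose because in Python 3
# it would also match the non-breaking space.
ASCII_WS = re.compile(r"[ \t\n\r\f]+")

def collapse_whitespace(text):
    """Replace each run of ASCII whitespace with a single space."""
    return ASCII_WS.sub(" ", text)

# Ten ordinary spaces collapse to one.
plain = collapse_whitespace("10" + " " * 10 + "km/h")

# html.unescape turns each &nbsp; into "\xa0", which is not collapsed.
padded = collapse_whitespace(html.unescape("10" + "&nbsp;" * 3 + "km/h"))

print(repr(plain))
print(repr(padded))
```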


Element.innerHTML

  • Syntax:

    const content = element.innerHTML;
    element.innerHTML = htmlString;
    
  • Value: Element.innerHTML is a DOMString containing the HTML serialization of the element’s descendants. Setting the value of innerHTML removes all of the element’s descendants and replaces them with nodes constructed by parsing the HTML given in the string htmlString.

  • Note: If a <div>, <span>, or <noembed> node has a child text node that includes the characters (&), (<), or (>), innerHTML returns these characters as the HTML entities &amp;, &lt; and &gt; respectively. Use Node.textContent to get a raw copy of these text nodes’ contents.
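
The escaping described in the note can be sketched in Python. As far as I understand the HTML serialization rules, a non-breaking space in a text node is also written out as &nbsp;, so with innerHTML you would look for the entity rather than for "\xa0". The function serialize_text_node is a hypothetical illustration and only covers text nodes, not attributes or elements.

```python
import html

def serialize_text_node(text):
    """Approximate how innerHTML writes out the contents of a text node."""
    # html.escape with quote=False handles &, < and > only.
    escaped = html.escape(text, quote=False)
    # Assumption stated above: U+00A0 is serialized as the &nbsp; entity.
    return escaped.replace("\xa0", "&nbsp;")

print(serialize_text_node("tea & coffee\xa0<now>"))
```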


Node.innerText

Node.innerText is a property that represents the rendered text content of a node and its descendants. As a getter, it approximates the text the user would get if they highlighted the contents of the element with the cursor and then copied to the clipboard.


Node.textContent

The Node.textContent property represents the text content of a node and its descendants.

  • Syntax:

    var text = element.textContent;
    element.textContent = "this is some sample text";
    
  • Description:

  • textContent returns null if the node is a document, a DOCTYPE, or a notation. To grab all of the text and CDATA data for the whole document, one could use document.documentElement.textContent.

  • If the node is a CDATA section, comment, processing instruction, or text node, textContent returns the text inside this node (the nodeValue).

  • For other node types, textContent returns the concatenation of the textContent of every child node, excluding comments and processing instructions. This is an empty string if the node has no children.
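
To see what this concatenation means for the markup in the question, here is a sketch without a browser. It uses the standard library's html.parser to join all text nodes of a fragment, which only approximates textContent; TextContentCollector and text_content are names invented for this example.

```python
from html.parser import HTMLParser

class TextContentCollector(HTMLParser):
    """Collect the text nodes of a fragment; comments are ignored."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &nbsp; to "\xa0".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_content(fragment):
    """Return the concatenated text of all text nodes in the fragment."""
    collector = TextContentCollector()
    collector.feed(fragment)
    collector.close()
    return "".join(collector.chunks)

fragment = (
    '<div class="card-block cms">'
    "<p>and then have a tea or coffee on the balcony of the cafeteria.</p>"
    "<p>&nbsp;</p>"
    "</div>"
)
text = text_content(fragment)
has_nbsp = "\xa0" in text
print(repr(text), has_nbsp)
```

Nothing is normalized on the way, so the non-breaking space from the second paragraph is still present in the result.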


This use case

As your use case is to check whether the website contains &nbsp;, you can read the textContent property as follows:

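texts = driver.find_element_by_xpath("//div[@class='card-block cms']")
# find_element_* returns a single WebElement, so read the DOM property directly
textInDivTag = texts.get_attribute("textContent")
print(textInDivTag)
if u"\xa0" in textInDivTag:
    print("yes")

Answered By: undetected Selenium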