BeautifulSoup get text from an element containing substring

Question:

I’m scraping a webpage that uploads different documents, and I want to retrieve some information from these documents. At first I hard-coded the scraper to look for the information at a fixed XPath, but now I see that the path can change depending on the document. Is there any way to get the text from an element that contains a given substring?

Here’s an example:

I want to get the company name. The HTML where it appears looks like this:

<div id="fullDocument">
   <div class="tab">
      <div id="docHeader">...</div>
      <ul id="docToc">...</ul>
      <div class="stdoc">...</div>
      <div id="DocumentBody">
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="grseq">
            <p class="tigrseq">...</p>
            <div class="mlioccur">
               <span class="nomark"></span>
               <span class="timark"></span>
               <div class="txtmark">
                  "Official name: Company Name"
                  <br>
                  "Identification: xxxxxx"
                  <br>
                  "Postal code: 00000"
                  <br>
                  "City: city"
               </div>
            </div>
         </div>
      </div>
   </div>
</div>

For this example, I hardcoded into my script the following code:

from lxml import etree

class LTED:
   def __init__(self, url, soup):
      if not soup:
         soup = get_soup_from_url(url, "html.parser")
      dom = etree.HTML(str(soup))

      self.organization = self.get_organization(dom)

   def get_organization(self, dom):
      item = dom.xpath("/div[@id='fullDocument']/div/div[3]/div[5]/div/div")[0].text
      return item.split(": ")[1]

This works for the example, but as I mentioned, the XPath may change depending on the document. For example, "/div[@id='fullDocument']/div/div[3]/div[5]/div/div" might become "/div[@id='fullDocument']/div/div[3]/div[6]/div/div" or something similar.

Trying to solve this, I searched the Internet and found the following, but it didn’t work for me:

item = soup.find_all("div", string="Official name:")

I expected this to return a list with all elements containing the substring "Official name:" but it gave me an empty list [].

Is there any way to get the element containing the substring, so that I can always get the company name (and any other information I might need) regardless of the XPath?

Asked By: David Jimenez

||

Answers:

I expected this to return a list with all elements containing the substring "Official name:" but it gave me an empty list [].

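That is because a plain string has to match exactly. In addition, when you pass a tag name, the string is compared against the tag's .string, which is None for a div with several children (here text nodes and br tags). Searching the text nodes themselves with re.compile works:

import re
soup.find_all(string=re.compile('Official name:'))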
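
As a minimal sketch of the regex approach (the shortened HTML and the variable names are made up for illustration): the search returns the matching text nodes rather than the enclosing tags, so the value still has to be split off, and .parent gives the surrounding element if you need it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the markup from the question.
html = '''
<div class="txtmark">
   "Official name: Company Name"
   <br>
   "Identification: xxxxxx"
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# The call from the question finds nothing: the <div> has several
# children, so its .string is None and there is nothing to compare.
empty = soup.find_all("div", string="Official name:")
print(empty)

# Without a tag name, the regex is matched against the text nodes
# (NavigableString objects) themselves.
matches = soup.find_all(string=re.compile("Official name:"))

node = matches[0]
# Drop surrounding whitespace and the literal quotes, then keep the value.
company = node.strip().strip('"').split(": ", 1)[1]
print(company)

# The enclosing element is reachable through .parent.
print(node.parent.get("class"))
```

This only locates the text; if you need all fields of the block, the class-based approach is probably more convenient.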

However, why not use an alternative approach (selecting by class) that gives you structured output?

For a single one:

dict(i.strip('"').split(': ') for i in soup.select_one('#DocumentBody div.txtmark').stripped_strings)

### leads to
{'Official name': 'Company Name',
  'Identification': 'xxxxxx',
  'Postal code': '00000',
  'City': 'city'}

or for multiple in your document:

[dict(i.strip('"').split(': ') for i in e.stripped_strings) for e in soup.select('div.txtmark')]


### leads to
[{'Official name': 'Company Name',
  'Identification': 'xxxxxx',
  'Postal code': '00000',
  'City': 'city'},
 {'Official name': 'Company Name B',
  'Identification': 'xxxxxx',
  'Postal code': '00000',
  'City': 'city'}]
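
Note that the one-liners above raise a ValueError as soon as a line has no ': ' separator or a value itself contains ': '. A slightly more defensive sketch, assuming the same div.txtmark layout (the helper name parse_txtmark and the sample HTML are hypothetical, not part of the original page):

```python
from bs4 import BeautifulSoup


def parse_txtmark(block):
    """Turn the 'Key: value' lines of one div.txtmark into a dict."""
    fields = {}
    for line in block.stripped_strings:
        # Split on the first ': ' only, so values may contain ': ' themselves.
        key, sep, value = line.strip('"').partition(": ")
        if sep:  # skip lines that have no separator at all
            fields[key] = value
    return fields


# Hypothetical sample data with one awkward value and one line without a key.
html = '''
<div id="DocumentBody">
   <div class="txtmark">
      "Official name: Company Name"
      <br>
      "Note: contact: front desk"
      <br>
      "no separator here"
   </div>
   <div class="txtmark">
      "Title: something else"
   </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")
records = [parse_txtmark(div) for div in soup.select("div.txtmark")]

# Keep only the blocks that actually carry a company name.
companies = [r["Official name"] for r in records if "Official name" in r]
print(companies)
```

Filtering on the key afterwards means it does not matter at which position in the document the block with the company name appears.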

Example

from bs4 import BeautifulSoup

html='''
<div id="fullDocument">
   <div class="tab">
      <div id="docHeader">...</div>
      <ul id="docToc">...</ul>
      <div class="stdoc">...</div>
      <div id="DocumentBody">
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="stdoc">...</div>
         <div class="grseq">
            <p class="tigrseq">...</p>
            <div class="mlioccur">
               <span class="nomark"></span>
               <span class="timark"></span>
               <div class="txtmark">
                  "Official name: Company Name"
                  <br>
                  "Identification: xxxxxx"
                  <br>
                  "Postal code: 00000"
                  <br>
                  "City: city"
               </div>
            </div>
         </div>
      </div>
   </div>
</div>
'''

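# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')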

dict(i.strip('"').split(': ') for i in soup.select_one('#DocumentBody div.txtmark').stripped_strings)
Answered By: HedgeHog