BS: Extract all text between two specified keywords

Question

With Python and BeautifulSoup (BS), I need to extract all the text contained between two specified words:

blabla text i need blibli

I have managed to extract text inside a DIV or another tag, but not between two specific, different keywords.

Thank you for your help

Asked By: steve figueras


Answer 1

Assuming that you have already extracted all the text inside the specified tag, you now have a string in the same order in which the text was written.

Once you have your full text string, you can get a substring between two words that are different and each only occur once in the text:

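text = 'before word1 the text you need word2 after'  # example only; use your own text string
def get_textchunk(word1, word2, text):
        new_text = text.split(word1)  # split at the first keyword
        new_text = new_text[1]  # keep what follows the first keyword
        newnew_text = new_text.split(word2)  # split that at the second keyword
        return newnew_text
print(get_textchunk('word1','word2',text)[0])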

This is a function that will split in two steps using two different words.
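As a hedged alternative, the same two-step idea can be sketched with str.partition, which makes it easy to detect a missing keyword instead of hitting an IndexError. The function name text_between is an assumption for illustration and is not part of the original answer; the sample string is the one from the question.

```python
def text_between(text, word1, word2):
    """Return the text between the first word1 and the next word2, or None."""
    # partition splits at the first occurrence and reports whether it was found
    _, found1, rest = text.partition(word1)
    if not found1:
        return None
    chunk, found2, _ = rest.partition(word2)
    if not found2:
        return None
    return chunk.strip()

sample = 'blabla text i need blibli'  # sample string from the question
print(text_between(sample, 'blabla', 'blibli'))  # text i need
```

Returning None rather than an empty string lets the caller tell "keyword missing" apart from "nothing between the keywords".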

If you want to get text between two of the same words that occur only twice (once at the start of the text and once at the end) use this code:

def get_textchunk(word, text):
        text = text.split(word)
        return text
print(get_textchunk('word', text)[1])

This will get you the middle of the text you just split.
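A small sketch of the same idea with a sanity check: it only returns the middle piece when the delimiter really occurs exactly twice. The name between_same_word and the sample text are hypothetical.

```python
def between_same_word(word, text):
    """Return the text between two occurrences of word, or None otherwise."""
    parts = text.split(word)
    # n occurrences of word produce n + 1 pieces, so two occurrences give three
    if len(parts) != 3:
        return None
    return parts[1].strip()

sample = 'intro MARK the part you need MARK outro'  # made-up sample text
print(between_same_word('MARK', sample))  # the part you need
```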

If you want to get text between two words that are different but occur frequently in the body of your text use this code:

def get_textchunk(word1, word2, text):
        idx1 = text.index(word1) + len(word1)  # position just after the first word1
        idx2 = text.index(word2, idx1)  # first word2 that comes after word1
        return text[idx1:idx2].strip()

This function may be the most helpful for you.
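When both words occur many times, one option is to collect every chunk rather than only the first. The following is a minimal sketch using the standard re module; the function name all_textchunks and the sample text are assumptions for illustration.

```python
import re

def all_textchunks(word1, word2, text):
    """Return every piece of text found between word1 and the nearest following word2."""
    # re.escape treats the keywords literally; .*? is non-greedy so each match
    # stops at the first word2; re.DOTALL lets a chunk span several lines
    pattern = re.escape(word1) + r'(.*?)' + re.escape(word2)
    return [chunk.strip() for chunk in re.findall(pattern, text, flags=re.DOTALL)]

sample = 'START first END noise START second\nline END'  # made-up sample text
print(all_textchunks('START', 'END', sample))  # ['first', 'second\nline']
```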

Answered By: Abigail Tjie

Answer 2

First of all, thank you for taking the time to help me.

To be clear, I have this text:

Catégorie ererreregrg(75)

Contact

Adresse : fffgdfdgfrrrere
rrgegreggregr
Téléphone : egrrgerererg

  text = BeautifulSoup(response.text,'lxml')
            def get_textchunk(word1, word2, text):
                new_text = text.split(</h3>)
                new_text = new_text[1]
                newnew_text = new_text.split(Téléphone)
                return newnew_text
                print(get_textchunk('word1','word2',text)[0])

I need to get all the text between Adresse and Téléphone, but my code doesn't give me any result.

Thank you for your help and your time.

Answered By: steve figueras

Answer 3

Setup:

from bs4 import BeautifulSoup
# htmlStr = response.text ## if you're fetching with a request
htmlStr = '''
<div>
    <h3>Some Section</h3>
    Some text from Section 1

    <h3> Contact </h3>
    Adresse : Address Line One <br/> Address Line Two <br/>
    Téléphone : 0X XX XX XX XX <br/> 
    Site : <a href="https://example.com/">https://example.com/</a>

    <h3>Some Other Section</h3>
    field 1 : some info 
    field 2 : __ <br/> field 3 : some other info
</div>
'''

soup = BeautifulSoup(htmlStr, 'lxml')

Getting All Information in a Section

def get_section_info(section_header):
    hName, hList = section_header.name, 'h1,h2,h3,h4,h5,h6'
    hList = [h for h in hList.split(hName)[0].split(',') if h] + [hName]

    section_info, cur_key = {}, None
    for ns in section_header.next_siblings:
        if ns.name in hList: break ## stop if you reach the next section
        if not ((isinstance(ns,str) and not ns.PREFIX) or ns.name): continue ## skip

        nsStr = ' '.join((ns if isinstance(ns,str) else ns.get_text(' ')).split())
        if ':' in nsStr and not nsStr.startswith('http'): 
            cur_key, nsStr = [s.strip() for s in nsStr.split(':',1)]

        ckStr = section_info.get(cur_key, '')
        if ns.name == 'br': section_info[cur_key] = ckStr + ' \n'
        elif nsStr: section_info[cur_key] = ckStr + ' ' + nsStr
    
    section_info = {k:v.strip() for k,v in section_info.items()}
    return section_info[None] if [*section_info]==[None] else section_info

This might look a bit unnecessarily convoluted, but you could get all the contact information with

if (contact_h3 := soup.find('h3', string=lambda s: s and s.strip()=='Contact')):
    contact_info = get_section_info(contact_h3)
else: contact_info, _ = {}, print('Could not find <h3>Contact</h3>')

and contact_info would look like

{ 'Adresse': 'Address Line One \n Address Line Two', 
  'Téléphone': '0X XX XX XX XX', 
  'Site': 'https://example.com/' }

You could even get all the h3 sections as simply as

{h3.text.strip(): get_section_info(h3) for h3 in soup.select('h3')}

which would return

{
  'Some Section': 'Some text from Section 1',
  'Contact': {
    'Adresse': 'Address Line One \n Address Line Two',
    'Téléphone': '0X XX XX XX XX',
    'Site': 'https://example.com/'
  },
  'Some Other Section': {
    'field 1': 'some info field 2 : __', 'field 3': 'some other info'
  }
}

Note: field 2 is merged into the field 1 value because there's nothing separating them in the HTML, so there's no way to know whether the key should be 2 or field 2 or info field 2 or… so the function assumes that there is at most one field per NavigableString. Also, 'Some Section' only has a string value instead of a dictionary, since there don't seem to be any colons (:) separating field names from the relevant info.

Getting Only One Chunk of Text

If you really only want the text between two words, you can just use

def get_textchunk(word1, word2, text):
    if not (word1 in text and word2 in text): return ''
    return text.split(word1)[-1].split(word2)[0]
    ## can refine with more conditions/string-manipulations/regex/etc

get_textchunk('Adresse :', 'Téléphone :', soup.get_text(' '))

However, if you’re not sure of the next field name, but you’re sure that the field names are separated with :, you can use this version of get_section_info:

def get_text_by_field(soupX, fieldName):
    fsCond = lambda s: s and ':' in s and s.split(':')[0].strip()==fieldName
    fieldStr = soupX.find(string=fsCond)
    if not fieldStr: return '' ## field not found, nothing to extract
    hList, myText = [f'h{i}' for i in range(1,7)], fieldStr.split(':',1)[1].strip()
    for ns in fieldStr.next_siblings:
        if ns.name in hList: break ## stop when you reach the next section
        if not ((isinstance(ns,str) and not ns.PREFIX) or ns.name): continue 

        nsStr = ' '.join((ns if isinstance(ns,str) else ns.get_text(' ')).split())
        if ':' in nsStr and not nsStr.startswith('http'): break ## next field starts

        if ns.name == 'br': myText += ' \n'
        elif nsStr: myText += (' ' + nsStr)
    return myText.strip()

then get_text_by_field(soup, 'Adresse') should return

'Address Line One \n Address Line Two'

Answered By: Driftr95

Answer 4

Maybe you can try converting the lines of text into one string. I suspect that your text is actually a list and not a string.

Use print(type(text)) to find the type of your text.

If it’s a list then:

# named "lines" rather than "list" to avoid shadowing the built-in list type
lines = ["Adresse : fffgdfdgfrrrere",
"rrgegreggregr",
"Téléphone : egrrgerererg"]

text = ' '.join(lines)
print(text)

Then:

text = "Adresse : fffgdfdgfrrrere rrgegreggregr Téléphone : egrrgerererg"
def get_textchunk(word1, word2, text):
    new_text = text.split(word1)
    new_text = new_text[1]
    newnew_text = new_text.split(word2)
    return newnew_text
print(get_textchunk('Adresse : ',' Téléphone', text)[0])

Answer: "fffgdfdgfrrrere rrgegreggregr"
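Putting the two steps together, a hedged sketch that accepts either a single string or a list of lines might look like this. The helper name chunk_from_lines is hypothetical, and the sample data is the placeholder text from the question.

```python
def chunk_from_lines(lines_or_text, word1, word2):
    """Join a list of lines into one string if needed, then return the text
    between the first word1 and the next word2 ('' if either is missing)."""
    if isinstance(lines_or_text, (list, tuple)):
        text = ' '.join(line.strip() for line in lines_or_text)
    else:
        text = lines_or_text
    if word1 not in text or word2 not in text:
        return ''
    # maxsplit=1 keeps only the first occurrence of each keyword in play
    return text.split(word1, 1)[1].split(word2, 1)[0].strip()

lines = ["Adresse : fffgdfdgfrrrere", "rrgegreggregr", "Téléphone : egrrgerererg"]
print(chunk_from_lines(lines, 'Adresse :', 'Téléphone'))  # fffgdfdgfrrrere rrgegreggregr
```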