BS: Extract all text between two specified keywords
Question:
With Python and BeautifulSoup (BS), I need to extract all the text contained between two specified words:
blabla text I need blibli
I have managed to extract text inside a DIV or another tag, but not between two specific, different keywords.
Thank you for your help.
Answers:
Assuming that you have already extracted all the text inside the specified tag, you now have a string in the same order in which the text was written.
Once you have your full text string, you can get the substring between two words that are different and that each occur only once in the text:
text = "{text}"  # placeholder: replace with your full text string
def get_textchunk(word1, word2, text):
    new_text = text.split(word1)
    new_text = new_text[1]
    newnew_text = new_text.split(word2)
    return newnew_text
print(get_textchunk('word1','word2',text)[0])
This is a function that will split in two steps using two different words.
If you want to get text between two of the same words that occur only twice (once at the start of the text and once at the end) use this code:
def get_textchunk(word, text):
    text = text.split(word)
    return text
print(get_textchunk('word', text)[1])
This will get you the middle of the text you just split.
If you want to get text between two words that are different but occur frequently in the body of your text, use this code:
def get_textchunk(word1, word2, text):
    idx1 = text.index(word1) + len(word1)  # position just after the first word1
    idx2 = text.index(word2, idx1)  # first word2 that comes after word1
    return text[idx1:idx2].strip()
This function may be the most helpful for you.
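If the two keywords really do occur many times and you want every chunk rather than only the first one, a regular expression is one option. This is only a minimal sketch: get_all_textchunks and sample are names I made up for the illustration, and it assumes the chunks do not overlap.

```python
import re

def get_all_textchunks(word1, word2, text):
    """Return every chunk of text found between word1 and the next word2."""
    # re.escape treats the keywords as literal text; (.*?) is non-greedy, so
    # each match stops at the nearest word2; re.DOTALL lets a chunk span lines.
    pattern = re.escape(word1) + r'(.*?)' + re.escape(word2)
    return [chunk.strip() for chunk in re.findall(pattern, text, flags=re.DOTALL)]

sample = 'start A end filler start B\nC end'
print(get_all_textchunks('start', 'end', sample))  # ['A', 'B\nC']
```

It returns an empty list when no pair is found, so there is no exception to handle.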
First of all, thank you for taking the time to help me.
To be clear, I have this text:
Catégorie ererreregrg(75)
Contact
Adresse : fffgdfdgfrrrere
rrgegreggregr
Téléphone : egrrgerererg
text = BeautifulSoup(response.text,'lxml')
def get_textchunk(word1, word2, text):
    new_text = text.split(</h3>)
    new_text = new_text[1]
    newnew_text = new_text.split(Téléphone)
    return newnew_text
print(get_textchunk('word1','word2',text)[0])
I need to get all the text between
Adresse and Téléphone
and my code doesn’t give me any result.
Thank you for your help and your time.
Setup:
from bs4 import BeautifulSoup
# htmlStr = response.text ## if you're fetching with a request
htmlStr = '''
<div>
<h3>Some Section</h3>
Some text from Section 1
<h3> Contact </h3>
Adresse : Address Line One <br/> Address Line Two <br/>
Téléphone : 0X XX XX XX XX <br/>
Site : <a href="https://example.com/">https://example.com/</a>
<h3>Some Other Section</h3>
field 1 : some info
field 2 : __ <br/> field 3 : some other info
</div>
'''
soup = BeautifulSoup(htmlStr, 'lxml')
Getting All Information in a Section
def get_section_info(section_header):
    hName, hList = section_header.name, 'h1,h2,h3,h4,h5,h6'
    hList = [h for h in hList.split(hName)[0].split(',') if h] + [hName]
    section_info, cur_key = {}, None
    for ns in section_header.next_siblings:
        if ns.name in hList: break ## stop if you reach the next section
        if not ((isinstance(ns,str) and not ns.PREFIX) or ns.name): continue ## skip
        nsStr = ' '.join((ns if isinstance(ns,str) else ns.get_text(' ')).split())
        if ':' in nsStr and not nsStr.startswith('http'):
            cur_key, nsStr = [s.strip() for s in nsStr.split(':',1)]
        ckStr = section_info.get(cur_key, '')
        if ns.name == 'br': section_info[cur_key] = ckStr + ' \n'
        elif nsStr: section_info[cur_key] = ckStr + ' ' + nsStr
    section_info = {k:v.strip() for k,v in section_info.items()}
    return section_info[None] if [*section_info]==[None] else section_info
This might look a bit unnecessarily convoluted, but you could get all the contact information with
if (contact_h3 := soup.find('h3', string=lambda s: s and s.strip()=='Contact')):
    contact_info = get_section_info(contact_h3)
else: contact_info, _ = {}, print('Could not find <h3>Contact</h3>')
and contact_info would look like
{ 'Adresse': 'Address Line One \n Address Line Two',
  'Téléphone': '0X XX XX XX XX',
  'Site': 'https://example.com/' }
You could even get all the h3 sections as simply as
{h3.text.strip(): get_section_info(h3) for h3 in soup.select('h3')}
which would return
{
  'Some Section': 'Some text from Section 1',
  'Contact': {
    'Adresse': 'Address Line One \n Address Line Two',
    'Téléphone': '0X XX XX XX XX',
    'Site': 'https://example.com/'
  },
  'Some Other Section': {
    'field 1': 'some info field 2 : __', 'field 3': 'some other info'
  }
}
Note: field 2 is merged into the field 1 value because there’s nothing separating them in the HTML, so there’s no way to know if the key should be 2 or field 2 or info field 2 or… so the function assumes that there is a maximum of one field per NavigableString; and 'Some Section' only has a string value instead of a dictionary since there doesn’t seem to be any colons (:) separating field names from the relevant info.
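To see why the two fields end up merged, it can help to look at what next_siblings actually yields. The snippet below is just a sketch on a cut-down copy of the sample: html, demo_soup and siblings are illustrative names, and I use the built-in 'html.parser' here so it runs without lxml.

```python
from bs4 import BeautifulSoup, NavigableString

html = '<div><h3>Other</h3>field 1 : some info\nfield 2 : __ <br/> field 3 : x</div>'
demo_soup = BeautifulSoup(html, 'html.parser')

# Each sibling after the header is either a NavigableString or a Tag.
siblings = [
    ('string' if isinstance(ns, NavigableString) else ns.name, str(ns).strip())
    for ns in demo_soup.h3.next_siblings
]
for kind, content in siblings:
    print(kind, repr(content))
```

The first sibling is a single string holding both field 1 and field 2, because only a tag such as br splits the text into separate nodes.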
Getting Only One Chunk of Text
If you really only want the text between two words, you can just use
def get_textchunk(word1, word2, text):
    if not (word1 in text and word2 in text): return ''
    return text.split(word1)[-1].split(word2)[0]
    ## can refine with more conditions/string-manipulations/regex/etc
get_textchunk('Adresse :', 'Téléphone :', soup.get_text(' '))
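As one possible refinement, here is a sketch that tells "keyword missing" apart from "empty chunk" and tidies the whitespace. It is not the same function: get_textchunk_or_none and page_text are made-up names, and unlike the version above it starts from the first occurrence of word1 rather than the last.

```python
def get_textchunk_or_none(word1, word2, text):
    """Text between the first word1 and the next word2, or None if either is missing."""
    _, found1, rest = text.partition(word1)
    if not found1:
        return None
    chunk, found2, _ = rest.partition(word2)
    if not found2:
        return None
    return ' '.join(chunk.split())  # collapse newlines and repeated spaces

page_text = 'Contact\nAdresse : Address Line One \n Address Line Two \nTéléphone : 0X XX'
print(get_textchunk_or_none('Adresse :', 'Téléphone :', page_text))
# Address Line One Address Line Two
```

str.partition only looks for word2 after word1, so a stray earlier word2 does not cut the result short.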
However, if you’re not sure of the next field name, but you’re sure that each field name is followed by a colon (:), you can use this version of get_section_info:
def get_text_by_field(soupX, fieldName):
    fsCond = lambda s: s and ':' in s and s.split(':')[0].strip()==fieldName
    fieldStr = soupX.find(string=fsCond)
    if not fieldStr: return '' ## field not found
    hList, myText = [f'h{i}' for i in range(1,7)], fieldStr.split(':',1)[1].strip()
    for ns in fieldStr.next_siblings:
        if ns.name in hList: break ## stop when you reach the next section
        if not ((isinstance(ns,str) and not ns.PREFIX) or ns.name): continue
        nsStr = ' '.join((ns if isinstance(ns,str) else ns.get_text(' ')).split())
        if ':' in nsStr and not nsStr.startswith('http'): break
        if ns.name == 'br': myText += ' \n'
        elif nsStr: myText += (' ' + nsStr)
    return myText.strip()
then get_text_by_field(soup, 'Adresse') should return
'Address Line One \n Address Line Two'
Maybe you can try converting the lines of text into one string. I think the type of your text is actually a list and not a string.
Use print(type(text)) to find the type of your text.
If it’s a list, then:
lines = ["Adresse : fffgdfdgfrrrere",
         "rrgegreggregr",
         "Téléphone : egrrgerererg"]
text = ' '.join(lines)
print(text)
Then:
text = "Adresse: fffgdfdgfrrrere rrgegreggregr Téléphone: egrrgerererg"
def get_textchunk(word1, word2, text):
    new_text = text.split(word1)
    new_text = new_text[1]
    newnew_text = new_text.split(word2)
    return newnew_text
print(get_textchunk('Adresse: ',' Téléphone', text)[0])
Output: "fffgdfdgfrrrere rrgegreggregr"
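Another possibility worth checking is that text is not a string at all. In the code from the question it is the BeautifulSoup object itself, and splitting only works once you have pulled the text out with get_text(). A small sketch, assuming a made-up snippet of HTML (html, page and chunk are illustrative names) and the built-in 'html.parser':

```python
from bs4 import BeautifulSoup

html = '<h3>Contact</h3>Adresse : 1 rue X<br/>75000 Paris<br/>Téléphone : 01'
page = BeautifulSoup(html, 'html.parser')

print(isinstance(page, str))  # False: a BeautifulSoup object, not a string

text = page.get_text(' ')  # plain string, with a space where the tags were
chunk = text.split('Adresse :')[1].split('Téléphone')[0]
print(' '.join(chunk.split()))  # 1 rue X 75000 Paris
```

Note that the keywords passed to split() have to be quoted strings such as 'Téléphone', not bare names.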