Python BeautifulSoup issue in extracting direct text in a given html tag
Question:
I am trying to extract the direct text of a given HTML tag. For example, for <p> Hello! </p>, the direct text is "Hello!". The code works well except in the case below.
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div> <i> </i> FF Services </div>', "html.parser")
for tag in soup.find_all():
    direct_text = tag.find(string=True, recursive=False)
    print(tag, ':', direct_text)
Output:
`<div> <i> </i> FF Services </div> : `
`<i> </i> : `
The first printed line should be <div> <i> </i> FF Services </div> : FF Services, but it skips FF Services. I found that when I delete <i> </i>, the code works fine.
What’s the problem here?
Answers:
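Using .find_all instead of .find returns every direct string instead of only the first one, so FF Services is included in the result (as a list that also contains the whitespace-only strings). Try this code:
for tag in soup.find_all():
    direct_text = tag.find_all(string=True, recursive=False)
    print(tag, ':', direct_text)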
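Since .find_all returns a list that still includes the whitespace-only strings, a small helper can clean it up. The following is a minimal sketch, not part of the original answer: the name direct_text_of and the choice to join pieces with a single space are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div> <i> </i> FF Services </div>', "html.parser")

def direct_text_of(tag):
    """Hypothetical helper: join the non-blank direct strings of a tag."""
    # recursive=False limits the search to the tag's own children
    pieces = tag.find_all(string=True, recursive=False)
    # drop whitespace-only strings and trim the rest
    return " ".join(s.strip() for s in pieces if s.strip())

for tag in soup.find_all():
    print(tag, ':', repr(direct_text_of(tag)))
```

With the soup from the question, the helper returns 'FF Services' for the div and an empty string for the i tag.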
The issue is not with BeautifulSoup's methods. Your code actually works… you just fell into your own trap!^1 div.find(string=True) returns the first node that is a string. When <div> <i>... is parsed, the <i> tag is certainly there, but before it there is also a NavigableString that consists of a single white space. This means your code does print something: that single white space. Here is a test:
print(f"{len(soup.div.find(string=True)) = }") # using same soup as in the question
#len(soup.div.find(string=True)) = 1
It is helpful to look at the tag’s content:
for tag in soup.div.contents:
    print(f"-{tag}-", type(tag))
#- - <class 'bs4.element.NavigableString'>
#-<i> </i>- <class 'bs4.element.Tag'>
#- FF Services - <class 'bs4.element.NavigableString'>
Be aware of this note from the documentation:
"If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None:". To work around this, either use more precise navigation if possible, or use .strings or .stripped_strings together with some parsing.
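One caveat is that .stripped_strings walks all descendants, not only the tag's own children. The sketch below contrasts it with a filter over .contents; the markup (with the word "inner" placed inside <i>) is a made-up variation of the question's example, used only to make the difference visible.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup: same shape as the question, but <i> now holds text
soup = BeautifulSoup('<div> <i>inner</i> FF Services </div>', "html.parser")
div = soup.div

# .stripped_strings descends into child tags, so "inner" is included
all_text = list(div.stripped_strings)

# Keep only direct children that are strings and are not blank
direct = [
    child.strip()
    for child in div.contents
    if isinstance(child, NavigableString) and child.strip()
]

print(all_text)  # ['inner', 'FF Services']
print(direct)    # ['FF Services']
```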
^1 Something similar has happened to me too.