Difference between .string and .text in BeautifulSoup
Question:
I noticed something odd when working with BeautifulSoup and couldn’t find any documentation to explain it, so I wanted to ask here.
Say we have tags like these that we have parsed with BS:
<td>Some Table Data</td>
<td></td>
The official documented way to extract the data is soup.string. However, this returned None for the second <td> tag. So I tried soup.text (because why not?) and it returned an empty string, exactly as I wanted.
However, I couldn’t find any reference to this in the documentation and am worried that something is amiss. Can anyone let me know if this is acceptable to use, or will it cause problems later?
BTW I am scraping table data from a web page and mean to create CSVs from the data so I do actually need empty strings rather than NoneTypes.
Answers:
The element <td></td> does not contain an empty string. It is equivalent to <td/>, which has no children. In XML, “no text” and “zero-length text” are the same thing.
So soup.string is correct to return None.
See also: How to create an XML text node with an empty string value (in Java)
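A quick way to check this is to look at the tag's children directly. The sketch below assumes the bs4 package and Python's built-in html.parser; the markup and variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = "<table><tr><td>Some Table Data</td><td></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")
filled, empty = soup.find_all("td")

# The empty cell has no child nodes at all, not even a zero-length string.
print(empty.contents)      # []
print(empty.string)        # None
print(repr(empty.text))    # ''

# The filled cell has exactly one string child, so .string returns it.
print(filled.string)       # Some Table Data
```

So .string reports "there is no string node here", while .text reports "the concatenation of zero strings", which is the empty string.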
Calling .string on a Tag object returns a NavigableString object, or None when the tag does not have exactly one string to return. .text, on the other hand, gets all the child strings and returns them concatenated using the given separator (an empty string by default). The return type of .text is a unicode object in Python 2, or a plain str in Python 3.
From the documentation: a NavigableString is just like a Python Unicode string, except that it also supports some of the features described in Navigating the tree and Searching the tree.
From the documentation on .string, we can see that if the HTML is like this,
<td>Some Table Data</td>
<td></td>
then .string on the second td will return None, but .text will return an empty string, which is a unicode type object.
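The difference in return types can be checked with a short script. This is a sketch under the assumption of Python 3 with the bs4 package, where "unicode" above corresponds to str; the names first and second are illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<td>Some Table Data</td><td></td>", "html.parser")
first, second = soup.find_all("td")

# .string hands back the node itself, which still knows where it sits in the tree.
print(isinstance(first.string, NavigableString))  # True
print(first.string.parent.name)                   # td

# NavigableString subclasses str, so it also works anywhere a string is expected.
print(isinstance(first.string, str))              # True

# .text builds a string by joining the child strings, so an empty tag gives "".
print(isinstance(first.text, str))                # True
print(second.string is None, second.text == "")   # True True
```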
For convenience, here is how the two properties are documented:
string
- Convenience property of a tag to get the single string within this tag.
- If the tag has a single string child then the return value is that string.
- If the tag has no children or more than one child then the return value is None.
- If this tag has one child tag then the return value is the ‘string’ attribute of the child tag, recursively.
text
- Get all the child strings and return concatenated using the given separator.
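The recursive rule and the "more than one child" rule are easy to trip over when the markup contains whitespace. The following sketch (assuming bs4 with html.parser; the markup is made up for illustration) shows both cases.

```python
from bs4 import BeautifulSoup

html = "<tr><td><p><b>deep</b></p></td><td> <p>padded</p> </td></tr>"
soup = BeautifulSoup(html, "html.parser")
nested, padded = soup.find_all("td")

# One child tag at every level: .string recurses down to the single string.
print(nested.string)            # deep

# The spaces around <p> are string children too, so this td has three children.
print(len(padded.contents))     # 3
print(padded.string)            # None
print(repr(padded.text))        # ' padded '
```

In other words, pretty-printed HTML can make .string return None even when a cell visibly holds a single piece of text.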
If the HTML is like this:
<td>some text</td>
<td></td>
<td><p>more text</p></td>
<td>even <p>more text</p></td>
.string on the four td elements will return:
some text
None
more text
None
.text will give results like this (the second one is an empty string):
some text

more text
even more text
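The four cells can be run side by side to confirm these results. A minimal sketch, assuming bs4 with html.parser and wrapping the cells in a table row for tidiness:

```python
from bs4 import BeautifulSoup

html = (
    "<table><tr>"
    "<td>some text</td>"
    "<td></td>"
    "<td><p>more text</p></td>"
    "<td>even <p>more text</p></td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

strings = [td.string for td in cells]
texts = [td.text for td in cells]

print(strings)  # ['some text', None, 'more text', None]
print(texts)    # ['some text', '', 'more text', 'even more text']
```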
If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None.
Example:
<td>sometext<p>sometext</p></td>
Here td.string returns None, because the td contains text as well as a p tag. But td.text gives: sometextsometext
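For the CSV use case in the question, one possible approach is to call get_text() on each cell, which always yields a string. The sketch below is only an illustration: it assumes bs4 with html.parser, uses made-up table markup, and writes to an in-memory buffer instead of a real file. Passing a separator and strip=True is optional and just keeps the two pieces of text from running together.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Some Table Data</td><td></td></tr>
  <tr><td>sometext<p>sometext</p></td><td>x</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# get_text() never returns None, so empty cells become empty CSV fields.
rows = [
    [td.get_text(" ", strip=True) for td in tr.find_all("td")]
    for tr in soup.find_all("tr")
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

If .string is preferred for some reason, a fallback such as `td.string or ""` would also avoid None, but it silently drops the text of cells that have more than one child.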