Why do I get an AttributeError when trying to use BeautifulSoup's `.find` to find text in a page?

Question:

I am trying to scrape a website with BeautifulSoup but have run into a problem.
I was following a tutorial written for Python 2.7 that used exactly the same code and had no problems.

import urllib.request
from bs4 import *


htmlfile = urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs")

htmltext = htmlfile.read()

soup = BeautifulSoup(htmltext)
title = (soup.title.text)

body = soup.find("Born").findNext('td')
print (body.text)

When I run the program, I get:

Traceback (most recent call last):
  File "C:\Users\USER\Documents\Python Programs\World Population.py", line 13, in <module>
    body = soup.find("Born").findNext('p')
AttributeError: 'NoneType' object has no attribute 'findNext'

Is this a problem with Python 3, or am I just too naive?

Asked By: user3247140

||

Answers:

The find and find_all methods do not search for arbitrary text in the document; they search for HTML tags. The documentation makes that clear:


Pass in a value for name and you’ll tell Beautiful Soup to only consider tags with certain names. Text strings will be ignored, as will tags whose names that don’t match. This is the simplest usage:

soup.find_all("title")
# [<title>The Dormouse's story</title>]

That’s why your soup.find("Born") is returning None and hence why it complains about NoneType (the type of None) having no findNext() method.
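As a minimal sketch of that behaviour, the snippet below parses a small hypothetical HTML string (a stand-in for the downloaded page, not the real Wikipedia markup) and shows a tag-name lookup succeeding, a text lookup returning None, and a None check before chaining another call:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real page is much larger.
html = "<html><head><title>Steve Jobs</title></head><body><p>Born in 1955</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "title" is a tag name, so this lookup succeeds.
title_tag = soup.find("title")

# "Born" appears only as text, never as a tag name, so this returns None.
born_tag = soup.find("Born")

# Check for None before chaining another call onto the result.
if born_tag is None:
    print("No <Born> tag found")
else:
    print(born_tag.find_next("td").text)
```

Guarding the result this way replaces the AttributeError with a message you control.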

The page you reference contains (at the time this answer was written) eight copies of the word "born", none of which are tags.

Looking at the HTML source for that page, you’ll find the best option may be to look for the correct span (formatted for readability):

<th scope="row" style="text-align: left;">Born</th>
<td>
    <span class="nickname">Steven Paul Jobs</span><br />
    <span style="display: none;">(<span class="bday">1955-02-24</span>)</span>February 24, 1955<br />
</td>
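
One possible way to pull values out of that fragment is sketched below. It assumes the fragment is wrapped in a table row (the `<table><tr>` wrapper is added here for illustration) and uses the `string` argument of find, available in recent Beautiful Soup 4 releases, to match the header cell by its text. The live page may have changed since, so treat the markup as an example only:

```python
from bs4 import BeautifulSoup

# The fragment quoted above, wrapped in a table row for illustration.
html = """
<table><tr>
<th scope="row" style="text-align: left;">Born</th>
<td>
    <span class="nickname">Steven Paul Jobs</span><br />
    <span style="display: none;">(<span class="bday">1955-02-24</span>)</span>February 24, 1955<br />
</td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Match the header cell by its text, then step to the data cell after it.
header = soup.find("th", string="Born")
cell = header.find_next("td")

# Pull individual values out of the cell by class name.
name = cell.find("span", class_="nickname").text
bday = cell.find("span", class_="bday").text

print(name)
print(bday)
```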
Answered By: paxdiablo

The find method looks for tags, not text. To find the name, birthday and birthplace, you would have to look up the span elements with the corresponding class name, and access the text attribute of that item:

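import urllib.request
from bs4 import BeautifulSoup

# Pass the response object straight to BeautifulSoup and name the parser explicitly
response = urllib.request.urlopen("http://en.wikipedia.org/wiki/Steve_Jobs")
soup = BeautifulSoup(response, "html.parser")
title = soup.title.text
# Each value lives in a span with a distinctive class name
name = soup.find('span', {'class': 'nickname'}).text
bday = soup.find('span', {'class': 'bday'}).text
birthplace = soup.find('span', {'class': 'birthplace'}).text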

print(name)
print(bday)
print(birthplace)

Output:

Steven Paul Jobs
1955-02-24
San Francisco, California, US

PS: You don’t have to call read on the result of urlopen; BeautifulSoup accepts file-like objects.
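
A small sketch of that point, with io.BytesIO standing in for the urlopen response so it runs without a network connection (the HTML bytes are made up for illustration):

```python
import io
from bs4 import BeautifulSoup

# io.BytesIO stands in for the response object returned by urlopen;
# both expose a read() method.
fake_response = io.BytesIO(b"<html><head><title>Steve Jobs</title></head></html>")

# The file-like object is passed directly, without calling read() first.
soup = BeautifulSoup(fake_response, "html.parser")
page_title = soup.title.text
print(page_title)
```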

Answered By: Steinar Lima