How to scrape the second <p> of a webpage using Python and BeautifulSoup

Question

I’ve been working with BeautifulSoup to scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I have scraped some elements successfully, but I’m struggling to scrape the movie description. The description sits in the HTML like this:

<div class="lister-item mode-advanced"> 
    <div class="lister-item-content">
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>

I want to scrape the second paragraph, which seemed easy to do, but everything I tried gave me ‘None’ as output. I’ve been digging around to find an answer. In another Stack Overflow post I found that

find('p:nth-of-type(1)')

or

find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')

could do the trick, but it still gives me

None  # as output

Below is a piece of my code. It’s a bit rough because I’m just trying things out to learn.

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description

The above code gives me this output:

$ python scrape.py
Logan
(2017)
8.1
None

I would like to learn the correct way to select HTML tags because it will be useful for future projects.

Asked By: imkeVr


Answer 1

The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.

You can then use the list’s index to get the element you need. Index starts at 0, so 1 will give the second item.

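Change the first_description line to the following. Note that the class on the page is text-muted, not muted-text, which is why find() returned None.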

first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
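
The same idea can be tried without hitting the network. The following is only a minimal sketch for Python 3: the inline HTML is a trimmed-down stand-in for one IMDb result block, and the helper name second_paragraph is made up for illustration. It shows that a class name that does not occur in the markup matches nothing, and that indexing the list returned by find_all() picks the second match.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one result block on the page.
HTML = """
<div class="lister-item mode-advanced">
  <div class="lister-item-content">
    <p class="text-muted"> paragraph I don't need</p>
    <p class="text-muted"> paragraph I need</p>
  </div>
</div>
"""


def second_paragraph(container, css_class="text-muted"):
    """Return the stripped text of the second matching <p>, or None if there are fewer than two."""
    paragraphs = container.find_all("p", class_=css_class)
    if len(paragraphs) < 2:
        return None
    return paragraphs[1].get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
first_movie = soup.find("div", class_="lister-item mode-advanced")

# A class name that does not occur in the markup matches nothing, so find() gives None.
print(first_movie.find("p", class_="muted-text"))

# find_all() returns a list, so index 1 is the second match.
print(second_paragraph(first_movie))
```

Checking the length before indexing avoids an IndexError on result blocks that happen to contain fewer than two matching paragraphs.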

Full code

import urllib2
from bs4 import BeautifulSoup
from requests import get

url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')

first_movie = movie_containers[0]

first_title = first_movie.h3.a.text
print first_title

first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year

first_imdb = float(first_movie.strong.text)
print first_imdb

# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description

Output

Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.

Read the Beautiful Soup documentation to learn the correct way to select HTML tags.

Also consider moving to Python 3.

Answered By: Bitto Bennichan

Answer 2

I was able to get it just by playing around with .next_sibling. There’s probably a more elegant way, though. At least it might give you a start or some direction.

from bs4 import BeautifulSoup


html = '''<div class="lister-item mode-advanced">
    <div class="lister-item-content">
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>'''


soup = BeautifulSoup(html, 'html.parser')

# Find the first <p>, then step over the whitespace text node to reach the second <p>.
first_p = soup.find('p', {'class': 'muted-text'})
second_p = first_p.next_sibling.next_sibling.text.strip()

print(second_p)

Output:

paragraph I need
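
One possibly tidier variant is find_next_sibling(), which skips the whitespace text nodes that sit between tags. The sketch below assumes Python 3 and reuses the question's snippet as an inline string, with the closing quote on the inner div added so that it parses as intended.

```python
from bs4 import BeautifulSoup

# Same structure as the question's snippet (closing quote on the inner div added).
HTML = """<div class="lister-item mode-advanced">
    <div class="lister-item-content">
       <p class="muted-text"> paragraph I don't need</p>
       <p class="muted-text"> paragraph I need</p>
    </div>
</div>"""

soup = BeautifulSoup(HTML, "html.parser")
first_p = soup.find("p", class_="muted-text")

# .next_sibling returns the very next node, which here is the whitespace between the tags.
print(repr(first_p.next_sibling))

# find_next_sibling("p") skips text nodes and returns the next <p> tag, or None if there is none.
second_p = first_p.find_next_sibling("p")
print(second_p.get_text(strip=True))
```

Because find_next_sibling() filters by tag name, it keeps working if the amount of whitespace or the number of text nodes between the two paragraphs changes.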

Answered By: chitown88

Answer 3

BeautifulSoup 4.7.1 supports :nth-child() and other CSS4 selectors

first_description = soup.select_one('.lister-item-content p:nth-child(4)')
# or 
#first_description = soup.select_one('.lister-item-content p:nth-of-type(2)')

print(first_description.text.strip())
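
To see how the two selectors differ, here is a small sketch for Python 3 that runs on an inline string. The markup is hypothetical (a heading, a paragraph, a div, then the wanted paragraph; the class name ratings-bar is made up), and it assumes a Beautiful Soup release recent enough to bundle the Soup Sieve selector engine. It also shows why passing a selector string to find() gives None: find() treats its first argument as a tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading, a <p>, a <div>, then the <p> we want.
HTML = """
<div class="lister-item-content">
  <h3>Title</h3>
  <p class="text-muted">paragraph I don't need</p>
  <div class="ratings-bar">8.1</div>
  <p class="text-muted">paragraph I need</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() looks for a tag literally named "p:nth-of-type(2)", so nothing matches.
print(soup.find("p:nth-of-type(2)"))

# :nth-of-type(2) counts only <p> siblings; :nth-child(4) counts every element sibling.
by_type = soup.select_one(".lister-item-content p:nth-of-type(2)")
by_child = soup.select_one(".lister-item-content p:nth-child(4)")

print(by_type.get_text(strip=True))
print(by_child.get_text(strip=True))
```

In this markup both selectors land on the same paragraph, but :nth-of-type(2) is less sensitive to other elements being added or removed between the paragraphs.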

Answered By: ewwink
