BeautifulSoup deleting first half of HTML?

Question

I’m practicing with BeautifulSoup and HTTP requests in general for the first time. The goal of the programme is to load a webpage and its HTML, then search through the page (in this case a recipe) to get a substring containing its ingredients. I’ve managed to get it working with the following code:

import requests

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
myHTML = result.text
# Slice the raw HTML between the two JSON keys
index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]

But when I try and use BeautifulSoup here:

import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text="recipeIngredient")
print(ingredients)

I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that’s all I’m focused on for now while I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it only showed what appears to be the second half of the HTML (or at least not all of it), whereas the text file does contain all of the HTML. I assume that’s where the problem lies, but I’m not sure how to fix it.

Thank you.

Asked By: iFallOffStuff

Answer 1

You need to use:

class_="recipe__ingredients"

For example:

import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)

ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)

print(ingredients)

Output:

1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve
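
The same lookup can be tried without a network request. The sketch below runs against a made-up HTML snippet (only the class name "recipe__ingredients" comes from the answer; the surrounding markup is an assumption about the page's shape) and guards against find() returning None, which is what happens if the site ever renames the class:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the recipe page; the real markup may differ.
html = """
<section class="recipe__ingredients">
  <ul>
    <li>1 large onion, chopped</li>
    <li>4 large garlic cloves</li>
  </ul>
</section>
"""

doc = BeautifulSoup(html, "html.parser")
container = doc.find(class_="recipe__ingredients")

# find() returns None when nothing matches, so check before calling find_all().
if container is None:
    ingredients = []
else:
    ingredients = [li.get_text(strip=True) for li in container.find_all("li")]

print("\n".join(ingredients))
```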

Answered By: baduker

Answer 2

It outputs None because find(text="recipeIngredient") looks for a tag whose entire text content is exactly 'recipeIngredient', and nothing on the page matches that.
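
A minimal sketch of that behaviour, using a made-up HTML snippet rather than the real page: an exact string match finds nothing when the word only appears inside a longer string, while a regular expression does a substring search. The string= argument used here is the newer name for text=.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet: the word only occurs inside a longer script string.
html = (
    '<html><head><script type="application/ld+json">'
    '{"recipeIngredient": ["1 large onion"]}'
    "</script></head><body><p>Ingredients</p></body></html>"
)

doc = BeautifulSoup(html, "html.parser")

# Exact match: no string in the document equals "recipeIngredient" in full.
exact = doc.find(string="recipeIngredient")

# Regex match: finds any string that contains the word.
partial = doc.find(string=re.compile("recipeIngredient"))

print(exact)                # None
print(partial.parent.name)  # script
```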

What you actually want to do with bs4 is find the specific tags and/or attributes that hold the data you want. For example, as @baduker points out, the ingredients in the HTML are within the tag with the class attribute "recipe__ingredients".

The string 'recipeIngredient' that you pull out in that first block of code actually comes from within the <script> tag in the HTML, which holds the ingredients in JSON format.

from bs4 import BeautifulSoup
import requests
import json

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find('script', {'type':'application/ld+json'}).text
jsonData = json.loads(ingredients)

print(jsonData['recipeIngredient'])
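
A page can carry more than one JSON-LD block, and the payload can be a list rather than a single object, so a slightly more defensive version may be useful. This is only a sketch: the HTML is made up, and find_recipe_ingredients is a hypothetical helper name, not part of bs4.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page with two JSON-LD blocks; only the second is a recipe.
html = """
<html><head>
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">
{"@type": "Recipe", "recipeIngredient": ["1 large onion", "4 large garlic cloves"]}
</script>
</head><body></body></html>
"""


def find_recipe_ingredients(markup):
    """Return the first recipeIngredient list found in any JSON-LD block, else None."""
    doc = BeautifulSoup(markup, "html.parser")
    for script in doc.find_all("script", {"type": "application/ld+json"}):
        try:
            data = json.loads(script.string or "")
        except json.JSONDecodeError:
            continue  # skip empty or malformed blocks
        # The payload may be a single object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and "recipeIngredient" in item:
                return item["recipeIngredient"]
    return None


print(find_recipe_ingredients(html))
```

Returning None instead of raising keeps the caller in charge of what to do when a page has no recipe data.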

Answered By: chitown88
