BeautifulSoup deleting first half of HTML?
Question:
I’m practicing with BeautifulSoup and HTML requests in general for the first time. The goal of the programme is to load a webpage and its HTML, then search through the webpage (in this case a recipe) to get a substring containing its ingredients. I’ve managed to get it working with the following code:
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
myHTML = result.text
index1 = myHTML.find("recipeIngredient")
index2 = myHTML.find("recipeInstructions")
ingredients = myHTML[index1:index2]
But when I try and use BeautifulSoup here:
url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
ingredients = doc.find(text = "recipeIngredient")
print(ingredients)
I understand that the code above (even if I could get it working) would produce a different output of just ["recipeIngredient"], but that’s all I’m focused on for now whilst I get to grips with BS. Instead, the code above just outputs None. I printed "doc" to the terminal and it would only output what appears to be the second half of the HTML (or at least not all of it), whereas the text file does contain all of the HTML, so I assume that’s where the problem lies, but I’m not sure how to fix it.
Thank you.
Answers:
You need to use:
class_="recipe__ingredients"
For example:
import requests
from bs4 import BeautifulSoup

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
# Parse the page and locate the element with class "recipe__ingredients"
doc = (
    BeautifulSoup(requests.get(url).text, "html.parser")
    .find(class_="recipe__ingredients")
)
# Put the text of each <li> inside that element on its own line
ingredients = "\n".join(
    ingredient.getText() for ingredient in doc.find_all("li")
)
print(ingredients)
Output:
1 large onion , chopped
4 large garlic cloves
thumb-sized piece of ginger
2 tbsp rapeseed oil
4 small skinless chicken breasts, cut into chunks
2 tbsp tikka spice powder
1 tsp cayenne pepper
400g can chopped tomatoes
40g ground almonds
200g spinach
3 tbsp fat-free natural yogurt
½ small bunch of coriander , chopped
brown basmati rice , to serve
It outputs None because find(text="recipeIngredient") looks for a place where the content within HTML tags is exactly 'recipeIngredient', and nothing on the page matches that.
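To make the difference concrete, here is a minimal sketch using a made-up snippet of HTML rather than the real page (the markup below is an assumption for illustration only). It shows that passing a plain string matches only a text node equal to that whole string, while passing a compiled regular expression matches text nodes that merely contain it. It uses string=, which is the newer name for the text= argument in Beautiful Soup 4.4.0 and later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical miniature page, NOT the real BBC Good Food markup.
html = """
<html><body>
<h1>Healthy tikka masala</h1>
<script type="application/ld+json">{"recipeIngredient": ["1 large onion"]}</script>
</body></html>
"""
doc = BeautifulSoup(html, "html.parser")

# A plain string must equal an entire text node, so nothing matches here.
exact = doc.find(string="recipeIngredient")

# A compiled regex is searched for inside each text node, so the
# contents of the <script> tag match.
partial = doc.find_all(string=re.compile("recipeIngredient"))

print(exact)
print(len(partial))
```

The substring match only tells you where the word occurs. To get the ingredients themselves you still need to pick out the right tag or parse the JSON.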
What you actually want to do with bs4 is find the specific tags and/or attributes that hold the data/content you want. For example, as @baduker points out, the ingredients in the HTML are within the tag with the class attribute "recipe__ingredients".
The string 'recipeIngredient' that you pull out in that first block of code actually comes from within the <script> tag in the HTML, which holds the ingredients in JSON format.
from bs4 import BeautifulSoup
import requests
import json

url = "https://www.bbcgoodfood.com/recipes/healthy-tikka-masala"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
# find() returns the first matching <script> tag; take its JSON text
ingredients = doc.find('script', {'type':'application/ld+json'}).text
# Parse the JSON text into Python objects and read the ingredient list
jsonData = json.loads(ingredients)
print(jsonData['recipeIngredient'])
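The snippet above relies on the first JSON-LD <script> tag being the one that holds the recipe. If a page has several such tags, or the JSON is a list of objects, a slightly more defensive version may help. The following is only a sketch under those assumptions: the helper name find_ingredients and the sample markup are hypothetical, and it runs against an inline string rather than the live site.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup with two JSON-LD blocks; the real page may differ.
html = """
<script type="application/ld+json">{"@type": "WebSite", "name": "Example"}</script>
<script type="application/ld+json">
{"@type": "Recipe", "recipeIngredient": ["1 large onion, chopped", "4 large garlic cloves"]}
</script>
"""


def find_ingredients(markup):
    """Return the first recipeIngredient list in any JSON-LD block, else None."""
    doc = BeautifulSoup(markup, "html.parser")
    for script in doc.find_all("script", attrs={"type": "application/ld+json"}):
        if not script.string:
            continue  # empty <script> tag, nothing to parse
        try:
            data = json.loads(script.string)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and "recipeIngredient" in item:
                return item["recipeIngredient"]
    return None


ingredients = find_ingredients(html)
print(ingredients)
```

To try it on the real page, you could pass requests.get(url).text to find_ingredients instead of the inline string.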