Beautiful Soup not returning anything I expected

Question

Background:
Following along with a Udemy tutorial which is parsing some information from Bing.
It takes in a user input and uses that as a parameter to search Bing with, returning all the href links it can find on the first page

Code:

from bs4 import BeautifulSoup
import requests as re

search = input("Enter what you wanna search: \n")
params = {"q": search}
r = re.get("https://www.bing.com/search", params=params)

soup = BeautifulSoup(r.text, 'html.parser')

results = soup.find("ol",{"id":"b_results"})
links = results.findAll("li",{"class": "b_algo"})


for item in links:
    item_text = item.find("a").text
    item_href = item.find("a").attrs["href"]

    if item_text and item_href:
        print(item_text)
        print(item_href)

    else:
        print("Couldn't find 'a' or 'href'")

Problem:
It returns nothing, although the code obviously works for him. I get no errors. I’ve checked the id and class names to see whether they have changed on Bing since the video was made, but they are still the same:

"ol",{"id":"b_results"}
"li",{"class": "b_algo"}

Any ideas? I’m a complete noob to web scraping but intermediate to Python.

Thanks in advance!

Asked By: user13641095


Answer 1

Your script is working fine. If you look carefully at the response from requests (e.g. save r.text to a file), you’ll see that it is full of JavaScript.

Following this method, you’ll see that the body is full of <script> tags:

<!DOCTYPE html>
<body>
<script>(...)</script>
<script>(...)</script>
<script>(...)</script>
</body>
</html>
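
One way to run this check without opening the file by hand is sketched below. It assumes the response body is already available as a string (r.text in the question). The helper names summarize_html and save_and_summarize and the file name bing_response.html are made up for illustration, and the substring checks are deliberately crude.

```python
import re
from pathlib import Path


def summarize_html(html):
    """Report a few rough facts about a response body."""
    return {
        "length": len(html),
        # Count opening <script> tags, case-insensitively.
        "script_tags": len(re.findall(r"<script\b", html, flags=re.IGNORECASE)),
        # Crude substring checks for the id/class the question looks for.
        "has_b_results": "b_results" in html,
        "has_b_algo": "b_algo" in html,
    }


def save_and_summarize(html, path="bing_response.html"):
    """Write the body to disk for manual inspection, then summarize it."""
    Path(path).write_text(html, encoding="utf-8")
    return summarize_html(html)


# Hypothetical body shaped like the script-only response shown above.
sample = (
    "<!DOCTYPE html><html><body>"
    "<script>(...)</script><script>(...)</script>"
    "</body></html>"
)
print(summarize_html(sample))
```

In the question’s script this would be called as save_and_summarize(r.text). Note that the question imports requests under the alias re, which would shadow the standard re module used here, so a plain import requests is safer.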

I suggest trying another website, or using Selenium. Did Udemy really ask you to scrape bing.com?

Answered By: pyOliv

Answer 2

Your code needs a bit of reworking.

First of all, you need headers; otherwise Bing (correctly) thinks you’re a bot and doesn’t return the HTML.

Then, you need to check that the anchors are not None and, say, have at least http in the href.

For example:

from bs4 import BeautifulSoup
import requests


headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36",
}
page = requests.get("https://www.bing.com/search", headers=headers, params={"q": "python"}).text
soup = BeautifulSoup(page, 'html.parser')

anchors = soup.find_all("a")
for anchor in anchors:
    if anchor is not None:
        try:
            if "http" in anchor["href"]:
                print(anchor.getText(), anchor["href"])
        except KeyError:
            continue

Output:

Welcome to Python.org https://www.python.org/
Diese Seite übersetzen http://www.microsofttranslator.com/bv.aspx?ref=SERP&br=ro&mkt=de-DE&dl=de&lp=EN_DE&a=https%3a%2f%2fwww.python.org%2f
Python Downloads https://www.python.org/downloads/
Windows https://www.python.org/downloads/windows/
Python for Beginners https://www.python.org/about/gettingstarted/
About https://www.python.org/about/
Documentation https://www.python.org/doc/
Community https://www.python.org/community/
Success Stories https://www.python.org/success-stories/
News https://www.python.org/blogs/
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
CC-BY-SA-Lizenz http://creativecommons.org/licenses/by-sa/3.0/
Python lernen - Python Kurs für Anfänger und Fortgeschrittene https://www.python-lernen.de/
Python 3.9.0 (64bit) für Windows - Download https://python.de.uptodown.com/windows
Python-Tutorial: Tutorial für Anfänger und Fortgeschrittene https://www.python-kurs.eu/kurs.php
Mehr zu python-kurs.eu anzeigen https://www.python-kurs.eu/kurs.php
Python (Programmiersprache) – Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29
Python (Programmiersprache) - Wikipedia https://de.wikipedia.org/wiki/Python_%28Programmiersprache%29

By the way, what course is this? Scraping search engines is not easy.

Answered By: baduker

Answer 3

This answer complements baduker’s solution.

First of all, yes, the code from that Udemy course could be outdated, especially the selectors/HTML elements it uses to extract certain data.

Bing has changed some of its selectors a few times.

As baduker said, it’s because Bing detects that the request was sent by a script. The default requests user-agent is python-requests, so when you make a request, Bing sees that user-agent and treats the request as unusual.

To override the default user-agent, we can pass a new one in custom request headers. You can check what your own user-agent is.
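
A small sketch of the point above, assuming only that requests is installed. It prints the default user-agent that requests would send and then overrides it on a Session. No request is actually made, and the browser string is just an example value.

```python
import requests

# The User-Agent that requests sends when none is supplied,
# of the form "python-requests/<version>".
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A new Session starts with that same default header.
session = requests.Session()
print(session.headers["User-Agent"])

# Updating the session headers replaces the default for every
# request later made through this session.
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
    )
})
print(session.headers["User-Agent"])
```

For a one-off call, passing headers=... to requests.get has the same effect for that single request.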

Additionally, you can rotate user-agents, which reduces the chance of being blocked a little more. You can also experiment with user-agents for different devices, such as desktop, tablet, or mobile.
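
Rotation can be as simple as picking a user-agent at random for each request. The following is a minimal sketch, not a blocking-proof solution: the USER_AGENTS list and the build_headers helper are hypothetical names, and the two strings are the desktop user-agents already used in this thread.

```python
import random

# Pool of user-agent strings to rotate through (extend as needed,
# e.g. with tablet or mobile user-agents).
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.67 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
]


def build_headers():
    """Return request headers with a randomly chosen user-agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = build_headers()
print(headers["User-Agent"])
```

Calling build_headers() before each request and passing the result as headers=... gives every request a user-agent drawn from the pool.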

Code:

from bs4 import BeautifulSoup
import requests

search = input('Enter what you wanna search: ')

params = {'q': search}

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}

html = requests.get('https://www.bing.com/search', params=params, headers=headers)
soup = BeautifulSoup(html.text, 'html.parser')

for result in soup.select('.b_algo h2'):
    link = result.select_one('a')['href']
    print(link)

Outputs (only links from organic results, not from the knowledge graph, related searches, etc):

Enter what you wanna search: minecraft
https://www.minecraft.net/
https://classic.minecraft.net/

https://www.malavida.com/en/soft/minecraft/
https://nl.ccm.net/download/downloaden-34090900-minecraft
https://www.systemrequirementslab.com/cyri/requirements/minecraft/11356
https://www.funnygames.nl/spel/minecraft.html
https://spele.nl/minecraft-games/
https://www.youtube.com/user/TeamMojang
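
If a result block has no anchor, select_one('a') returns None and indexing it with ['href'] raises a TypeError. A slightly more defensive variant might look like the sketch below. The function name extract_organic_links and the SAMPLE_HTML snippet are invented for illustration, so the example runs offline and does not reflect Bing’s real markup in full.

```python
from bs4 import BeautifulSoup


def extract_organic_links(html):
    """Return (title, href) pairs for anchors inside `.b_algo h2` blocks."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.select(".b_algo h2"):
        # Only accept anchors that actually carry an href attribute.
        anchor = heading.select_one("a[href]")
        if anchor is None:
            continue  # skip headings without a usable link
        links.append((anchor.get_text(strip=True), anchor["href"]))
    return links


# Hypothetical, heavily simplified stand-in for a results page.
SAMPLE_HTML = """
<ol id="b_results">
  <li class="b_algo"><h2><a href="https://www.minecraft.net/">Minecraft</a></h2></li>
  <li class="b_algo"><h2>No link here</h2></li>
  <li class="b_ans"><h2><a href="https://example.com/">Not organic</a></h2></li>
</ol>
"""

print(extract_organic_links(SAMPLE_HTML))
# [('Minecraft', 'https://www.minecraft.net/')]
```

An empty list from this function on a real response is a hint that Bing returned different markup (or a script-only page), rather than a reason to expect an exception.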

If you don’t want to deal with bypassing blocks or maintaining your code, have a look at Bing Search API from SerpApi.

Example code:

from serpapi import GoogleSearch

search = input('Enter what you wanna search: ')

params = {
    'api_key': '<your-serpapi-api-key>',
    'device': 'desktop',
    'engine': 'bing',
    'q': search
    # other parameters
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(result['link'])

Outputs:

Enter what you wanna search: minecraft
https://www.minecraft.net/
https://apps.apple.com/us/app/minecraft/id479516143
https://classic.minecraft.net/
https://education.minecraft.net/en-us/mobile

Answered By: Dmitriy Zub
