Scraping a game name containing "@": the scraper treats the name as an email address

Question:

I want to scrape information about games. However, some game names contain "@", such as the game "Ampers@t".

When I try to scrape the title of such a game, my code returns "[email protected]". Apparently the name is being treated as an email address rather than as a game title.

Here is the code I used.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from bs4 import BeautifulSoup
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount("http://", adapter)
session.mount("https://", adapter)

def grab_soup(url):
    """Takes a url and returns a BeautifulSoup object"""
    response = session.get(url, headers={"User-Agent": "Mozilla/5.0"})
    assert response.status_code == 200, "Problem with url request! %s throws %s" % (
        url,
        response.status_code,
    )  # checking that it worked
    page = response.text
    soup = BeautifulSoup(page, "lxml")
    return soup

soup = grab_soup("https://www.mobygames.com/game/amperst")

header = soup.find(class_="niceHeaderTitle")("a")

The output I expect for header is

<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
...

However, the output is:

<a href="https://www.mobygames.com/game/amperst"><span class="__cf_email__" data-cfemail="ce8fa3beabbcbd8eba">[email protected]</span></a>,
...

I then checked the page source for the game, and it does contain the following:

<div class="rightPanelHeader">
<h1 class="niceHeaderTitle">
<a href="https://www.mobygames.com/game/amperst">
<span class="__cf_email__" data-cfemail="cd8ca0bda8bfbe8db9">[email&#160;protected]
</span>
</a>

So my code presumably returns this content because that is what the page source contains. Is there a way to get around this?

I found this solution, but it is not practical for me: I have a lot of games to scrape and cannot run that code separately for each game.

Asked By: Xiaowei Zhang

||

Answers:

[Expanded from my comment] If you tweak the function from this answer a bit to

def deCFEmail(encTag):
    # The encoded value may sit on the tag itself or on a descendant (e.g. a <span>)
    if not (encTag.get('data-cfemail') or encTag.select('*[data-cfemail]')):
        encTag.append('[! no "data-cfemail" attribute !]')
    else:
        fp = encTag.get('data-cfemail', None)
        if fp is None:
            fp = encTag.select_one('*[data-cfemail]').get('data-cfemail')
        try:
            # First hex byte is the XOR key; each following byte XOR key is one character
            r = int(fp[:2], 16)
            encTag.string = ''.join(chr(int(fp[i:i+2], 16) ^ r) for i in range(2, len(fp), 2))
        except Exception as e:
            encTag.append(f'! failed to decode "{e}"')
    return encTag

then you can use it conditionally:

for i, h in enumerate(header):
    # \xa0 is the non-breaking space (&#160;) in the placeholder text
    if h.get_text() == '[email\xa0protected]':
        header[i] = deCFEmail(h)

So, header can go from

[<a href="https://www.mobygames.com/game/amperst"><span class="__cf_email__" data-cfemail="8ecfe3feebfcfdcefa">[email protected]</span></a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/add-to-want-list">+ Want</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/add-to-have-list">+ Have</a>,
 <a class="btn btn-xs btn-mobysuccess" href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]

to

[<a href="https://www.mobygames.com/game/amperst">Ampers@t</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/forums">Discuss</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/sheet/review_game/amperst/">Review</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/add-to-want-list">+ Want</a>,
 <a class="btn btn-xs btn-clear" href="https://www.mobygames.com/game/amperst/add-to-have-list">+ Have</a>,
 <a class="btn btn-xs btn-mobysuccess" href="https://www.mobygames.com/game/amperst/contribute">Contribute</a>]
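
For reference, here is the decoding step on its own, without BeautifulSoup. This is only a minimal sketch of the same XOR logic used in deCFEmail above; the function name decode_cfemail is hypothetical, and like the function above it assumes every decoded byte maps directly to one character.

```python
def decode_cfemail(hex_value):
    """Decode a data-cfemail hex string (hypothetical helper name).

    The first two hex digits are the XOR key; every following pair of
    digits, XORed with that key, gives one character of the hidden text.
    """
    key = int(hex_value[:2], 16)
    return "".join(
        chr(int(hex_value[i:i + 2], 16) ^ key)
        for i in range(2, len(hex_value), 2)
    )


# Value taken from the header output shown above
print(decode_cfemail("8ecfe3feebfcfdcefa"))  # prints Ampers@t
```

The key differs between requests (the question shows ce... and cd..., the output above shows 8e...), but each value decodes to the same title.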

On the comment that "it is kind of a speed requirement":

The additional time would be comparable to adding one extra find call, and in any case it would be insignificant compared to the parsing time (for soup = BeautifulSoup(page, "lxml")).
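
Because the check is this cheap, it can run on every page without knowing in advance which titles contain "@". One possible way, sketched here with only the standard library, is to decode the obfuscated spans in the raw HTML before it is parsed. The names decode_cfemail and unprotect_html are hypothetical, and the regular expression assumes the markup has the `<span class="__cf_email__" data-cfemail="...">` form shown in the question; other variants are not handled.

```python
import html
import re

# Matches the obfuscated <span> form shown in the question. Assumption: the
# span carries a data-cfemail attribute and contains no nested <span>.
CF_SPAN = re.compile(
    r'<span[^>]*\bdata-cfemail="([0-9a-fA-F]+)"[^>]*>.*?</span>',
    re.DOTALL,
)


def decode_cfemail(hex_value):
    """First hex byte is the XOR key; the rest are the encoded characters."""
    key = int(hex_value[:2], 16)
    return "".join(
        chr(int(hex_value[i:i + 2], 16) ^ key)
        for i in range(2, len(hex_value), 2)
    )


def unprotect_html(page):
    """Replace every obfuscated span in the raw HTML with its decoded text."""
    # html.escape keeps the decoded text safe to embed back into markup
    return CF_SPAN.sub(lambda m: html.escape(decode_cfemail(m.group(1))), page)


sample = (
    '<a href="https://www.mobygames.com/game/amperst">'
    '<span class="__cf_email__" data-cfemail="cd8ca0bda8bfbe8db9">'
    '[email&#160;protected]\n</span>\n</a>'
)
print(unprotect_html(sample))
```

With something like this, grab_soup from the question could build the soup from unprotect_html(response.text) instead of response.text, so every page goes through the same path and pages without obfuscated spans are left unchanged.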

Answered By: Driftr95