Scrape Email address from a Tripadvisor webpage

Question

I am trying to scrape the email address from the following webpage using Python with BeautifulSoup (bs4) and requests, but the email address does not appear in the page source.

https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html

Clicking the email link opens the address in my mail app, but I could not find the corresponding link in the page source.
I understand this could be done by watching the network tab and replaying the same POST request that the website makes, but I could not get that to work.

Thanks in advance!!

Asked By: Lavi Goyal

Answer 1

The email is Base64-encoded inside a JSON-like variable (window.__WEB_CONTEXT__) embedded in the page.

You can use this example to get all emails found on the page:

import re
import json
import base64
import requests
from bs4 import BeautifulSoup


url = 'https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html'

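# Download the page; the data we need is embedded in an inline script.
html_data = requests.get(url).text
# Grab the JavaScript object literal assigned to window.__WEB_CONTEXT__.
data = re.search(r'window\.__WEB_CONTEXT__=({.*?});', html_data).group(1)
# The key pageManifest is unquoted in the page, so quote it to get valid JSON.
data = json.loads(data.replace('pageManifest', '"pageManifest"'))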

# Recursively walk the parsed JSON and yield every non-empty 'email' value.
def get_emails(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'email':
                if v:
                    yield v
            else:
                yield from get_emails(v)
    elif isinstance(val, list):
        for v in val:
            yield from get_emails(v)

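for email in get_emails(data):
    # Each value is Base64; the regex expects "mailto:<address>_" in the decoded text.
    decoded = base64.b64decode(email).decode('utf-8')
    match = re.search(r'mailto:(.*)_', decoded)
    # Skip values that do not match instead of failing on None.
    if match:
        print(match.group(1))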

Prints:

[email protected]
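
The decode step can also be tried offline, without requesting the page. The following is a minimal sketch, not the answer's exact code: the names find_values and decode_email, the nested sample_data structure, and the address info@example.com are all made up for illustration. It assumes that the decoded value wraps the address as "mailto:<address>_", which is what the regex in the answer implies.

```python
import base64
import re


def find_values(node, key):
    """Yield every non-empty value stored under `key` in nested dicts/lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                if v:
                    yield v
            else:
                yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def decode_email(encoded):
    """Base64-decode a value and pull out the address after 'mailto:'.

    Returns None when the decoded text does not match the assumed layout.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    match = re.search(r"mailto:(.*)_", decoded)
    return match.group(1) if match else None


# Hypothetical payload: the prefix/suffix around the address is an assumption.
sample_payload = base64.b64encode(b"abc_mailto:info@example.com_xyz").decode("ascii")

# Hypothetical nested structure standing in for the parsed page data.
sample_data = {
    "page": {
        "listings": [
            {"name": "Example Tours", "contact": {"email": sample_payload, "phone": None}},
            {"name": "No Email", "contact": {"email": None}},
        ]
    }
}

emails = [decode_email(value) for value in find_values(sample_data, "email")]
print(emails)  # ['info@example.com']
```

Returning None for a non-matching value lets the caller skip unexpected entries rather than stop with an AttributeError.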

Answered By: Andrej Kesely
