Scrape Email address from a Tripadvisor webpage

Question:

I am trying to scrape the email address from the following webpage using Python with BeautifulSoup (bs4) and requests, but the email address does not appear in the page source.

https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html

The email link opens in my mail app, but I could not find that link in the page source.
I understand this could be done by watching the network tab and making the same POST request that the website makes, but I could not make it work.

Thanks in advance!!

Asked By: Lavi Goyal

||

Answers:

The email is Base64-encoded inside the JSON assigned to the window.__WEB_CONTEXT__ variable on the page.

You can use this example to get all emails found on the page:

import re
import json
import base64
import requests
from bs4 import BeautifulSoup


url = 'https://www.tripadvisor.in/Attraction_Review-g189400-d2020955-Reviews-Chat_Tours-Athens_Attica.html'

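# Fetch the page and extract the object assigned to window.__WEB_CONTEXT__.
html_data = requests.get(url).text
data = re.search(r'window\.__WEB_CONTEXT__=({.*?});', html_data).group(1)
# The object uses an unquoted pageManifest key; quote it so the text parses as JSON.
data = json.loads(data.replace('pageManifest', '"pageManifest"'))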

# Recursively walk the parsed JSON and yield every non-empty value stored under an 'email' key.
def get_emails(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'email':
                if v:
                    yield v
            else:
                yield from get_emails(v)
    elif isinstance(val, list):
        for v in val:
            yield from get_emails(v)

for email in get_emails(data):
    # Each value is Base64-encoded; the regex expects 'mailto:<address>_' in the decoded text.
    email = base64.b64decode(email).decode('utf-8')
    email = re.search(r'mailto:(.*)_', email).group(1)

    print(email)

Prints:

[email protected]
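
The decoding step can also be tried offline. The following is a minimal sketch, not part of the original answer: decode_email is a hypothetical helper name, and the sample value is made up here, assuming only that the decoded text has the 'mailto:<address>_' shape that the regex above expects. It returns None instead of raising when a value is not valid Base64 or does not contain a mailto part.

```python
import base64
import re


def decode_email(encoded):
    """Return the address from a Base64-encoded 'mailto:' string, or None.

    Hypothetical helper: it assumes the decoded text looks like
    '<prefix>mailto:<address>_<suffix>', as the regex in the answer implies.
    """
    try:
        decoded = base64.b64decode(encoded).decode('utf-8')
    except (ValueError, UnicodeDecodeError):
        # Not valid Base64, or the decoded bytes are not valid UTF-8.
        return None
    match = re.search(r'mailto:(.*)_', decoded)
    return match.group(1) if match else None


# Made-up sample value, built here only so the sketch runs without a network call.
sample = base64.b64encode(b'abc_mailto:info@example.com_xyz').decode('ascii')

print(decode_email(sample))
print(decode_email('not base64!'))
```

With a helper like this, the loop over get_emails(data) could skip values that come back as None rather than stopping on the first unexpected entry.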
Answered By: Andrej Kesely