Unable to produce results from a webpage using requests module

Question:

After accessing this website, when I fill in the input box (City or zip) with Miami, FL and hit the search button, I can see the related results displayed on that site.

I want to reproduce this with the requests module. I tried to follow the requests shown in dev tools, but for some reason the script below produces this output:

You are not authorized to access this request.

I’ve tried with:

import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

URL = "https://www.realtor.com/realestateagents/"
link = 'https://www.realtor.com/realestateagents/api/v3/search'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'Accept': 'application/json, text/plain, */*',
    'referer': 'https://www.realtor.com/realestateagents/',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9,bn;q=0.8',
    'X-Requested-With': 'XMLHttpRequest',
    'x-newrelic-id': 'VwEPVF5XGwQHXFNTBAcAUQ==',
    'authorization': 'Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ.Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM'
}

params = {
    'nar_only': '1',
    'offset': '',
    'limit': '20',
    'marketing_area_cities': 'FL_Miami',
    'postal_code': '',
    'is_postal_search': 'true',
    'name': '',
    'types': 'agent',
    'sort': 'recent_activity_high',
    'far_opt_out': 'false',
    'client_id': 'FAR2.0',
    'recommendations_count_min': '',
    'agent_rating_min': '',
    'languages': '',
    'agent_type': '',
    'price_min': '',
    'price_max': '',
    'designations': '',
    'photo': 'true',
    'seoUserType': "{'isBot':'false','deviceType':'desktop'}",
    'is_county_search': 'false',
    'county': ''
}

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get(link,params=params)
    print(res.status_code)
    print(res.json())

EDIT:

For those who think using res.json() is pointless, see this image, which was taken straight from the dev tool. If I could set up params and headers correctly while submitting requests, I could utilize res.json() successfully.

Asked By: robots.txt

||

Answers:

Based on your question – as asked – you are looking to pull information from that website, using requests. Here is a way of doing just that, with Python’s Requests:

import requests
from tqdm.notebook import tqdm
from bs4 import BeautifulSoup as bs

headers = {
    'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36"
    }
s = requests.Session()
s.headers.update(headers)
for x in tqdm(range(1, 5)):
    url = f'https://www.realtor.com/realestateagents/miami_fl/pg-{x}'    
    r = s.get(url)
    soup = bs(r.text, 'html.parser')
    agent_cards = soup.select('div[data-testid="component-agentCard"]')
    for a in agent_cards:
        agent_name = a.select_one('div.agent-name').get_text()
        agent_group = a.select_one('div.agent-group').get_text()
        agent_phone = a.select_one('div.agent-phone').get_text()
        print(agent_name, '|', agent_group, '|', agent_phone)

Result in terminal:

100%
4/4 [00:05<00:00, 1.36s/it]
Edmy Gomez | Coldwell Banker Realty | (954) 434-0501
Nidia L Cortes PA | Beachfront Realty Inc | (786) 287-9268
Rodney Ward | Coldwell Banker Realty | (305) 253-2800
Onelia Hurtado | Elevate Real Estate Brokers | (954) 559-8252
Gustavo Cabrera | Belhouse Real Estate, Llc | (305) 794-8533
Hermes Pallaviccini |  Global Luxury Realty LLC | (305) 772-7232
Maria Carrillo | Keyes - Brickell Office | (305) 984-3180
Nancy Batchelor, P.A. | COMPASS | (305) 903-2850
Winnie Uricola | Keyes - Hollywood Office | (305) 915-7721
monica Deluca | Re/Max Powerpro Realty | (954) 552-1224
Maria Cristina Korman | Keller Williams Realty Partners SW | (954) 588-2850
Ines Hegedus-Garcia | Avanti Way | (305) 758-2323
Jean-Paul Figallo | Concierge Real Estate | (754) 281-9912
[...]

You may want to increase the range to the total number of pages.
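
If the total number of pages is not known in advance, one option is to keep requesting pages until one comes back without any agent cards. The sketch below assumes that a page past the end of the results contains no cards, which has not been verified against the site. `fetch_cards` is a hypothetical callable standing in for the request-and-parse step above, and `max_pages` is a safety cap.

```python
from typing import Callable, Iterator, List


def iter_all_cards(fetch_cards: Callable[[int], List[str]],
                   max_pages: int = 1000) -> Iterator[str]:
    """Yield cards page by page until a page comes back empty.

    fetch_cards(page) is a hypothetical helper: it should request
    page number `page` and return the parsed cards as a list.
    """
    for page in range(1, max_pages + 1):
        cards = fetch_cards(page)
        if not cards:  # assumed end-of-results signal
            break
        yield from cards


# Offline demonstration with fake data instead of real HTTP requests.
fake_site = {1: ["agent A", "agent B"], 2: ["agent C"]}
all_cards = list(iter_all_cards(lambda page: fake_site.get(page, [])))
print(all_cards)
```

In a real run, `fetch_cards` would wrap the `s.get(url)` and `soup.select(...)` calls from the loop above.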

Answered By: Barry the Platipus

The error indicates that you are not authorised to access the API. You may want to check whether your token has expired.

In general, requests.get is not the best way to mimic user actions such as filling in forms and clicking a search button on a website.

Try using a browser automation tool such as Selenium [1].

But if you already know the website structure, as is the case with your example, you may not need to fill in the form at all. You can send a GET request directly to the results page and then parse the content as shown in the other answer.

For example, the website has a page for Miami, Florida (https://www.realtor.com/realestateagents/miami_fl). You can fetch the content of this page directly with requests.
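
One way to check for expiry is to decode the token's payload, which is only base64url-encoded JSON. The sketch below uses the standard library and the token copied from the question's `authorization` header. It reads the claims without verifying the signature, so it is suitable for inspection only.

```python
import base64
import json
import time

# Token copied from the 'authorization' header in the question (without "Bearer ").
token = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJleHAiOjE2NjQ1MjU0NDQsInN1YiI6ImZpbmRfYV9yZWFsdG9yIiwiaWF0IjoxNjY0NTI0Nzk2fQ."
    "Q2jryTAD5vgsJ37e1SylBnkaeK7Cln930Q8KL4ANqsM"
)


def jwt_claims(token: str) -> dict:
    """Decode the payload (middle segment) of a JWT without verifying it."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


claims = jwt_claims(token)
lifetime = claims["exp"] - claims["iat"]   # seconds between issue and expiry
is_expired = claims["exp"] < time.time()   # compare expiry with the current time
print(claims["sub"], lifetime, is_expired)
```

The `exp` and `iat` claims are Unix timestamps, so their difference is the token's lifetime in seconds.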

Option 1 using browser automation

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.realtor.com/realestateagents/')
# Type the location into the search box and submit the form
loc = driver.find_element(By.ID, 'srchHomeLocation')
loc.send_keys("Miami, FL")
search_button = driver.find_element(By.ID, 'far_search_button')
search_button.click()
# page_source is already a string, so pass it to BeautifulSoup directly
soup = bs(driver.page_source, 'html.parser')
# ... continue parsing the content with soup

Option 2 using requests

import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.realtor.com/realestateagents/miami_fl")
soup = bs(r.text, 'html.parser')
# ... continue parsing the content with soup

In both cases you need to handle page navigation, either by clicking "next" in Selenium or by sending a GET request for each of the 493 pages.

Finally, res.json() doesn’t convert HTML to JSON. It parses the response body as JSON, and only succeeds if the server actually returned JSON.
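
To make that concrete, here is a small sketch of a guard that only attempts JSON parsing when the response claims to be JSON. `json_or_none` is a hypothetical helper, and the sample bodies are made up. With requests, the two arguments would be `res.headers.get("Content-Type", "")` and `res.text`.

```python
import json
from typing import Any, Optional


def json_or_none(content_type: str, body: str) -> Optional[Any]:
    """Parse body as JSON only if the Content-Type says it is JSON."""
    if "json" not in content_type.lower():
        return None
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return None


# Made-up sample responses: an HTML page and a JSON document.
html_result = json_or_none("text/html; charset=utf-8", "<html>...</html>")
json_result = json_or_none("application/json", '{"agents": []}')
print(html_result, json_result)
```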

[1] https://www.selenium.dev/documentation/webdriver/

Answered By: yosemite_k

The issue is that the authorization token expires shortly after it is issued, so you need to generate a fresh one for your request instead of reusing a hardcoded one.

First, get the JWT secret used to sign the tokens. It is embedded in the page's HTML source and can be extracted with a regular expression:

# Which is hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]
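
Note that `findall(...)[0]` raises an IndexError if the page no longer contains the secret. A slightly more defensive variant, sketched here with the same pattern, returns None instead. `extract_jwt_secret` is a hypothetical helper name, and the HTML fragments are made up for illustration.

```python
import re
from typing import Optional


def extract_jwt_secret(html: str) -> Optional[str]:
    """Return the value of "JWT_SECRET" in the page source, or None if absent."""
    match = re.search(r'"JWT_SECRET":"(.*?)"', html)
    return match.group(1) if match else None


# Made-up page fragments for illustration; the real page source will differ.
sample_html = '<script>window.cfg={"JWT_SECRET":"not-a-real-secret","x":1}</script>'
print(extract_jwt_secret(sample_html))
print(extract_jwt_secret("<html>no secret here</html>"))
```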

Then use the secret to generate a new Authorization token:

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999), # expiry date
  "sub": "find_a_realtor",
  "iat": int(time()) # issued at
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")
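
For readers curious about what the `encode` call produces, the following is a standard-library sketch of HS256 signing. It is not PyJWT's implementation, only an illustration of the token format: a base64url header and payload joined by a dot, followed by an HMAC-SHA256 signature over both. The secret is a placeholder, and the fixed timestamps are the ones from the token in the question.

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def hs256_token(payload: dict, secret: str) -> str:
    """Build header.payload.signature, signed with HMAC-SHA256."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret.encode("utf-8"),
                         signing_input.encode("ascii"),
                         hashlib.sha256).digest()
    return signing_input + "." + b64url(signature)


# Placeholder secret and fixed timestamps, for illustration only.
demo_token = hs256_token(
    {"exp": 1664525444, "sub": "find_a_realtor", "iat": 1664524796},
    "not-a-real-secret",
)
print(demo_token)
```

The signature segment depends on the secret, which is why the real secret from the page is needed. Also note that PyJWT 2.x returns the token as a str, while 1.x returned bytes, in which case the value would need `.decode()` before being concatenated with 'Bearer '.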

Add it to your headers, then run the request, like you did before:

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType={%22isBot%22:false,%22deviceType%22:%22desktop%22}&is_county_search=false&county=',
    headers=headers
)
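
As an alternative to hand-encoding the query string, the parameters can be kept in a dict, with the nested `seoUserType` value serialized via `json.dumps`. This is only a sketch: it shows a subset of the parameters, and whether the server accepts the fully percent-encoded form has not been checked. Note that the value in the question's `params` used single quotes and a quoted 'false', which is not valid JSON.

```python
import json
from urllib.parse import urlencode

# Serialize the nested value as compact JSON: double quotes, lowercase false.
seo_user_type = json.dumps({"isBot": False, "deviceType": "desktop"},
                           separators=(",", ":"))

# Subset of the query parameters from the question, for brevity.
params = {
    "nar_only": "1",
    "limit": "20",
    "marketing_area_cities": "FL_Miami",
    "seoUserType": seo_user_type,
}

# urlencode shows the resulting query string; requests applies similar
# percent-encoding when the dict is passed as params=params.
query_string = urlencode(params)
print(query_string)
```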

Putting it all together…

import requests
from jwt import encode
from time import time
from re import findall

# First we need to get their JWT Secret... which is securely hardcoded in the HTML
SECRET = findall(r'"JWT_SECRET":"(.*?)"', requests.get('https://www.realtor.com/realestateagents/').text)[0]

# Create JWT
jwt_payload = {
  "exp": int(time() + 9999),
  "sub": "find_a_realtor",
  "iat": int(time())
}

# Encode it with their secret
jwt = encode(jwt_payload, SECRET, algorithm="HS256")

# Add the JWT to the headers
headers = {
    'authorization': 'Bearer ' + jwt,
}

# Attach headers to the request
response = requests.get(
    url='https://www.realtor.com/realestateagents/api/v3/search?nar_only=1&offset=&limit=20&marketing_area_cities=FL_Miami&postal_code=&is_postal_search=true&name=&types=agent&sort=recent_activity_high&far_opt_out=false&client_id=FAR2.0&recommendations_count_min=&agent_rating_min=&languages=&agent_type=&price_min=&price_max=&designations=&photo=true&seoUserType={%22isBot%22:false,%22deviceType%22:%22desktop%22}&is_county_search=false&county=',
    headers=headers
)

# Print the JSON output
print(response.json())
Answered By: Xiddoc