Is there a way to scrape a page with XHR autoload?
Question:
There is this site with Telegram chats of neighbours in Moscow:
https://moscow.chatnovosela.ru/novostroyki
I need to scrape it and get links to every card on the site.
The trick is: cards are appended via XHR as the user reaches the bottom of the page, so plain HTTP requests can't get them all. Is there a way to load them all at once? I've done my research and found out that I can use Selenium for it somehow. Where do I start?
Answers:
I guess you need something like this (feel free to ask any questions; I don't know much about XHR, but this code can scrape the card URLs):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

main_url = "https://moscow.chatnovosela.ru/novostroyki"

# Selenium 4 removed executable_path; pass the driver path via a Service object
driver = webdriver.Chrome(service=Service("<DRIVERPATH>/chromedriver"))
driver.get(main_url)

# scroll to the bottom repeatedly to trigger the XHR autoload of more cards;
# stop once the page height no longer grows
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the XHR request time to finish
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")
list_items = soup.find_all("div", attrs={"class": "col-md-6 col-lg-4 col-xl-3 m-b-30"})

url_list = []
for x in range(len(list_items)):
    try:
        xpath = '//*[@id="showmore-list"]/div[' + str(x + 1) + ']/div/a'
        li_item = driver.find_element(By.XPATH, xpath).get_attribute("href")
        url_list.append({'url': li_item})
    except Exception as e:
        print(e)
        continue

print(url_list)
I've done my research and found out that I can use Selenium for it somehow
No need to use Selenium; it's overkill for this kind of task. Instead, you can use plain HTTP requests to emulate the "bottom of the page" load behaviour.
Just iterate over the pages in XHR requests and print the apartment URLs that are found:
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'referer': 'https://moscow.chatnovosela.ru/novostroyki',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/111.0.0.0 Safari/537.36',
}


def find_apartment_urls() -> None:
    page_number = 1
    with requests.Session() as sess:
        # get the root page once to obtain and save the cookies
        # needed for the other requests
        sess.get('https://moscow.chatnovosela.ru/novostroyki')
        while True:
            # emulate the XHR request the page sends on scroll
            resp = sess.post(
                'https://moscow.chatnovosela.ru/service.php',
                data={
                    'type': 'get_novostroyli_objects',  # (sic) a typo in the API itself
                    'page': page_number,
                    'city': 3,
                },
                headers=HEADERS,
            )
            # extract hrefs from the XHR response; can also be done with a regexp
            soup = BeautifulSoup(resp.text, 'lxml')
            apartment_urls = {x.get('href') for x in soup.find_all('a')}
            # print results; check whether the end has been reached
            if apartment_urls:
                print(f'Apartments found on page #{page_number}: '
                      f'{", ".join(apartment_urls)}')
                page_number += 1
            else:
                print('Search is finished.')  # no data == the last page was reached
                break


if __name__ == '__main__':
    find_apartment_urls()
Output:
Apartments found on page #1: https://moscow.chatnovosela.ru/object/lyublinskiy_park_2253, https://moscow.chatnovosela.ru/object/triniti, https://moscow.chatnovosela.ru/object/myakinino_park, https://moscow.chatnovosela.ru/object/kronshtadtskiy_9_2671, https://moscow.chatnovosela.ru/object/life_varshavskaya, https://moscow.chatnovosela.ru/object/d1, https://moscow.chatnovosela.ru/object/green_park_2428, https://moscow.chatnovosela.ru/object/wellton_towers, https://moscow.chatnovosela.ru/object/baltiyskiy, https://moscow.chatnovosela.ru/object/jazz, https://moscow.chatnovosela.ru/object/now_kvartal_na_naberezhnoy, https://moscow.chatnovosela.ru/object/dmitrovskiy_park_2889
Apartments found on page #2: https://moscow.chatnovosela.ru/object/sheremetevskiy, https://moscow.chatnovosela.ru/object/mihaylovskiy_park, https://moscow.chatnovosela.ru/object/stolichnye_polyany, https://moscow.chatnovosela.ru/object/volzhskiy_park_2554, https://moscow.chatnovosela.ru/object/aquatoria, https://moscow.chatnovosela.ru/object/bolshaya_ochakovskaya_2, https://moscow.chatnovosela.ru/object/river_park_3047, https://moscow.chatnovosela.ru/object/pervyy_moskovskiy, https://moscow.chatnovosela.ru/object/savelovskiy_siti_2064, https://moscow.chatnovosela.ru/object/seliger_siti, https://moscow.chatnovosela.ru/object/salarevo_park, https://moscow.chatnovosela.ru/object/lyubov_i_golubi
...
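As the comment in the code notes, the href extraction can also be done with a regular expression instead of BeautifulSoup. A minimal sketch, run against a hypothetical stand-in for an XHR response body (not the real API output):

```python
import re

# a tiny stand-in for an XHR response body (hypothetical markup)
html = (
    '<div><a href="https://moscow.chatnovosela.ru/object/jazz">Jazz</a>'
    '<a href="https://moscow.chatnovosela.ru/object/d1">D1</a></div>'
)

# capture every href attribute value; a set deduplicates repeated links
urls = set(re.findall(r'href="([^"]+)"', html))
print(sorted(urls))
```

This is faster than building a parse tree but more brittle: it assumes double-quoted attributes, so BeautifulSoup remains the safer choice for anything beyond simple link harvesting.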