Difficulties web scraping: the URL doesn't change on search or pagination, and searched items have no page of their own

Question:

I have first been thinking about a theoretical way to web scrape this page.

https://www.mercadopublico.cl/Home is the Chilean government's open procurement site, where you can apply to deliver services to the state.

(screenshot: Mercado Publico home page)

So I search for "camas" ("beds" in Spanish).
(screenshot: search results for "camas")

The first barrier I found is that the URL doesn't change at all with my search: https://www.mercadopublico.cl/Home/BusquedaLicitacion stays the same for any query.

(screenshot: the URL doesn't change)

The second barrier: the URL won't change either when I move to the next page, so I can't iterate over an array of page URLs as I would like to do.

The third barrier: most of the information I want is in a pop-up window opened from the main page (which itself doesn't change).

(screenshot: pop-up window)

There, the information can be downloaded as CSV or JSON, or scraped directly from the pop-up window.

But so far I have no solution for the fact that the URL doesn't change when I change the search or the page, so I couldn't think any further because I can't get that first part done.

I think scraping the pop-up would be the easier part, because at that point I already have a URL (the pop-up window does have a different URL!).

If you know how, or if I need another methodology (so far I have only been using BS4), please let me know which direction I should take.

Here is my first error, which I don't know how to solve with the usual code. If you help me with that I can go further: changing the URL to build the array of URLs, because I can't use the range method.

 # -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'

# problem here: I can't paginate, because the page loads results via AJAX
params = {
    'page': 0,
    'page1': 40,
}

results = []

for offset in range(0, 121, 40):  # this approach doesn't work on an AJAX-driven page

    params['start'] = offset

    response = requests.get(url, params=params)
    print('url:', response.url)
    #print('status:', response.status_code)
                    
    soup = bs(response.text, "html.parser")

    all_products = soup.find_all('div', {'class': 'product-tile'})

    for product in all_products:
        itemid = product.get('data-itemid') 
        print('itemid:', itemid)

        data = product.get('data-product') 
        print('data:', data)
        
        name = product.find('span', {'itemprop': 'name'}).text
        print('name:', name)
        
        all_prices = product.find_all('div', {'class': 'price__text'})
        print('len(all_prices):', len(all_prices))
        
        price = all_prices[0].get('aria-label')
        print('price:', price)
        
        results.append( (itemid, name, price, data) )
        print('results')

# ---

# ... here you can save all `results` in a file ...
import pandas as pd
df = pd.DataFrame(data=results, columns=['itemid', 'name', 'price', 'data'])
df.to_excel('results.xlsx', index=False)  # write to an Excel file

So right now I was trying to get the URLs with this modified code:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.common.by import By

# set chromedriver.exe path (raw string, so backslashes aren't treated as escapes)
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
# implicit wait
driver.implicitly_wait(0.5)
# maximize browser
driver.maximize_window()
# launch URL
driver.get('https://www.mercadopublico.cl/Home/BusquedaLicitacion')
# identify element
l = driver.find_element(By.XPATH, "//button[text()='Check it Now']")
# perform click
driver.execute_script("arguments[0].click();", l)

    
url = 'https://www.mercadopublico.cl/Home/BusquedaLicitacion'

response = requests.get(url)
print('url:', response.url)
#print('status:', response.status_code)

soup = bs(response.text, "html.parser")

all_products = soup.find_all('a', {'href': '#'})

for product in all_products:
    itemurl = product.get('onclick')
    print('itemurl:', itemurl)  # up to here

#close browser
driver.quit()

but I didn't get anything printed; I'm not sure what failed.

Asked By: kcomarks


Answers:

The URL doesn't change because the page sends the search query in a POST request.

POST https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar

And the request data is:

{
  "textoBusqueda":"camas",
  "idEstado":"5",
  "codigoRegion":"-1",
  "idTipoLicitacion":"-1",
  "fechaInicio":null,
  "fechaFin":null,
  "registrosPorPagina":"10",
  "idTipoFecha":[],
  "idOrden":"1",
  "compradores":[],
  "garantias":null,
  "rubros":[],
  "proveedores":[],
  "montoEstimadoTipo":[0],
  "esPublicoMontoEstimado":null,
  "pagina":0
}
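Since the search is a plain POST, pagination works by re-sending the same payload with an incremented pagina field. A minimal sketch, using the payload keys from the captured request above (the helper names and the page range are my own, and the trimmed payload is an assumption; the server may require the full field set):

```python
import requests

SEARCH_URL = "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar"

def build_search_payload(query, page, per_page="10"):
    """Same payload as the captured request; only `pagina` varies per page."""
    return {
        "textoBusqueda": query,
        "idEstado": "5",
        "codigoRegion": "-1",
        "idTipoLicitacion": "-1",
        "registrosPorPagina": per_page,
        "idOrden": "1",
        "montoEstimadoTipo": [0],
        "pagina": page,
    }

def fetch_pages(query, max_pages=3):
    """Visit the home page once for cookies, then re-POST with pagina=0,1,2,..."""
    pages = []
    with requests.Session() as s:
        s.get("https://www.mercadopublico.cl/Home")  # pick up session cookies
        for page in range(max_pages):
            res = s.post(SEARCH_URL, data=build_search_payload(query, page))
            if res.status_code != 200:
                break
            pages.append(res.text)
    return pages
```

This sidesteps the "URL never changes" problem entirely: the URL really is constant, and only the POSTed pagina value moves you through the results.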

There is also a cookie that may be needed, __RequestVerificationToken_L0hvbWU1; a requests.Session that first visits the home page will pick it up automatically.

Then you can get the link to the pop-up from the HTML; it's inside the onclick attribute of the link.
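A small sketch of pulling that URL out of the onclick value with a regex, assuming the $.Busqueda.verFicha('...') pattern seen in the page's markup (the function name here is my own):

```python
import re

# matches the quoted URL inside $.Busqueda.verFicha('<url>')
ONCLICK_RE = re.compile(r"verFicha\('([^']+)'\)")

def popup_url(onclick):
    """Return the URL embedded in an onclick handler, or None if absent."""
    m = ONCLICK_RE.search(onclick or "")
    return m.group(1) if m else None
```

A regex is slightly more robust than chained string replacements if the site ever adds whitespace or extra arguments around the URL.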

If you need more help, just ask in the comment section.

Python Example:
I've currently got it working up to the final step. When I looked at the CSV and JSON files, I realized they are both invalid: the site appends some HTML at the bottom of both.
I would recommend just scraping the data from the last page rather than downloading the CSV/JSON.
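If you still want the downloaded JSON despite the trailing HTML, Python's json.JSONDecoder.raw_decode can recover the valid prefix, since it parses one JSON value and ignores whatever follows. A sketch (the exact layout of the downloaded file is an assumption):

```python
import json

def parse_json_prefix(text):
    """Parse the leading JSON value and ignore trailing junk (e.g. appended HTML)."""
    obj, _end = json.JSONDecoder().raw_decode(text.lstrip())
    return obj
```

The CSV case is harder to salvage generically, which is another reason scraping the page itself is the safer route.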

import requests
from bs4 import BeautifulSoup


def get_headers(session):
    # visit the home page so the session picks up the required cookies
    res = session.get("https://www.mercadopublico.cl/Home")
    if res.status_code == 200:
        print("Got headers")
    else:
        print("Failed to get headers")



def search(session):
    data = {
        "textoBusqueda": "Camas",
        "idEstado": "5",
        "codigoRegion": "-1",
        "idTipoLicitacion": "-1",
        "fechaInicio": None,
        "fechaFin": None,
        "registrosPorPagina": "10",
        "idTipoFecha": [],
        "idOrden": "1",
        "compradores": [],
        "garantias": None,
        "rubros": [],
        "proveedores": [],
        "montoEstimadoTipo": [0],
        "esPublicoMontoEstimado": None,
        "pagina": 0
    }
    res = session.post(
        "https://www.mercadopublico.cl/BuscarLicitacion/Home/Buscar",
        data=data)
    if res.status_code == 200:
        print("Search succeeded")
        return res.text
    else:
        print("Search failed with error:", res.reason)



def get_popup_link(html):
    soup = BeautifulSoup(html, "html.parser")
    dirty_links = [link["onclick"] for link in soup.select(".lic-block-body a")]
    # clean onclick links
    clean_links = [link.replace("$.Busqueda.verFicha('", "").replace("')", "") for link in dirty_links]
    return clean_links


def get_download_html(s, links):
    # note: returns after the first successful fetch, so only the first pop-up page is used
    for link in links:
        res = s.get(link)
        if res.status_code == 200:
            print("fetch succeeded")
            return res.text
        else:
            print("fetch failed with error:", res.reason)

def main():
    with requests.Session() as s:
        get_headers(s)
        html = search(s)
        popup_links = get_popup_link(html)
        print(popup_links)
        download_html = get_download_html(s, popup_links)
        # print(download_html)

if __name__ == "__main__":
    main()

Answered By: Invizi