Web scraping with Python for a web page that has a "Mehr Anzeigen" (English: "Show more") button

Question:

I have been trying to scrape a web page and get a few details into an Excel or CSV file, but I am unable to get everything because the page has a "Mehr Anzeigen" button, which is German for "Show more".

URL: https://www.gelbeseiten.de/suche/architekturb%c3%bcros/aachen?umkreis=21000

From the above ``URL`` I would like to extract:

<h2> elements with class 'Title',

<address> elements with class 'mod-AdresseKompakt',

<address> elements with class 'nbr',

and so on.

Pretty much, I would like to load everything automatically (clicking "Show more" 30 times by hand is tedious) and extract all the details from the fully loaded page.

I have read some of the available threads on Stack Overflow and a few blogs, but each solution is specific to a different website.

Any help would be great!!

Python: I know Python to some extent, but I am a noob in HTML and JS.

from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException


path_to_chromedriver = '/Users/kuk/Desktop/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)

url = 'https://www.gelbeseiten.de/suche/architekturb%c3%bcros/aachen?umkreis=21000'
browser.get(url)


h2 = browser.find_elements(By.TAG_NAME, 'h2')
for item in h2:
    print(item.text)
Asked By: Kuladeep


Answers:

I have a function (linkToSoup_selenium) that can click through the button a set number of times and then scrape the page:

import pandas # for saving as a table
# from linkToSoup_selenium import * ## OR PASTE HERE

cfList = (
    ['//div[@id="cmpbox"]//span[@id="cmpbntyestxt"]'] # "Akzeptieren" - for cookies, I think
    + ['//a[@id="mod-LoadMore--button"]']*30 # click LoadMore 30x
)

soup = linkToSoup_selenium(
     'https://www.gelbeseiten.de/suche/architekturb%c3%bcros/aachen?umkreis=21000'
    , ecx='//article[327]' # wait for listing #327 to load
    , clickFirst=cfList  # cookies + 30xLoadMore
    , strictMode=False # False by default, but do NOT set it to True
)

(You can either save the function as a file and import it, or paste it at the beginning of your code.) Through the clickFirst parameter [a list of XPaths of elements to click], add the XPath of "Mehr Anzeigen" as many times as you want it clicked. It is better to overestimate: if a click fails, the function only prints an error message, and because each click is inside an isolated try block, the rest of the program continues. That is why strictMode=False is important here.


And then, to get the details, you can define a function like the one below (or copy the full version):

def getListingDetails(lSoup, refDict):
    detList = {}
    for k, (sel, attr, altr) in refDict.items():
        s = lSoup.select_one(sel)
        if s is None:
            detVal = None
        elif attr == '':
            detVal = s.get_text(' ', strip=True)
        else:
            detVal = s.get(attr)

        if detVal is None or not isinstance(altr, list):
            altr = []
        for a, v in altr:
            if a == 'word1':  # keep only the part before v
                detVal = detVal.split(v)[0]
            elif a == 'replace':  # strip v[0], at most v[1] times
                detVal = detVal.replace(v[0], '', v[1])

        detList[k] = detVal
    return detList

as well as a dictionary of selectors for each detail

selRef = {
    'Title': ('h2[data-wipe-name="Titel"]', '', ''),
    'Branch': ('p.mod-Treffer--besteBranche', '', ''),
    'Address': ('p[data-wipe-name="Adresse"]', '', ''),
    'Contact': ('p[data-wipe-name="Kontaktdaten"]', '', ''),
    'Website': ('a.contains-icon-homepage[href]', 'href', ''),
    'Email': ('a.contains-icon-email[href^="mailto:"]', 'href', [
        ('replace', ('mailto:', 1)), ('word1', '?')
    ]),
    'DetailsPage': ('a.contains-icon-details[href]', 'href', '')
}
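To see what the altr steps in the Email entry do, here is the same two-step cleanup applied to a hypothetical mailto href (the address is made up for illustration):

```python
# hypothetical raw href, as it might appear in a listing's mailto link
href = "mailto:info@example-buero.de?subject=Anfrage"

# ('replace', ('mailto:', 1)) -> strip the scheme, at most once
val = href.replace("mailto:", "", 1)

# ('word1', '?') -> keep only the part before the query string
val = val.split("?")[0]

print(val)  # info@example-buero.de
```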

Then, you can simply use a list comprehension with getListingDetails [from above] and select, and save with pandas:

if soup:
    lDets = [
        getListingDetails(a, selRef)
        for a in soup.select('article[id^="treffer_"]')
    ]

    pandas.DataFrame(lDets).to_csv('listingDetails.csv', index=False) # save

(lDets is a list of dictionaries in the same format as selRef but with details of the ads instead of selectors.)
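If you would rather not depend on pandas, the same list of dicts can be written with the standard-library csv module. A sketch, using a made-up two-row lDets in the same shape as the real output:

```python
import csv

lDets = [  # made-up sample rows; the real lDets comes from getListingDetails
    {'Title': 'Architekturbüro A', 'Address': 'Musterstr. 1, Aachen'},
    {'Title': 'Architekturbüro B', 'Address': None},
]

with open('listingDetails.csv', 'w', newline='', encoding='utf-8') as f:
    # take the column order from the first row's keys
    writer = csv.DictWriter(f, fieldnames=list(lDets[0].keys()))
    writer.writeheader()
    writer.writerows(lDets)  # None values are written as empty cells
```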

The resulting CSV looks like the "Output CSV" screenshot linked in the original answer.



[EDIT] Selenium without BeautifulSoup

First, clear the cookie popup, then repeatedly click "load more", without using the helper function:

ac_xpath = '//div[@id="cmpbox"]//span[@id="cmpbntyestxt"]'
WebDriverWait(browser, 25).until(EC.visibility_of_all_elements_located((By.XPATH, ac_xpath)))
browser.find_element(By.XPATH, ac_xpath).click()

loadMore_xpath = '//a[@id="mod-LoadMore--button"]'
loadMore_maxClicks = 50
for lm_clickCt in range(loadMore_maxClicks):
    print('', end=f'\rClicked "Mehr Anzeigen" {lm_clickCt} times')

    WebDriverWait(browser, 25).until(EC.visibility_of_all_elements_located((By.XPATH, loadMore_xpath)))
    loadMore_btn = browser.find_elements(By.XPATH, loadMore_xpath)
    if not loadMore_btn: break  # button is gone - nothing left to load
    browser.execute_script("arguments[0].scrollIntoView(false);", loadMore_btn[0])
    loadMore_btn[0].click()
print('')

It will stop trying to load more once the button disappears, or after clicking a maximum number of times (50 as written). If you don’t want a maximum, use while True instead of for lm_clickCt in range(loadMore_maxClicks); however, the button sometimes stops working for me (even when I’m using the browser directly), and I didn’t want the program to hang in an infinite loop.

For extracting details, selRef can remain as is but getListingDetails will need some adjustments:

# def getListingDetails(lSoup, refDict): # becomes 
def getListingDetails(listingEl, refDict):
        # s = lSoup.select_one(sel) # becomes 
        s = listingEl.find_elements(By.CSS_SELECTOR, sel)
            # detVal = s.get_text(' ', strip=True) # becomes 
            detVal = s[0].get_attribute('innerText').strip()
            # detVal = s.get(attr) # becomes 
            detVal = s[0].get_attribute(attr)

and creating lDets will also change a little:

lDets = [
    getListingDetails(a, selRef) for a in
    browser.find_elements(By.CSS_SELECTOR, 'article[id^="treffer_"]')
]
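Putting those adjustments together, the full Selenium version of getListingDetails might look like the sketch below. The try/except fallback is only there so the snippet can be pasted and run even where selenium is not installed (By.CSS_SELECTOR is just the string "css selector"); with selenium present it behaves identically:

```python
try:
    from selenium.webdriver.common.by import By
except ImportError:  # fallback so the sketch runs without selenium installed
    class By:
        CSS_SELECTOR = "css selector"

def getListingDetails(listingEl, refDict):
    detList = {}
    for k, (sel, attr, altr) in refDict.items():
        s = listingEl.find_elements(By.CSS_SELECTOR, sel)
        if not s:
            detVal = None
        elif attr == '':
            detVal = s[0].get_attribute('innerText').strip()
        else:
            detVal = s[0].get_attribute(attr)

        if detVal is None or not isinstance(altr, list):
            altr = []
        for a, v in altr:
            if a == 'word1':  # keep only the part before v
                detVal = detVal.split(v)[0]
            elif a == 'replace':  # strip v[0], at most v[1] times
                detVal = detVal.replace(v[0], '', v[1])

        detList[k] = detVal
    return detList
```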
Answered By: Driftr95