Scraping a webpage with a 'show more' button

Question:

I want to scrape Google Scholar pages that have a ‘show more’ button. Using help from this platform on a previous question I had asked, I wrote the following code so that the ‘show more’ button is clicked. However, I am still having a problem: for profiles with several ‘show more’ buttons, only the first one gets clicked. I don't understand why this happens. I would appreciate any help.

from selenium import webdriver
import time
from bs4 import BeautifulSoup
import pandas as pd

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
chrome_path = r"C:Usersish05Desktoppythonchromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.get("https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en")
time.sleep(3)
show_more = driver.find_elements_by_tag_name('button')
for x in range(len(show_more)):
    if show_more[x].is_displayed():
        driver.execute_script("arguments[0].click();", show_more[x])
        time.sleep(3)

Answers:

First of all, I see there are 19 elements with the tag name button on that page, while only one of them is the Show more button. It can be located by the following XPath: //button[.//*[contains(text(),'Show more')]]
So only clicking this element will click Show more; clicking the other buttons will perform other actions, and some of the button elements there are not clickable at all.
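As a minimal sketch, reusing the driver and the click pattern from the question's code (same Selenium 3 API), clicking only that element would look like:

show_more = driver.find_element_by_xpath("//button[.//*[contains(text(),'Show more')]]")
driver.execute_script("arguments[0].click();", show_more)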

Answered By: Prophet

The reason it clicks only once is that only one Show more button is present on the page at a time.

You need to use an infinite loop: search the page for the button, click it if it is there, and break out of the loop once there is no more button.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time
chrome_path = r"C:\Users\ish05\Desktop\python\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)

driver.get("https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en")
time.sleep(3)
while True:
    try:
        show_more = driver.find_element_by_xpath("//button[.//span[text()='Show more'] and not(@disabled)]")
        driver.execute_script("arguments[0].click();", show_more)
        print("Show more button clicked")
        time.sleep(2)
    except NoSuchElementException:
        print("No more Show more button")
        break

You will see the following output on the console:

Show more button clicked
Show more button clicked
Show more button clicked
Show more button clicked
Show more button clicked
No more Show more button
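As a side note, the same loop can be written with Selenium's explicit waits instead of fixed sleeps. This is only a sketch using the standard WebDriverWait/expected_conditions helpers; the 5-second timeout is an arbitrary choice:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

while True:
    try:
        # wait up to 5 seconds for a clickable, enabled Show more button
        show_more = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, "//button[.//span[text()='Show more'] and not(@disabled)]"))
        )
        driver.execute_script("arguments[0].click();", show_more)
    except TimeoutException:
        break  # no more Show more button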
Answered By: KunduK

You can achieve it without browser automation by utilizing the pagesize and cstart URL parameters, which stand for the number of articles per page and the index of the first article on the page, respectively.
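For example, requesting the second hundred articles for this profile would look like this (same URL pattern as the example link in the code comments below):

https://scholar.google.com/citations?user=cp-8uaAAAAAJ&hl=en&cstart=100&pagesize=100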

To detect the last page and exit the while loop, you can check whether a selector is present that contains the text "There are no articles in this profile."


Code and full example in the online IDE:

import pandas as pd
from bs4 import BeautifulSoup
import requests, lxml


def bs4_scrape_articles():
    params = {
        "user": "cp-8uaAAAAAJ",       # user-id
        "hl": "en",                   # language
        "gl": "us",                   # country to search from
        "cstart": 0,                  # articles page. 0 is the first page
        "pagesize": "100"             # articles per page
    }

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
    }

    all_articles = []

    articles_is_present = True

    while articles_is_present:
        html = requests.post("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
        soup = BeautifulSoup(html.text, "lxml")

        for index, article in enumerate(soup.select("#gsc_a_b .gsc_a_t"), start=1):
            article_title = article.select_one(".gsc_a_at").text
            article_link = f'https://scholar.google.com{article.select_one(".gsc_a_at")["href"]}'
            article_authors = article.select_one(".gsc_a_at+ .gs_gray").text
            article_publication = article.select_one(".gs_gray+ .gs_gray").text

            print(f"article #{int(params['cstart']) + index}",
                  article_title,
                  article_link,
                  article_authors,
                  article_publication, sep="\n")

            all_articles.append({
                "title": article_title,
                "link": article_link,
                "authors": article_authors,
                "publication": article_publication
            })

        # this selector is checking for the .class that contains: "There are no articles in this profile."
        # example link: https://scholar.google.com/citations?user=VjJm3zYAAAAJ&hl=en&cstart=500&pagesize=100
        if soup.select_one(".gsc_a_e"):
            articles_is_present = False
        else:
            params["cstart"] += 100  # paginate to the next page
    
    # save to CSV
    # pd.DataFrame(data=all_articles).to_csv(f"google_scholar_{params['user']}_articles.csv", encoding="utf-8", index=False)


bs4_scrape_articles()


# part of the output:
'''
article #1
The skill content of recent technological change: An empirical exploration
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cp-8uaAAAAAJ&pagesize=100&citation_for_view=cp-8uaAAAAAJ:TFP_iSt0sucC
DH Autor, F Levy, RJ Murnane
The Quarterly Journal of Economics 118 (4), 1279, 2003

article #442
Preliminary and Incomplete November 3, 1999 Comments Appreciated Not for Quotation
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cp-8uaAAAAAJ&cstart=400&pagesize=100&citation_for_view=cp-8uaAAAAAJ:Zph67rFs4hoC
F Levy, A Beamish, RJ Murnane, D Autor
'''

Alternatively, you can achieve the same thing by using Google Scholar Author API from SerpApi. It’s a paid API with a free plan.

The difference is that you don't have to write the parser from scratch, maintain it over time, figure out how to scale it, or work out how to bypass blocks from Google.

Code to integrate:

import os
import pandas as pd
from urllib.parse import urlsplit, parse_qsl
from serpapi import GoogleScholarSearch


def serpapi_scrape_articles():
    params = {
        "api_key": os.getenv("API_KEY"),   # SerpApi API Key
        "engine": "google_scholar_author",
        "hl": "en",
        "author_id": "cp-8uaAAAAAJ",
        "start": "0",                      # page number
        "num": "100"                       # articles per page
    }

    search = GoogleScholarSearch(params)

    all_articles = []

    articles_is_present = True

    while articles_is_present:
        results = search.get_dict()

        for index, article in enumerate(results["articles"], start=1):
            title = article["title"]
            link = article["link"]
            authors = article["authors"]
            publication = article.get("publication")
            citation_id = article["citation_id"]

            print(f"article #{int(params['start']) + index}",
                  title,
                  link,
                  authors,
                  publication,
                  citation_id, sep="\n")

            all_articles.append({
                "title": title,
                "link": link,
                "authors": authors,
                "publication": publication,
                "citation_id": citation_id
            })

        if "next" in results.get("serpapi_pagination", []):
            # split URL in parts as a dict() and update "search" variable to a new page
            search.params_dict.update(dict(parse_qsl(urlsplit(results["serpapi_pagination"]["next"]).query)))
        else:
            articles_is_present = False

    # pd.DataFrame(data=all_articles).to_csv(f"serpapi_google_scholar_{params['author_id']}_articles.csv", encoding="utf-8", index=False)


serpapi_scrape_articles()

# part of the output:
'''
article #1
The skill content of recent technological change: An empirical exploration
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cp-8uaAAAAAJ&pagesize=100&citation_for_view=cp-8uaAAAAAJ:TFP_iSt0sucC
DH Autor, F Levy, RJ Murnane
The Quarterly Journal of Economics 118 (4), 1279, 2003
cp-8uaAAAAAJ:TFP_iSt0sucC

article #442
Preliminary and Incomplete November 3, 1999 Comments Appreciated Not for Quotation
https://scholar.google.com/citations?view_op=view_citation&hl=en&user=cp-8uaAAAAAJ&cstart=400&pagesize=100&citation_for_view=cp-8uaAAAAAJ:Zph67rFs4hoC
F Levy, A Beamish, RJ Murnane, D Autor
None
cp-8uaAAAAAJ:Zph67rFs4hoC
'''

If you want to scrape all organic results, there is a dedicated blog post of mine at SerpApi: Scrape historic Google Scholar results to CSV, SQLite using Python.

Disclaimer: I work for SerpApi.

Answered By: Dmitriy Zub