Scraping More than Rendered Data with Beautiful Soup

Question:

I’m scraping app names from the Google Play Store, and for each URL I pass as input I get only 60 apps (because the site renders only 60 apps until the user scrolls down). How does this work, and how can I scrape all the apps from a page using BeautifulSoup and/or Selenium?

Thank you

Here is my code:

from requests import get
from bs4 import BeautifulSoup

urls = []

urls.extend(["https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid"])

for url in urls:
    response = get(url)
    html_soup = BeautifulSoup(response.text, 'html.parser')
    # Each app card carries its package id in the data-docid attribute
    app_container = html_soup.find_all('div', class_="card no-rationale square-cover apps small")
    file = open("./InputFiles/applications.txt", "w+")
    for i in range(0, len(app_container)):
        #print(app_container[i].div['data-docid'])
        file.write(app_container[i].div['data-docid'] + "\n")

    file.close()
num_lines = sum(1 for line in open('./InputFiles/applications.txt'))
print("Applications : " + str(num_lines))
Asked By: userHG


Answers:

In this case you need to use Selenium. I tried it for you and got all the apps. I will try to explain; I hope it will be clear.

Selenium is more powerful here than a plain request because it drives a real browser that executes the page’s JavaScript. I used ChromeDriver, so if you haven’t installed it yet, you can get it from

http://chromedriver.chromium.org/

from time import sleep
from selenium import webdriver


options = webdriver.ChromeOptions()
driver = webdriver.Chrome(chrome_options=options,
                          executable_path=r'This part is your Driver path')
driver.get('https://play.google.com/store/apps/category/NEWS_AND_MAGAZINES/collection/topselling_paid')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
sleep(5)  # give the page time to load the next batch; without this delay the script grabs only the first 60 elements
x = driver.find_elements_by_css_selector("div[class='card-content id-track-click id-track-impression']")  # the class that wraps each app card

for a in x:
    print(a.text)
driver.close()

OUTPUT:

1. Pocket Casts
Podcast Media LLC
₺24,99
2. Broadcastify Police Scanner Pro
RadioReference.com LLC
₺18,99
3. Relay for reddit (Pro)
DBrady
₺8,00
4. Sync for reddit (Pro)
Red Apps LTD
₺15,00
5. reddit is fun golden platinum (unofficial)
TalkLittle
₺9,99
... **UP TO 75**

Note:

Don’t mind the prices; they are in my country’s currency, so they will appear in yours.
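
A single window.scrollTo call like the one above loads only one extra batch. As a hedged sketch (reusing the driver from the code above; the helper name is mine, not from the original answer), you can keep scrolling until document.body.scrollHeight stops growing, which means no more lazy-loaded batches are arriving:

from time import sleep

def scroll_to_end(driver, pause=2):
    # Hypothetical helper: scroll repeatedly until the page height stabilizes
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)  # give the newly requested batch time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded, so stop
            break
        last_height = new_height

Calling scroll_to_end(driver) right after driver.get(...) and before find_elements_by_css_selector should collect every card the page will render.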

UPDATE ACCORDING TO YOUR COMMENT:

The same data-docid is also in a span tag. You can get it using get_attribute. Just add the code below to your project.

y = driver.find_elements_by_css_selector("span[class=preview-overlay-container]")

for b in y:
    print(b.get_attribute('data-docid'))

OUTPUT

au.com.shiftyjelly.pocketcasts
com.radioreference.broadcastifyPro
reddit.news
com.laurencedawson.reddit_sync.pro
com.andrewshu.android.redditdonation
com.finazzi.distquakenoads
com.twitpane.premium
org.fivefilters.kindleit
.... UP TO 75
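
To mirror the file output of your original script, a minimal sketch (assuming the y list from the snippet above is still in scope) could write the ids instead of printing them:

# Write each package id on its own line, as the original script intended
with open("./InputFiles/applications.txt", "w") as file:
    for b in y:
        file.write(b.get_attribute('data-docid') + "\n")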
Answered By: Omer Tekbiyik

Google Play has recently changed its user interface, its link structure, and the way it displays information. I recently wrote a Scrape Google Play Search Apps in Python blog post where I described the whole process in detail with more data.

In order to access all the apps, you need to scroll to the bottom of the page. After that, you can start extracting the app names and writing them to a file. The extraction selectors have also changed; the code below uses the new ones.

Code and a full example in an online IDE:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

urls = []

urls.extend(["https://play.google.com/store/apps?device=phone&hl=en_GB&gl=US"])

service = Service(ChromeDriverManager().install())

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--lang=en")
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(service=service, options=options)

for url in urls:
    driver.get(url)

    # Scroll until the "Show more" button (.snByac) appears, then click it once;
    # each scroll loads another batch of apps.
    while True:
        try:
            driver.execute_script("document.querySelector('.snByac').click();")
            time.sleep(2)
            break
        except:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)

    soup = BeautifulSoup(driver.page_source, "lxml")

    with open("applications.txt", "w+") as file:
        for result in soup.select(".Epkrse"):
            file.write(result.text + "\n")

    num_lines = sum(1 for line in open("applications.txt"))
    print("Applications : " + str(num_lines))

driver.quit()

Output:

Applications : 329

Also, you can use the Google Play Apps Store API from SerpApi. It bypasses blocks from search engines, and you don’t have to create the parser from scratch and maintain it.

Code example:

from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    'api_key': os.getenv('API_KEY'),    # your serpapi api
    'engine': 'google_play',            # SerpApi search engine
    'store': 'apps'                     # Google Play Apps
}

data = []

while True:
    search = GoogleSearch(params)           # data extraction happens on the SerpApi backend
    result_dict = search.get_dict()         # JSON -> Python dict

    if result_dict.get('organic_results'):
        for result in result_dict.get('organic_results'):
            for item in result['items']:
                data.append(item['title'])

        # Stop when there is no next page token instead of raising a KeyError
        next_page_token = result_dict.get('serpapi_pagination', {}).get('next_page_token')
        if not next_page_token:
            break
        params['next_page_token'] = next_page_token
    else:
        break

with open('applications.txt', 'w+') as file:
    for app in data:
        file.write(app + "\n")
        
num_lines = sum(1 for line in open('applications.txt'))
print('Applications : ' + str(num_lines))

The output will be the same.

Answered By: Artur Chukhrai