Scrape the app name and "Visit website" URL from the Google Play Store

Question:

I have written the code below to scrape the app name and the "Visit website" URL from a Google Play Store page.

ASOS – get "ASOS" (line 1120 of the page source)

Visit website – get http://www.asos.com from the q= parameter (line 1121 of the page source)

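import requests
from bs4 import BeautifulSoup

url = 'https://play.google.com/store/apps/details?id=com.asos.app'
r = requests.get(url)

count = 0  # index of the current line in the page source
for line in r.iter_lines():
    if count == 1120:
        # line holding the app name
        soup = BeautifulSoup(line, 'html.parser')
        for row in soup.findAll('a'):
            u = row.find('span')
            if u is not None:
                print(u.string)
    elif count == 1121:
        # line holding the "Visit website" link
        soup = BeautifulSoup(line, 'html.parser')
        for row in soup.findAll('a'):
            print(row.get('href'))
    count = count + 1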

I can’t seem to print the HTML here. Please open edits for that. But please help me here!

Asked By: Blabber


Answers:

BeautifulSoup provides many functions that you should be taking advantage of.

For starters, your script can be cut down to the following:

import requests
from bs4 import BeautifulSoup

url = 'https://play.google.com/store/apps/details?id=com.asos.app'
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

for a in soup.find_all('a', {'class': 'dev-link'}):
    print("Found the URL:", a['href'])

BS4 can parse the raw HTML content, and you can iterate over the matching elements it returns. In this scenario, you want the href of every link with the class name dev-link. Doing so gets you the following output:

Found the URL: https://www.google.com/url?q=http://www.asos.com&sa=D&usg=AFQjCNGl4lHIgnhUR3y414Q8idAzJvASqw
Found the URL: mailto:[email protected]
Found the URL: https://www.google.com/url?q=http://www.asos.com/infopages/pgeprivacy.aspx&sa=D&usg=AFQjCNH-hW1H0fYlsCjp4ERbVh29epqaXA

I’m sure you can tweak it a bit more to get the results you want but please refer to BS4 for more information ==> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
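
As one possible tweak, the first and third links above are Google redirect URLs that carry the real address in their q parameter. A minimal sketch of unwrapping them with the standard library follows. The helper name extract_target is made up for this example, and the hrefs are copied from the output above rather than fetched live.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(href):
    """Return the q parameter of a redirect URL, or the href unchanged if it has none."""
    query = parse_qs(urlsplit(href).query)
    return query.get("q", [href])[0]


# hrefs copied from the output above (not fetched live)
hrefs = [
    "https://www.google.com/url?q=http://www.asos.com&sa=D&usg=AFQjCNGl4lHIgnhUR3y414Q8idAzJvASqw",
    "https://www.google.com/url?q=http://www.asos.com/infopages/pgeprivacy.aspx&sa=D&usg=AFQjCNH-hW1H0fYlsCjp4ERbVh29epqaXA",
]

for href in hrefs:
    print(extract_target(href))
# http://www.asos.com
# http://www.asos.com/infopages/pgeprivacy.aspx
```

Links without a q parameter, such as the mailto: one, come back unchanged.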

Answered By: Carlos

The Google Play Store has been redesigned: the page is now dynamic, and the data we need is stored as inline JSON.

You can still use a browser automation tool such as Selenium or Playwright to parse it. However, in this case BeautifulSoup and a regular expression are enough to extract pretty much everything from the app page.

First, pick the relevant <script> element out of all the <script> elements in the HTML and turn its contents into a dict with json.loads():

basic_app_info = json.loads(re.findall(r'<script nonce="\w+" type="application/ld\+json">({.*?)</script>', str(soup.select("script")[11]), re.DOTALL)[0])

After that, we can read the fields we need from the resulting dict:

app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["url"] = basic_app_info.get("url")
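
To see the extraction step in isolation, here is a small sketch that runs the same idea against a hard-coded string instead of the live page. The sample_html string is a made-up, heavily shortened stand-in for the real page source, and the helper name extract_ld_json is hypothetical. The pattern is also looser than the one above: it matches any application/ld+json script rather than relying on the script at index 11, on the assumption that the app details live in such a block.

```python
import json
import re

# Made-up, heavily shortened stand-in for the page source (assumption).
sample_html = (
    '<script nonce="abc123" type="application/ld+json">'
    '{"@type": "SoftwareApplication", "name": "ASOS", '
    '"url": "https://play.google.com/store/apps/details/ASOS?id=com.asos.app"}'
    '</script>'
)

# Match the body of every <script> whose type is application/ld+json.
LD_JSON = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_ld_json(html):
    """Return every JSON-LD block found in html as a list of parsed objects."""
    return [json.loads(block) for block in LD_JSON.findall(html)]


blocks = extract_ld_json(sample_html)
basic_app_info = blocks[0]
print(basic_app_info.get("name"))
print(basic_app_info.get("url"))
```

Using .get() rather than indexing means a missing field yields None instead of raising KeyError.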

Don’t forget to send a user-agent header with the request, so that the site treats it as coming from a regular browser and returns the page content.

from bs4 import BeautifulSoup
import requests, re, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "id": "com.asos.app",          # app package ID
    "gl": "US",                    # country of the search
    "hl": "en_GB"                  # language of the search
}

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

app_data = {
  "basic_info":{}
}
                     
# the <script> element at index 11 holds the basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r'<script nonce="\w+" type="application/ld\+json">({.*?)</script>', str(soup.select("script")[11]), re.DOTALL)[0])
    
app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["url"] = basic_app_info.get("url")
print(json.dumps(app_data, indent=2))

Example output:

{
  "basic_info": {
    "name": "ASOS",
    "url": "https://play.google.com/store/apps/details/ASOS?id=com.asos.app&hl=en_GB&gl=US"
  }
}

You can also use the Google Play Store API from SerpApi. It’s a paid API with a free plan.
The difference is that it bypasses blocks from Google (including CAPTCHA), so there is no need to create and maintain a parser.

SerpApi code example:

from serpapi import GoogleSearch
from urllib.parse import (parse_qsl, urlsplit)
import os, json

params = {
    "api_key": os.getenv("API_KEY"),         # your serpapi api key
    "engine": "google_play_product",         # parsing engine
    "store": "apps",                         # app page
    "gl": "us",                              # country of the search
    "product_id": "com.asos.app",            # app package ID
    "all_reviews": "true"                    # shows all reviews
}

search = GoogleSearch(params)                # where data extraction happens
results = search.get_dict()
app_name = results['product_info']['authors'][0]['name']
app_url = results['product_info']['authors'][0]['link']
print(app_name, app_url, sep='\n')

Output:

ASOS
https://play.google.com/store/apps/developer?id=ASOS
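
Chained lookups like results['product_info']['authors'][0]['name'] raise KeyError or IndexError if a field is missing from the response. Here is a small sketch of a more forgiving lookup, run against a hand-written dict that only mirrors the keys used above. It is not a real SerpApi response, and first_author is a hypothetical helper.

```python
def first_author(results):
    """Return (name, link) of the first author entry, or (None, None) if absent."""
    authors = results.get("product_info", {}).get("authors") or []
    if not authors:
        return None, None
    author = authors[0]
    return author.get("name"), author.get("link")


# Hand-written stand-in that mirrors only the keys used above (assumption).
sample_results = {
    "product_info": {
        "authors": [
            {
                "name": "ASOS",
                "link": "https://play.google.com/store/apps/developer?id=ASOS",
            }
        ]
    }
}

print(first_author(sample_results))
print(first_author({}))  # no product_info at all
```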

There’s a Scrape Google Play Store App in Python blog post if you need a little bit more code explanation.

Disclaimer: I work for SerpApi.

Answered By: Denis Skopa