Scrape information about Visit Page and App Name on the Google Play Store
Question:
I have created the code below to scrape the app name and the Visit Page URL from a Google Play Store page.
ASOS – get "ASOS" (line 1120 of the page source)
Visit website – get http://www.asos.com from the q= parameter (line 1121 of the page source)
import requests
from bs4 import BeautifulSoup

url = 'https://play.google.com/store/apps/details?id=com.asos.app'
r = requests.get(url)
count = 1  # track the current line number of the response
final = []
for line in r.iter_lines():
    if count == 1120:
        soup = BeautifulSoup(line, "html.parser")
        for row in soup.findAll('a'):
            u = row.find('span')
            t = u.string
            print(t)
    elif count == 1121:
        soup = BeautifulSoup(line, "html.parser")
        for row in soup.findAll('a'):
            u = row.get('href')
            print(u)
    count = count + 1
I can’t paste the full HTML of the page here, but please help me fix this!
Answers:
BeautifulSoup provides a great deal of functionality that you should be taking advantage of.
For starters, your script can be cut down to the following:
import requests
from bs4 import BeautifulSoup

url = 'https://play.google.com/store/apps/details?id=com.asos.app'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
for a in soup.find_all('a', {'class': 'dev-link'}):
    print("Found the URL:", a['href'])
BS4 can parse the raw HTML content, and you can iterate through the parsed tree directly. In this scenario, you want the href of each link with the class name dev-link. Doing so gets you the following output:
Found the URL: https://www.google.com/url?q=http://www.asos.com&sa=D&usg=AFQjCNGl4lHIgnhUR3y414Q8idAzJvASqw
Found the URL: mailto:[email protected]
Found the URL: https://www.google.com/url?q=http://www.asos.com/infopages/pgeprivacy.aspx&sa=D&usg=AFQjCNH-hW1H0fYlsCjp4ERbVh29epqaXA
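Note that the first and third hrefs above are Google redirect wrappers: the real destination sits in the q query parameter. A minimal sketch of unwrapping such a URL with the standard library (the sample URL is copied from the output above; the helper name is my own):

```python
from urllib.parse import urlsplit, parse_qs

def unwrap_google_redirect(href):
    """Return the URL in the `q` parameter of a google.com/url redirect,
    or the href unchanged if it is not a redirect wrapper."""
    parts = urlsplit(href)
    if parts.netloc.endswith("google.com") and parts.path == "/url":
        q = parse_qs(parts.query).get("q")
        if q:
            return q[0]
    return href

wrapped = "https://www.google.com/url?q=http://www.asos.com&sa=D&usg=AFQjCNGl4lHIgnhUR3y414Q8idAzJvASqw"
print(unwrap_google_redirect(wrapped))  # http://www.asos.com
```

Non-redirect hrefs (such as mailto: links) pass through unchanged.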
I’m sure you can tweak it a bit more to get the results you want, but please refer to the BS4 documentation for more information ==> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
The Google Play Store has been redesigned; it is now dynamic, and the data we need is stored as inline JSON.
You can still use a selenium or playwright webdriver to parse it. However, in our case, we can use BeautifulSoup and a regular expression to extract pretty much everything from the app page.
First, extract the relevant <script> element from all <script> elements in the HTML and transform it into a dict with json.loads():
basic_app_info = json.loads(re.findall(r'<script nonce="\w+" type="application/ld\+json">({.*?)</script>', str(soup.select("script")[11]), re.DOTALL)[0])
After that, we can access the dict returned by json.loads() and extract the data:
app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["url"] = basic_app_info.get("url")
Don’t forget to send a user-agent header with the request, so the site assumes you’re a regular user and returns the full page.
from bs4 import BeautifulSoup
import requests, re, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

# https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls
params = {
    "id": "com.asos.app",  # app id
    "gl": "US",            # country of the search
    "hl": "en_GB"          # language of the search
}

html = requests.get("https://play.google.com/store/apps/details", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

app_data = {
    "basic_info": {}
}

# index [11] holds the basic app information
# https://regex101.com/r/zOMOfo/1
basic_app_info = json.loads(re.findall(r'<script nonce="\w+" type="application/ld\+json">({.*?)</script>', str(soup.select("script")[11]), re.DOTALL)[0])

app_data["basic_info"]["name"] = basic_app_info.get("name")
app_data["basic_info"]["url"] = basic_app_info.get("url")

print(json.dumps(app_data, indent=2))
Example output:
{
  "basic_info": {
    "name": "ASOS",
    "url": "https://play.google.com/store/apps/details/ASOS?id=com.asos.app&hl=en_GB&gl=US"
  }
}
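As a side note, the hard-coded script index [11] is brittle and will break if Google reorders the page's scripts. A more robust sketch is to select the JSON-LD script by its type attribute instead of its position, assuming the page still embeds an application/ld+json block; demonstrated here on a static HTML snippet standing in for the fetched page:

```python
import json
from bs4 import BeautifulSoup

# static snippet standing in for the fetched Play Store page
html = """
<html><head>
<script nonce="abc123" type="application/ld+json">
{"name": "ASOS", "url": "https://play.google.com/store/apps/details?id=com.asos.app"}
</script>
</head></html>
"""

soup = BeautifulSoup(html, "html.parser")
# select the JSON-LD script by its type attribute instead of a positional index
script = soup.find("script", type="application/ld+json")
basic_app_info = json.loads(script.string)
print(basic_app_info["name"])  # ASOS
```

The same soup.find() call should work unchanged on the live page, with no regex required.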
Alternatively, you can use the Google Play Store API from SerpApi. It’s a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, and there is no need to create and maintain the parser yourself.
SerpApi code example:
from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),  # your serpapi api key
    "engine": "google_play_product",  # parsing engine
    "store": "apps",                  # app page
    "gl": "us",                       # country of the search
    "product_id": "com.asos.app",     # app id
    "all_reviews": "true"             # shows all reviews
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()

app_name = results['product_info']['authors'][0]['name']
app_url = results['product_info']['authors'][0]['link']

print(app_name, app_url, sep='\n')
Output:
ASOS
https://play.google.com/store/apps/developer?id=ASOS
There’s a Scrape Google Play Store App in Python blog post if you need a little bit more code explanation.
Disclaimer, I work for SerpApi.