How to get a variable's numeric value from a script tag in Python?

Question:

In a given .html page, I have a script tag like so:

<script>
some data 
</script>
<body>
some data
</body>

<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null}; 
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
</script>

I am trying to extract the totalPage number using BeautifulSoup.

My code so far:

pattern = re.compile(r'"totalPage":(\d+);', re.MULTILINE | re.DOTALL)
scripts = soup.find_all('script', text=pattern)
if scripts:
    match = pattern.search(scripts.text)
    print(match)

The code above returns an empty list, whereas I need the value 12 returned as a number. Any help would be appreciated.

Asked By: gndps


Answers:

There are several ways to extract the number:

1. Using plain re

import re
from bs4 import BeautifulSoup


html_doc = """
<script>
some data 
</script>
<body>
some data
</body>

<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null}; 
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
</script>"""

soup = BeautifulSoup(html_doc, "html.parser")

script = soup.find("script", text=lambda t: t and "totalPage" in t)
print(re.search(r"totalPage\D+(\d+)", script.text).group(1))

Prints:

12
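
Note that group(1) yields the string "12" rather than a number. The sketch below assumes the script text has the same shape as in the question and uses only the standard library. The helper name extract_total_page is hypothetical; it converts the match to an int and returns None when nothing is found. The sketch also shows one reason the pattern from the question finds nothing: it requires a ";" directly after the digits, while the actual text continues with ,"pageSize".

```python
import re

# Sample script text, copied from the question.
script_text = '''
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
'''


def extract_total_page(text):
    """Return totalPage as an int, or None if absent (hypothetical helper)."""
    match = re.search(r'"totalPage"\s*:\s*(\d+)', text)
    return int(match.group(1)) if match else None


# The question's pattern requires ';' right after the number, so it finds nothing.
question_pattern = re.compile(r'"totalPage":(\d+);')
no_match = question_pattern.search(script_text)

total_page = extract_total_page(script_text)
print(no_match)    # None
print(total_page)  # 12
```

Separately, soup.find_all returns a list-like ResultSet, so calling .text on it directly fails; soup.find, as used above, returns a single tag.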

2. Using js2py

import js2py

script = soup.find("script", text=lambda t: t and "totalPage" in t)
s = "function $() {" + script.text + " return pageList;}"
print(js2py.eval_js(s)()["totalPage"])

Prints:

12

3. Using re/json

import re
import json

script = soup.find("script", text=lambda t: t and "totalPage" in t)
n = json.loads(re.search(r"pageList = (.*);", script.text).group(1))[
    "totalPage"
]
print(n)

Prints:

12
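
The regex in the third approach relies on the object literal ending with ";" on the same line. As an alternative sketch, json.JSONDecoder.raw_decode can parse one JSON value starting at a given offset and ignore whatever follows it. The helper name read_js_object is hypothetical, and the approach assumes the value assigned to the variable is valid JSON, as it is in the question.

```python
import json
import re

# Sample script text, copied from the question.
script_text = '''
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null}; 
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1}; 
var pageSize = 48;
'''


def read_js_object(text, var_name):
    """Parse the JSON value assigned to `var_name` (hypothetical helper)."""
    match = re.search(r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*", text)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # returns it together with the index where parsing stopped.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


page_list = read_js_object(script_text, "pageList")
print(page_list["totalPage"])  # 12
```

Because json.loads already maps JSON numbers to Python numbers, the result here is an int and needs no further conversion.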
Answered By: Andrej Kesely