How to get a var value (number) from a script tag in Python?
Question:
In a given .html page, I have a script tag like so:
<script>
some data
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
</script>
I am trying to extract the totalPage number using BeautifulSoup (soup).
My current code is:
pattern = re.compile(r'"totalPage":(\d+);', re.MULTILINE | re.DOTALL)
scripts = soup.find_all('script', text=pattern)
if scripts:
    match = pattern.search(scripts.text)
    print(match)
The code above returns an empty list, whereas I need the value 12 returned as a number. Please help.
Answers:
There are several ways to extract the number:
1. Using plain re
import re
from bs4 import BeautifulSoup
html_doc = """
<script>
some data
</script>
<body>
some data
</body>
<script>
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","currentCategoryId":"5b85712ca3834cdebbbc4363","parentCategoryId":"","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
</script>"""
soup = BeautifulSoup(html_doc, "html.parser")
script = soup.find("script", text=lambda t: t and "totalPage" in t)
# \D+ skips the non-digit characters after the key; (\d+) captures the number
print(re.search(r"totalPage\D+(\d+)", script.text).group(1))
Prints:
12
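The pattern in the question cannot match this page for a reason visible in the sample: it requires a `;` directly after the digits, while the script has `,"pageSize"` there. In addition, `find_all` returns a list of tags, which has no `.text`. The regex approach also yields a string, not a number. A minimal stdlib-only sketch that returns an int, assuming the key always appears as `totalPage` followed by non-digits and then the value (the helper name `extract_total_page` and the `SAMPLE` string are illustrative, not part of the original answer):

```python
import re
from typing import Optional

# Shortened copy of the script content from the question (assumption: same shape).
SAMPLE = '''<script>
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
</script>'''


def extract_total_page(text: str) -> Optional[int]:
    """Return the totalPage value as an int, or None if it is absent."""
    # \D+ skips the closing quote and colon; (\d+) captures the digits.
    match = re.search(r"totalPage\D+(\d+)", text)
    return int(match.group(1)) if match else None


print(extract_total_page(SAMPLE))
print(extract_total_page("<script>var x = 1;</script>"))
```

Returning None instead of raising lets the caller decide how to handle pages that lack the variable.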
2. Using js2py
import js2py
script = soup.find("script", text=lambda t: t and "totalPage" in t)
s = "function $() {" + script.text + " return pageList;}"
print(js2py.eval_js(s)()["totalPage"])
Prints:
12
3. Using re/json
import re
import json
script = soup.find("script", text=lambda t: t and "totalPage" in t)
# Capture the object literal assigned to pageList, then parse it as JSON
match = re.search(r"pageList = (.*);", script.text)
n = json.loads(match.group(1))["totalPage"]
print(n)
Prints:
12
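A possible variation on the re/json idea, sketched with the standard library only: locate the `var <name> =` assignment with a regex and let `json.JSONDecoder().raw_decode` parse the value that starts there, so the regex does not have to find the end of the literal. This assumes the assigned value is valid JSON, as it is in the sample page; a JavaScript literal with unquoted keys or single quotes would not parse. The helper name `read_js_var` and the shortened `SCRIPT_TEXT` are illustrative.

```python
import json
import re

# Shortened copy of the script body from the question (assumption: same shape).
SCRIPT_TEXT = '''
var breadcrumbData = {"level":0,"currentCategoryName":"Kebutuhan Dapur","parentCategoryName":null};
var pageList = {"totalData":549,"totalPage":12,"pageSize":48,"currentPage":1};
var pageSize = 48;
'''


def read_js_var(script_text: str, name: str):
    """Decode the JSON value assigned to `var <name> = ...` in script_text."""
    match = re.search(r"\bvar\s+" + re.escape(name) + r"\s*=\s*", script_text)
    if match is None:
        raise KeyError(name)
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (here, the trailing semicolon).
    value, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return value


page_list = read_js_var(SCRIPT_TEXT, "pageList")
print(page_list["totalPage"])
print(read_js_var(SCRIPT_TEXT, "pageSize"))
```

With a BeautifulSoup tag, the same helper could be called on the script's text, for example `read_js_var(script.text, "pageList")`.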