How to extract an array with Beautiful Soup
Question:
I am trying to extract a list from a script tag in an HTML file.
How do I extract the list called markers from the script tag?
from bs4 import BeautifulSoup
import requests
import re
import json
soup = BeautifulSoup(requests.get('url').content, 'html.parser')
scripts = soup.find_all('script')
txt = scripts[22]
print(txt)
The returned data (the value of txt) is in the following format:
<script>jQuery.extend(Drupal.settings, {"basePath":"/","pathPrefix":"en/","setHasJsCookie":0,"ajaxPageState, "markers":[{"latitude":"49.123","longitude":"-123.000","title":"point of interest"}] <script>
Answers:
Using a regex is probably your best bet: capture the bracketed list that follows "markers": and parse it with json.loads.
import re
import json
txt = '<script>jQuery.extend(Drupal.settings, {"basePath":"/","pathPrefix":"en/","setHasJsCookie":0,"ajaxPageState, "markers":[{"latitude":"49.123","longitude":"-123.000","title":"point of interest"}] <script>'
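# Capture the bracketed list after "markers": up to the whitespace before the trailing <script>.
matches = re.findall(r'"markers":(\[.*?\])\s<script>', txt)
# The captured text is valid JSON, so json.loads turns it into a Python list.
lst = json.loads(matches[0])
print(lst)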
Output:
[{'latitude': '49.123', 'longitude': '-123.000', 'title': 'point of interest'}]
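If the list might contain nested brackets, or the text after it is not always a space followed by <script>, a non-greedy regex can stop at the wrong place. A possible alternative, sketched below, is to use the regex only to find where "markers": ends and let json.JSONDecoder.raw_decode parse exactly one JSON value from that position. The helper name extract_json_value is made up for this illustration, and the sample string is the one from the question.

```python
import json
import re


def extract_json_value(text, key):
    """Return the JSON value that follows "key": in text, or None if the key is absent.

    extract_json_value is a hypothetical helper written for this example.
    """
    match = re.search(r'"%s"\s*:\s*' % re.escape(key), text)
    if match is None:
        return None
    # raw_decode parses a single JSON value starting at the given offset and
    # ignores whatever comes after it, so nested brackets are not a problem.
    value, _end = json.JSONDecoder().raw_decode(text, match.end())
    return value


txt = '<script>jQuery.extend(Drupal.settings, {"basePath":"/","pathPrefix":"en/","setHasJsCookie":0,"ajaxPageState, "markers":[{"latitude":"49.123","longitude":"-123.000","title":"point of interest"}] <script>'
markers = extract_json_value(txt, 'markers')
print(markers)
```

With the code from the question, where txt is a Tag rather than a string, pass str(txt) or txt.string to the helper.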