Web scraping: .find doesn't find string in line of web page

Question:

I am writing my first Python program and hope you can help me with my current problem.

I am trying to extract data from a website. I checked the page source, where a certain string (let's say "thisstring") is part of a line.

In the HTML code, the string appears inside a <script> block:

<script>
      anotherstring;
      thisstring = {...};

My current code:

import requests
from bs4 import BeautifulSoup
 
page = requests.get('www.somewebadress.com')
soup = BeautifulSoup(page.content, 'html.parser')
lines = soup.find_all('script')

x = 0  # counter for scripts; it returns the correct number of <script> parts in the HTML code

for line in lines:
    x = x + 1
    txt = line.find('thisstring')  # didn't work with "thisstring" either
    if txt == None:
        print("not found")
    else:
        print("found")
    
print(x)

I tried a lot of different solutions I found on the web, but "thisstring" is never found, even though Python prints it out with print(line).
I think it is quite simple, but I spent the whole day trying to find the correct code.

Does anyone have an idea?

I found several code samples on Stack Overflow and in other Python tutorials for web scraping, but none of them worked. I use Spyder. Could this be a problem?

Asked By: My Ka


Answers:

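BeautifulSoup's line.find('thisstring') looks for a child tag named thisstring, not for that text inside the <script> element, so it returns None. Based on your comments, you can use the re module to extract the variable instead: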

import re

html_text = """
<html>
<script>
    otherscript;
</script>

<script>
      anotherstring;
      thisstring = {"data1": 1, "data2": 2};
</script>
</html>"""

# or:
# html_text = requests.get(...).text

data = re.search(r"thisstring = ({.*});", html_text).group(1)
print(data)

Prints:

{"data1": 1, "data2": 2}
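If the object literal on the real page spans several lines, the pattern above would not match, because "." does not match newlines by default. Here is a minimal sketch of one way to handle that. It assumes the assignment ends at the first "};" after the opening brace, and the sample HTML and the helper name extract_object are made up for illustration:

```python
import re

# Hypothetical sample: the object literal is spread over several lines.
html_text = """
<script>
      anotherstring;
      thisstring = {
          "data1": 1,
          "data2": 2
      };
</script>"""


def extract_object(text, name):
    """Return the raw {...} literal assigned to `name`, or None if absent."""
    # re.DOTALL lets "." cross newlines; ".*?" stops at the first "};".
    pattern = re.escape(name) + r"\s*=\s*(\{.*?\});"
    match = re.search(pattern, text, re.DOTALL)
    return match.group(1) if match else None


data = extract_object(html_text, "thisstring")
print(data)
```

Returning None instead of calling .group(1) directly avoids an AttributeError when the variable is not on the page. A literal that itself contains "};" (for example inside a nested string) would be cut short by this pattern, so treat it as a starting point only.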

Then you can use ast.literal_eval, json or js2py to convert the string to a Python object:

import json

data = json.loads(data)
print(data)
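
Which converter fits depends on what the extracted literal looks like. As a rough illustration (the sample strings and the helper try_parse are invented here), json.loads accepts JSON keywords such as true and null, while ast.literal_eval accepts Python-style literals such as single-quoted strings:

```python
import ast
import json

# Hypothetical inputs: one JSON-style literal, one Python-style literal.
json_style = '{"data1": 1, "active": true, "extra": null}'
python_style = "{'data1': 1, 'data2': 2}"


def try_parse(parser, text):
    """Run `parser` on `text`; return None if the text is not in its dialect."""
    try:
        return parser(text)
    except (ValueError, SyntaxError):
        return None


# json.loads understands true/false/null and requires double quotes.
from_json = try_parse(json.loads, json_style)
print(from_json)

# ast.literal_eval parses Python literals, so single quotes are fine.
from_literal = try_parse(ast.literal_eval, python_style)
print(from_literal)

# Each parser rejects the other dialect here.
print(try_parse(ast.literal_eval, json_style))
print(try_parse(json.loads, python_style))
```

If the page embeds a JavaScript object that is neither valid JSON nor a valid Python literal (unquoted keys, for instance), neither call will work and something like js2py may be needed.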
Answered By: Andrej Kesely