How to extract a key's value from a <script> tag

Question:

I'm trying to extract the user ID from this page
https://www.instagram.com/design.kaf/
using bs4 and regex.

I found a JSON key called "profile_id" inside a <script> tag,
but I can't get my search to match it.

My regex attempt is here:

https://regex101.com/r/WmlAEc/1

I also can't find a way to select that particular <script> tag.

My code:

    import re
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.instagram.com/design.kaf/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    a = str(soup.find_all("script"))
    x = re.findall(r'profile_id":"-?\d+"', a)
    id = int(x[0])
    print(id)
Asked By: Hossam Hassan

Answers:

You can try this code. It finds the key with a string search, then loops over the characters that follow it until it reaches the closing double quote.

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.instagram.com/design.kaf/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
    }

    r = requests.get(url, headers=headers)  # pass the headers defined above
    soup = BeautifulSoup(r.text, 'html.parser')
    s = str(soup.find_all('script'))

    # our required string format: "profile_id":"0123456789....",
    str_to_find = '"profile_id":"'
    index_p = s.find(str_to_find)  # index of the first character (the double quote), or -1
    if index_p == -1:
        raise ValueError('profile_id not found in any <script> tag')

    # the first digit of the id starts at index_p + length of the searched string
    start = index_p + len(str_to_find)
    id_str, counter = '', 0
    while s[start + counter] != '"':  # stop when we reach the closing double quote
        id_str += s[start + counter]
        counter += 1

    print(id_str)  # prints 5172989370 in this case
Answered By: iamawesome
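
The character-by-character loop above can also be written as two `str.find` calls and a slice. The following is only a minimal sketch, not part of the original answer: the helper name `extract_between` and the sample string are hypothetical, and it assumes the page still embeds the id in the form `"profile_id":"<digits>"`.

```python
def extract_between(text, start_marker, end_marker='"'):
    """Return the text between start_marker and the next end_marker, or None."""
    start = text.find(start_marker)
    if start == -1:
        return None  # start marker not present
    start += len(start_marker)
    end = text.find(end_marker, start)
    if end == -1:
        return None  # no closing delimiter after the marker
    return text[start:end]


# Hypothetical sample standing in for str(soup.find_all('script'))
sample = '<script>{"profile_id":"5172989370","username":"design.kaf"}</script>'
print(extract_between(sample, '"profile_id":"'))
```

Returning `None` instead of raising keeps the caller in charge of what to do when the key is missing, for example when the site serves a login page instead of the profile.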

Here is another approach, using the re module:

    import ast
    import re

    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.instagram.com/design.kaf/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.72 Safari/537.36'
    }

    r = requests.get(url, headers=headers)  # pass the headers defined above
    soup = BeautifulSoup(r.text, 'html.parser')
    s = str(soup.find_all('script'))

    # this matches the text "profile_id":"5172989370"
    # (your regex, changed by adding a double quote at the beginning)
    to_be_find_string = re.findall(r'"profile_id":"-?\d+"', s)[0]

    # wrap the match in braces so that it reads as a dict literal
    string_formatted_as_dict = '{' + to_be_find_string + '}'

    # convert the dict-formatted <str> into an actual <dict>
    profile_dict = ast.literal_eval(string_formatted_as_dict)

    print(profile_dict['profile_id'])  # prints your user id, i.e. 5172989370
Answered By: iamawesome
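
A capture group lets the regex return only the digits, so the `ast.literal_eval` step is not needed. This is a sketch under the same assumption about the page format (`"profile_id":"<digits>"`); the names `PROFILE_ID_RE` and `find_profile_id` and the sample HTML are hypothetical and used only for illustration.

```python
import re

# Capture only the digits that follow the "profile_id" key.
PROFILE_ID_RE = re.compile(r'"profile_id":"(-?\d+)"')


def find_profile_id(html):
    """Return the profile id as an int, or None if the key is absent."""
    match = PROFILE_ID_RE.search(html)
    return int(match.group(1)) if match else None


# Hypothetical sample standing in for response.text
sample_html = '<script>window.data = {"profile_id":"5172989370"};</script>'
print(find_profile_id(sample_html))
```

Because the pattern is applied to plain text, it can be run on `response.text` directly, without parsing the HTML first.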