Unable to extract full src attribute from img tag

Question:

[Context: I'm scraping Google Search results with requests + bs4. For example, with soup = BeautifulSoup(requests.get('https://www.google.com/search?q=best+laptops+2022').content), this applies to any img tag found by soup.select('img[id^="dimg"]').]

It seemed simple enough at first: just use soup.find and then .get('src') or .attrs['src'], but major chunks of the src have been replaced with "/////".
[Image: extraction code]


The value is actually much longer:
[Image: DevTools screenshot]

What baffles me is that I saved str(soup) as an HTML file and also used display(HTML(str(soup))), and in both cases the image renders just fine. I can even copy the full src when inspecting the file.
[Image: Colab output with fully rendered images]


But even

str(soup).split('id="dimg_179" src="')[1].split('"')[0]

produces the same truncated value.

I would very much appreciate any explanation of this behavior and/or some suggestions of how to extract the actual src.

Asked By: Driftr95


Answers:

I figured it out while writing the part of the question about inspecting the saved HTML file: I had only used DevTools and never opened the file to read the saved HTML itself.

When I open the HTML in VS Code, there are actually two occurrences of the id dimg_179: one in the img tag itself, and another in a script tag:

<script nonce="w8n56Ul9BxlnjUkznIswGw">(function(){var s='';var i=['dimg_179'];_setImagesSrc(i,s);})();</script>

so I can extract it with a bit of extra effort:

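from IPython.display import HTML, display

# every tag whose markup mentions the id (this also matches all ancestors of the real hits)
dimgTags = soup.find_all(lambda l: 'dimg_179' in str(l))

# get rid of parent tags, keeping only the innermost matches
dimgTags = [d for d in dimgTags if not [t for t in dimgTags if d in t.parents]]
# drop the img tag itself so only the script tag is left
dimgTags = [d for d in dimgTags if d.name != 'img']

if dimgTags:
    # the full src is the string literal assigned to s inside the script
    dimgSrc = str(dimgTags[0]).split("var s='")[1].split("'")[0]
    display(HTML(f'<img src="{dimgSrc}">'))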

I’d still love to know if there are better ways though!
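
One possibly simpler variant of the same idea is to skip the tag filtering and pattern-match the inline scripts directly on the page source (str(soup) or the response text), collecting an id-to-src mapping for every image at once. This is only a sketch: the names extract_dimg_sources, SCRIPT_RE, HEX_ESCAPE_RE and sample_html are made up for illustration, the sample markup is a stand-in rather than real Google output, and it assumes the scripts keep the var s='...';var i=[...];_setImagesSrc(i,s) shape shown above. It also assumes the src string may contain JavaScript escapes such as \x3d (for "="), which would need decoding before use.

```python
import re

# Shape of the inline scripts seen in the saved page:
#   var s='<src>';var i=['<id>', ...];_setImagesSrc(i,s);
# Assumption: the markup keeps this exact shape; it can change at any time.
SCRIPT_RE = re.compile(
    r"var s='(?P<src>[^']*)';\s*var i=\[(?P<ids>[^\]]*)\];\s*_setImagesSrc"
)
# JavaScript-style hex escapes such as \x3d (assumed to appear in the src string).
HEX_ESCAPE_RE = re.compile(r"\\x([0-9a-fA-F]{2})")


def extract_dimg_sources(html):
    """Map each image id to the src string assigned to it in an inline script."""
    sources = {}
    for match in SCRIPT_RE.finditer(html):
        # Turn escapes like \x3d back into the literal character ("=").
        src = HEX_ESCAPE_RE.sub(
            lambda m: chr(int(m.group(1), 16)), match.group("src")
        )
        # One script may list several ids that share the same src.
        for img_id in re.findall(r"'([^']+)'", match.group("ids")):
            sources[img_id] = src
    return sources


# Hypothetical stand-in for the page source, not real Google markup.
sample_html = (
    '<img id="dimg_179" src="placeholder">'
    "<script>(function(){var s='data:image/jpeg;base64,AAAA\\x3d\\x3d';"
    "var i=['dimg_179'];_setImagesSrc(i,s);})();</script>"
)

print(extract_dimg_sources(sample_html))
# → {'dimg_179': 'data:image/jpeg;base64,AAAA=='}
```

With a real page this would be called as extract_dimg_sources(str(soup)), and the result looked up by the id of each img tag returned by soup.select('img[id^="dimg"]'). Ids with no matching script are simply absent from the mapping, so a .get() lookup with a fallback to the tag's own src seems the safer choice.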

Answered By: Driftr95