Scrape img src with try/except in BeautifulSoup

Question:

images = []
try:
    images = [img["src"] for img in soup.select(".img-lazy")]
except:
    images = [img["src"] for img in soup.select(".img-thumb")]
else:
    images = [img["src"] for img in soup.select(".author-bio")]

I am trying to scrape image src values from different pages. It works fine with only try and except, but some pages have the image under a different class name, so I added another except condition. That raised an error, so I added an else condition instead. But now it only scrapes the data from the else block. I want it to look for .img-lazy first, then .img-thumb, and finally .author-bio.

Asked By: nasir


Answers:

First of all: You should (almost) never use bare except clauses like that. Details about this are all over this platform.

In this case, you are shooting yourself in the foot, because you cannot tell which exception was actually raised when that except block is triggered.

Also, with this logic, whenever the code inside the try block executes without a problem (thus assigning the images variable), the except block is skipped and then the else block is executed. This results in the images variable being re-assigned (i.e. overwritten) in that block.

This is how try-except-else constructs work: the else block runs only when the try block raises no exception. (You should read up on that.)
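The control flow described above can be seen in a small sketch that uses only a plain dictionary. The function name run_branches and the sample dictionaries are made up for illustration; they are not part of the question's code.

```python
def run_branches(attrs):
    """Return the names of the branches that executed, in order."""
    executed = []
    try:
        attrs["src"]  # raises KeyError when "src" is missing
        executed.append("try")
    except KeyError:
        executed.append("except")
    else:
        # Runs only when the try block finished without an exception.
        executed.append("else")
    return executed


print(run_branches({"src": "a.png"}))  # ['try', 'else']
print(run_branches({}))                # ['except']
```

In the question's code the first case is the common one, which is why the value assigned in the try block is always overwritten by the else block.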

If I understand your requirements and the documentation for select correctly, you can just do this instead of that whole try-except mess:

images = [img["src"] for img in soup.select(".img-lazy,.img-thumb,.author-bio")]

That select call should return all elements that match any of those class selectors.

However, I would be careful here, unless you know for certain that every HTML element with any of those classes is in fact an <img> (or more specifically, has a src attribute). If any of them does not have a src attribute, that code will raise a KeyError at this point: img["src"]
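A small sketch of that failure mode, assuming a made-up HTML snippet in which the author-bio class sits on a <div> rather than an <img>. The markup and variable names are hypothetical; html.parser is used only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one matching element has no src attribute.
html = '<div class="author-bio">About the author</div><img class="img-lazy" src="a.png">'
soup = BeautifulSoup(html, "html.parser")
matches = soup.select(".img-lazy,.author-bio")

try:
    # Fails on the <div>, because it has no src attribute.
    images = [el["src"] for el in matches]
except KeyError:
    # Fall back to skipping elements without a src attribute.
    images = [el["src"] for el in matches if el.has_attr("src")]

print(images)  # ['a.png']
```

Catching the specific KeyError here, rather than using a bare except, keeps any unrelated error visible.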

I would suggest being as precise as possible with the selector:

images = [
    img["src"] 
    for img in soup.select(
        "img[src].img-lazy,img[src].img-thumb,img[src].author-bio"
    )
]

For example, img[src].img-lazy matches only <img> tags that have a src attribute and the class img-lazy.
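If the goal is instead a strict priority order (use .img-lazy when it matches anything, otherwise .img-thumb, otherwise .author-bio), note that select returns an empty list rather than raising an exception when nothing matches, so no try-except is needed for the fallback. A minimal sketch under that reading of the question follows; the helper name first_matching_images and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup

# Selectors in the priority order described in the question.
SELECTORS = ("img[src].img-lazy", "img[src].img-thumb", "img[src].author-bio")


def first_matching_images(soup, selectors=SELECTORS):
    """Return the src values for the first selector that matches anything."""
    for selector in selectors:
        images = [img["src"] for img in soup.select(selector)]
        if images:  # an empty list means this selector matched nothing
            return images
    return []


# Hypothetical page with no .img-lazy images.
html = (
    '<div><img class="img-thumb" src="thumb.png">'
    '<img class="author-bio" src="bio.png"></div>'
)
soup = BeautifulSoup(html, "html.parser")
print(first_matching_images(soup))  # ['thumb.png']
```

Unlike the combined selector, this returns images from one class only, so it fits when the classes are alternatives rather than things to collect together.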

Answered By: Daniil Fajnberg