How to scrape image src values from Brave Search

Question:

I’m trying to get a list of the src values and the source code from a https://search.brave.com/images?q= image search. I don’t know what the problem is, because the same code works on other sites. Below you can see the code and the HTML tag that I’m trying to scrape.

import urllib.request
import requests
from bs4 import BeautifulSoup

url = "https://search.brave.com/images?q=lfc"
r = requests.get(url)
content = r.content
soup = BeautifulSoup(content, "html.parser")

print("\n 1) Insert into .txt\n")
fp = urllib.request.urlopen(url)
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
with open("txt.txt", "w") as textFile:
    textFile.write(mystr)

print("\n 2) Check if src == true \n")
with open('txt.txt') as f:
    if 'src' in f.read():
        print(" 2) True \n")

print(" 3) Find All Img")
anchors = soup.find_all('img')
all_links = set()
with open("imgUrls.txt", "w") as textFile_1:
    for link in anchors:
        if(link.get('src') != '#'): 
            linkText = url+str(link.get('src'))
            all_links.add(link)
            print(linkText)
            textFile_1.writelines(linkText+'\n')

Below is the tag section as shown in the Brave browser: the img tag with the class name image svelte-qd248k, whose src attribute contains a link. I want to gather all the src links from the elements with the class name image svelte-qd248k.

Brave browser tags

Asked By: AnxiousDino


Answers:

The image data is loaded from an API rather than being present in the page's HTML. You can get the info you need like so:

import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}

r = requests.get('https://search.brave.com/api/images?q=lfc&source=web', headers=headers)
df = pd.DataFrame(r.json()['results'])
print(df)

This will return a dataframe – 150 rows x 7 columns:

title   url page_age    safe    source  thumbnail   properties
0   Lfc Images Free : 546 Lfc Photos Free Royalty ...   https://fanniefrenzel.blogspot.com/2021/04/lfc...   2021-05-07T01:04:00.0000000Z    True    fanniefrenzel.blogspot.com  {'src': 'https://imgs.search.brave.com/GvA-lkD...   {'url': 'https://i.pinimg.com/originals/22/90/...
1   [76+] Lfc Wallpaper on WallpaperSafari  https://wallpapersafari.com/lfc-wallpaper/  2021-05-19T00:22:00.0000000Z    True    wallpapersafari.com {'src': 'https://imgs.search.brave.com/H1oCsoq...   {'url': 'https://cdn.wallpapersafari.com/90/14...
2   The LFC Review - YouTube    https://www.youtube.com/channel/UChf7tE8oAh4UK...   2020-05-28T10:59:00.0000000Z    True    YouTube {'src': 'https://imgs.search.brave.com/uuT_1hI...   {'url': 'https://yt3.ggpht.com/a/AATXAJx70Gsn7...
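Since the question asks for a list of src values, a minimal sketch of pulling them out of the results might look like the following. It assumes each result has the shape visible in the dataframe above (a thumbnail dict with a src key); the function name thumbnail_srcs and the sample records are made up for illustration.

```python
def thumbnail_srcs(results):
    """Collect the thumbnail 'src' value of every result that has one."""
    return [
        result['thumbnail']['src']
        for result in results
        # Skip results without a thumbnail dict or without a 'src' key.
        if (result.get('thumbnail') or {}).get('src')
    ]


# Hypothetical records mimicking the structure of r.json()['results'].
sample_results = [
    {'title': 'First', 'thumbnail': {'src': 'https://imgs.search.brave.com/example-1'}},
    {'title': 'No thumbnail here'},
    {'title': 'Second', 'thumbnail': {'src': 'https://imgs.search.brave.com/example-2'}},
]

srcs = thumbnail_srcs(sample_results)
for src in srcs:
    print(src)
```

With the real response this would be called as thumbnail_srcs(r.json()['results']).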
Answered By: Barry the Platipus

This answer complements Barry the Platipus's answer by also extracting images with pagination using the Brave API.

There’s a blog post, "Scrape Brave Images with Python", with more detailed info on how to extract all images from Brave search using pagination.

To scrape Brave images with pagination, use the offset parameter of the URL, which is 0 for the first page, 151 for the second, and so on. Since the data has to be retrieved from all pages, a while loop is needed:

while True:
    # pagination will be here

In each iteration of the loop, make a request to the Brave API, passing the request parameters and headers. The json() method parses the JSON response into a Python dictionary for further work:

html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()

The new_page_result list contains all the results on the current page and is compared with the old_page_result list from the previous iteration. If the two are the same, the last page has been reached and there is no more new data, so the loop must be broken:

new_page_result = html.get('results')

# In the first iteration of the loop, there is no data in the `old_page_result` list. Therefore, the check will fail
if new_page_result == old_page_result:
    break

By looping through the new_page_result list in a for loop, you can get the data. For each result, data such as title, link, source, width, height, and image are retrieved:

for result in new_page_result:
    data.append({
        'title': result.get('title'),
        'link': result.get('url'),
        'source': result.get('source'),
        'width': result.get('properties').get('width'),
        'height': result.get('properties').get('height'),
        'image': result.get('properties').get('url')
    })
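One thing to watch for: result.get('properties') returns None when a result has no properties key, and the chained .get() would then raise AttributeError. Whether the API ever omits that key is not stated here, so the following is only a defensive sketch; the helper name to_record and the sample inputs are hypothetical.

```python
def to_record(result):
    """Flatten one API result into the dict shape used above."""
    # Fall back to an empty dict so a missing 'properties' key
    # yields None values instead of raising AttributeError.
    properties = result.get('properties') or {}
    return {
        'title': result.get('title'),
        'link': result.get('url'),
        'source': result.get('source'),
        'width': properties.get('width'),
        'height': properties.get('height'),
        'image': properties.get('url'),
    }


# Hypothetical inputs: one complete result and one without 'properties'.
complete = to_record({
    'title': 'Example',
    'url': 'https://example.com/page',
    'source': 'example.com',
    'properties': {'url': 'https://example.com/image.jpg', 'width': 800, 'height': 600},
})
incomplete = to_record({'title': 'No image data'})

print(complete)
print(incomplete)
```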

After extracting the data, increase the value of the offset parameter by 151. The site does the same when you click the button that shows more results, so this simulates that behavior:

params['offset'] += 151
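To see the offset step and the stop condition working together without touching the network, here is a small sketch in which the request is replaced by a stand-in function. The names collect_all_pages, fetch_page and fake_fetch are illustrative assumptions; in real use fetch_page would wrap the requests.get(...) call shown earlier.

```python
def collect_all_pages(fetch_page, step=151):
    """Call fetch_page(offset) repeatedly until a page repeats."""
    collected = []
    old_page_result = []
    offset = 0
    while True:
        new_page_result = fetch_page(offset) or []
        # A page identical to the previous one means there is no new data.
        if new_page_result == old_page_result:
            break
        collected.extend(new_page_result)
        old_page_result = new_page_result
        offset += step
    return collected


# Fake data source standing in for the API: two distinct pages,
# after which the last page is returned again for any further offset.
fake_pages = {0: ['a', 'b'], 151: ['c']}


def fake_fetch(offset):
    return fake_pages.get(offset, fake_pages[151])


all_results = collect_all_pages(fake_fetch)
print(all_results)
```

Passing the fetch function in as an argument keeps the loop logic testable on its own, independent of how the HTTP request is made.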

Also, make sure you send a user-agent request header so that the request looks like a "real" user visit. The default user-agent of the requests library is python-requests, so websites can tell that the request most likely comes from a script. You can check what your own browser's user-agent is and use that value.
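The same applies to the urllib.request.urlopen call in the question, which also announces itself with a library default. The sketch below uses only the standard library to show that default and to build a request with a browser-like user-agent, without sending anything; with requests, the equivalent is the headers= argument used in this answer. The variable names are illustrative.

```python
import urllib.parse
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36')

# Headers a fresh urllib opener adds by default; the user-agent
# value starts with "Python-urllib/".
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers['User-agent'])

# Build (but do not send) a request that carries a browser-like user-agent.
query = urllib.parse.urlencode({'q': 'lfc', 'offset': 0})
request = urllib.request.Request(
    'https://search.brave.com/api/images?' + query,
    headers={'User-Agent': USER_AGENT},
)

# urllib stores header names capitalized, hence 'User-agent'.
print(request.full_url)
print(request.get_header('User-agent'))
```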

Full code example:

import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'lfc',    # query 
    'offset': 0    # pagination
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'content-type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

brave_images = []
old_page_result = []

while True:
    # Fetch one page of results; .json() parses the response body into a dict
    html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()

    # Fall back to an empty list if the response has no 'results' key
    new_page_result = html.get('results') or []

    # A page identical to the previous one means there is no new data
    if new_page_result == old_page_result:
        break

    for result in new_page_result:
        brave_images.append({
            'title': result.get('title'),
            'link': result.get('url'),
            'source': result.get('source'),
            'width': result.get('properties').get('width'),
            'height': result.get('properties').get('height'),
            'image': result.get('properties').get('url')
        })

    # Move on to the next page
    params['offset'] += 151
    old_page_result = new_page_result

print(json.dumps(brave_images, indent=2, ensure_ascii=False))

Output:

[
  {
    "title": "40 Liverpool FC Facts For You To Walk With Them | Facts.net",
    "link": "https://facts.net/lifestyle/sports/liverpool-fc-facts",
    "source": "facts.net",
    "width": 5849,
    "height": 3819,
    "image": "https://facts.net/wp-content/uploads/2020/08/Liverpool-Football-Club-logo.jpg"
  },
  {
    "title": "Lfc Wallpaper (58+ images)",
    "link": "http://getwallpapers.com/collection/lfc-wallpaper",
    "source": "getwallpapers.com",
    "width": 1080,
    "height": 1920,
    "image": "http://getwallpapers.com/wallpaper/full/a/d/d/1114818-most-popular-lfc-wallpaper-1080x1920-desktop.jpg"
  },
  {
    "title": "Lfc Images Free : 546 Lfc Photos Free Royalty Free Stock Photos From Dreamstime - Message us ...",
    "link": "https://fanniefrenzel.blogspot.com/2021/04/lfc-images-free-546-lfc-photos-free.html",
    "source": "fanniefrenzel.blogspot.com",
    "width": 1024,
    "height": 768,
    "image": "https://i.pinimg.com/originals/22/90/59/229059d7b1ce5bc9f1a7e7c5aa25be1d.jpg"
  },
  {
    "title": "LFC Wallpaper Download | MagOne 2016",
    "link": "https://wallpapercarax.blogspot.com/2019/04/lfc-wallpaper-download.html",
    "source": "blogspot.com",
    "width": 1024,
    "height": 768,
    "image": "https://4.bp.blogspot.com/-G-UCe0A1ZdI/XL1RVPcZ_bI/AAAAAAAACsY/-k_Dy7WKtjooOLWrHebK42ynwvkqQM_8ACEwYBhgL/s1600/lfc-wallpaper-download-06.jpg"
  },
  {
    "title": "Report claims Liverpool FC talks ongoing over investment from China Everbright - Liverpool FC ...",
    "link": "https://www.thisisanfield.com/2016/08/report-claims-liverpool-fc-talks-ongoing-investment-chinese-everbright/",
    "source": "This Is Anfield",
    "width": 1200,
    "height": 842,
    "image": "https://www.thisisanfield.com/wp-content/uploads/PROP150218-018-Liverpool_Press_Conf.jpg"
  },
  ... other images
]
Answered By: Artur Chukhrai