How to webscrape image src's from Brave Browser
Question:
I’m trying to get a list of the src values and the source code from a https://search.brave.com/images?q= image search. I don’t really know what the problem is, because the code works on other sites. Below you can see the code and the HTML tag that I’m trying to scrape.
import urllib.request

import requests
from bs4 import BeautifulSoup

url = "https://search.brave.com/images?q=lfc"
r = requests.get(url)
content = r.content
soup = BeautifulSoup(content, "html.parser")

print("\n 1) Insert into .txt\n")
fp = urllib.request.urlopen(url)
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
with open("txt.txt", "w") as textFile:
    textFile.write(mystr)

print("\n 2) Check if src == true \n")
with open('txt.txt') as f:
    if 'src' in f.read():
        print(" 2) True \n")

print(" 3) Find All Img")
anchors = soup.find_all('img')
all_links = set()
with open("imgUrls.txt", "w") as textFile_1:
    for link in anchors:
        if(link.get('src') != '#'):
            linkText = url+str(link.get('src'))
            all_links.add(link)
            print(linkText)
            textFile_1.writelines(linkText+'\n')
Below is the tag section in Brave browser. It is the img tag with class name image svelte-qd248k, which contains the src attribute with a link. I want to gather all the src links from the tags with class name image svelte-qd248k.
Answers:
The image data is retrieved from an API, so it is not in the HTML that requests downloads. You can get the info you need like so:
import requests
import pandas as pd
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}
r = requests.get('https://search.brave.com/api/images?q=lfc&source=web', headers=headers)
df = pd.DataFrame(r.json()['results'])
print(df)
This will return a dataframe – 150 rows x 7 columns:
title url page_age safe source thumbnail properties
0 Lfc Images Free : 546 Lfc Photos Free Royalty ... https://fanniefrenzel.blogspot.com/2021/04/lfc... 2021-05-07T01:04:00.0000000Z True fanniefrenzel.blogspot.com {'src': 'https://imgs.search.brave.com/GvA-lkD... {'url': 'https://i.pinimg.com/originals/22/90/...
1 [76+] Lfc Wallpaper on WallpaperSafari https://wallpapersafari.com/lfc-wallpaper/ 2021-05-19T00:22:00.0000000Z True wallpapersafari.com {'src': 'https://imgs.search.brave.com/H1oCsoq... {'url': 'https://cdn.wallpapersafari.com/90/14...
2 The LFC Review - YouTube https://www.youtube.com/channel/UChf7tE8oAh4UK... 2020-05-28T10:59:00.0000000Z True YouTube {'src': 'https://imgs.search.brave.com/uuT_1hI... {'url': 'https://yt3.ggpht.com/a/AATXAJx70Gsn7...
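Since the question asks for the src links specifically, the nested columns can be flattened into plain URLs. The following is only a minimal sketch: it assumes each result carries a thumbnail dict with a src key and a properties dict with a url key, as the truncated output above suggests. The function name extract_image_urls and the sample records with their example.com URLs are hypothetical placeholders, not real API data.

```python
def extract_image_urls(results):
    """Return (thumbnail_src, original_url) pairs from a list of result dicts."""
    pairs = []
    for result in results:
        # `or {}` guards against a missing or null nested dict.
        thumbnail = result.get("thumbnail") or {}
        properties = result.get("properties") or {}
        pairs.append((thumbnail.get("src"), properties.get("url")))
    return pairs


# Hypothetical records shaped like the rows printed above.
sample_results = [
    {
        "title": "Example image",
        "thumbnail": {"src": "https://example.com/thumb.jpg"},
        "properties": {"url": "https://example.com/full.jpg"},
    },
    {"title": "No image data"},
]

image_urls = extract_image_urls(sample_results)
print(image_urls)
```

With the real response, the call would presumably be extract_image_urls(r.json()['results']).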
This answer complements the one from Barry the Platipus: it also uses the Brave API, but extracts images from every page by paginating.
There’s a blog post, "Scrape Brave Images with Python", with more detailed info on how to extract all images from Brave search using pagination.
To scrape Brave images with pagination, you need to use the offset parameter of the URL, which defaults to 0 for the first page, 151 for the second, and so on. Since data is retrieved from all pages, it is necessary to implement a while loop:
while True:
    # pagination will be here
In each iteration of the loop, you need to make a request to the Brave API, passing the request parameters and headers you created. The json() method then converts the response into a Python object for further work:
html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()
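For reference, passing params is roughly the same as appending a URL-encoded query string by hand. The small standard-library sketch below shows the URL this is expected to produce. It only builds the string and sends nothing, and the names base_url and full_url are illustrative.

```python
from urllib.parse import urlencode

base_url = "https://search.brave.com/api/images"
params = {"q": "lfc", "offset": 0}

# urlencode turns the dict into "key=value" pairs joined by "&".
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)
```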
The new_page_result list contains all the results on the current page. It is compared with the old_page_result list. If they are the same, we have reached the last page and there is no more new data, so you need to break out of the loop:
new_page_result = html.get('results')

# In the first iteration of the loop, there is no data in the `old_page_result` list,
# so the lists differ and the loop continues
if new_page_result == old_page_result:
    break
By iterating over the new_page_result list in a for loop, you can get the data. For each result, the title, link, source, width, height, and image are retrieved:
for result in new_page_result:
    brave_images.append({
        'title': result.get('title'),
        'link': result.get('url'),
        'source': result.get('source'),
        'width': result.get('properties').get('width'),
        'height': result.get('properties').get('height'),
        'image': result.get('properties').get('url')
    })
After extracting the data, you need to increase the value of the offset parameter by 151. The site increases this value in the same way when you click the button that shows more results, so we are simulating that behavior:
params['offset'] += 151
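The stop condition and the offset step can be tried without touching the network by separating the loop from the request. This is only a sketch under stated assumptions: collect_all_pages and fetch_page are hypothetical names, and the fake data assumes that the API keeps returning the last page once the offset runs past the end, which is what the comparison above relies on.

```python
def collect_all_pages(fetch_page, step=151):
    """Call fetch_page(offset) until a page repeats, returning every result."""
    collected = []
    old_page = []
    offset = 0
    while True:
        new_page = fetch_page(offset) or []  # treat a missing page as empty
        if new_page == old_page:
            break  # same page twice (or nothing at all): no more data
        collected.extend(new_page)
        old_page = new_page
        offset += step
    return collected


# Fake two-page data set standing in for the API.
fake_pages = {0: [{"title": "a"}, {"title": "b"}], 151: [{"title": "c"}]}


def fake_fetch(offset):
    # Past the last known offset, keep returning the final page.
    return fake_pages.get(offset, fake_pages[151])


all_results = collect_all_pages(fake_fetch)
print([result["title"] for result in all_results])
```

In real use, fetch_page would wrap the requests.get call shown earlier and return the 'results' list for the given offset.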
Also, make sure you’re sending a user-agent request header so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites can tell that the request most likely comes from a script. Check what’s your user-agent.
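The same applies to urllib.request, which the code in the question uses. The sketch below uses only the standard library to show the default user-agent that urllib announces and how a browser-like one can be attached to a request. Nothing is sent here; the Request object only stores the headers.

```python
from urllib.request import Request, build_opener

# The default opener announces itself as "Python-urllib/<version>".
default_user_agent = dict(build_opener().addheaders)["User-agent"]
print(default_user_agent)

browser_user_agent = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)

# Building a Request does not send anything; it only stores the headers.
request = Request(
    "https://search.brave.com/api/images?q=lfc",
    headers={"User-Agent": browser_user_agent},
)

# urllib normalises header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```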
Code and full example in online IDE:
import requests, json

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'lfc',     # query
    'offset': 0     # pagination
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'content-type': 'application/json',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

brave_images = []
old_page_result = []

while True:
    html = requests.get('https://search.brave.com/api/images', headers=headers, params=params).json()

    new_page_result = html.get('results')

    if new_page_result == old_page_result:
        break

    for result in new_page_result:
        brave_images.append({
            'title': result.get('title'),
            'link': result.get('url'),
            'source': result.get('source'),
            'width': result.get('properties').get('width'),
            'height': result.get('properties').get('height'),
            'image': result.get('properties').get('url')
        })

    params['offset'] += 151
    old_page_result = new_page_result

print(json.dumps(brave_images, indent=2, ensure_ascii=False))
Output:
[
  {
    "title": "40 Liverpool FC Facts For You To Walk With Them | Facts.net",
    "link": "https://facts.net/lifestyle/sports/liverpool-fc-facts",
    "source": "facts.net",
    "width": 5849,
    "height": 3819,
    "image": "https://facts.net/wp-content/uploads/2020/08/Liverpool-Football-Club-logo.jpg"
  },
  {
    "title": "Lfc Wallpaper (58+ images)",
    "link": "http://getwallpapers.com/collection/lfc-wallpaper",
    "source": "getwallpapers.com",
    "width": 1080,
    "height": 1920,
    "image": "http://getwallpapers.com/wallpaper/full/a/d/d/1114818-most-popular-lfc-wallpaper-1080x1920-desktop.jpg"
  },
  {
    "title": "Lfc Images Free : 546 Lfc Photos Free Royalty Free Stock Photos From Dreamstime - Message us ...",
    "link": "https://fanniefrenzel.blogspot.com/2021/04/lfc-images-free-546-lfc-photos-free.html",
    "source": "fanniefrenzel.blogspot.com",
    "width": 1024,
    "height": 768,
    "image": "https://i.pinimg.com/originals/22/90/59/229059d7b1ce5bc9f1a7e7c5aa25be1d.jpg"
  },
  {
    "title": "LFC Wallpaper Download | MagOne 2016",
    "link": "https://wallpapercarax.blogspot.com/2019/04/lfc-wallpaper-download.html",
    "source": "blogspot.com",
    "width": 1024,
    "height": 768,
    "image": "https://4.bp.blogspot.com/-G-UCe0A1ZdI/XL1RVPcZ_bI/AAAAAAAACsY/-k_Dy7WKtjooOLWrHebK42ynwvkqQM_8ACEwYBhgL/s1600/lfc-wallpaper-download-06.jpg"
  },
  {
    "title": "Report claims Liverpool FC talks ongoing over investment from China Everbright - Liverpool FC ...",
    "link": "https://www.thisisanfield.com/2016/08/report-claims-liverpool-fc-talks-ongoing-investment-chinese-everbright/",
    "source": "This Is Anfield",
    "width": 1200,
    "height": 842,
    "image": "https://www.thisisanfield.com/wp-content/uploads/PROP150218-018-Liverpool_Press_Conf.jpg"
  },
  ... other images
]
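To end up with a text file of image links, as the question's imgUrls.txt attempted, the collected records can be written out one URL per line. This is a rough sketch: save_image_urls is a hypothetical helper, the sample records with example.com URLs are placeholders, and a temporary directory is used so the example does not overwrite anything.

```python
import os
import tempfile


def save_image_urls(images, path):
    """Write each record's 'image' URL to path, one per line.

    Records without an image URL and repeated URLs are skipped.
    Returns the number of URLs written.
    """
    seen = set()
    with open(path, "w", encoding="utf-8") as text_file:
        for record in images:
            image_url = record.get("image")
            if image_url and image_url not in seen:
                seen.add(image_url)
                text_file.write(image_url + "\n")
    return len(seen)


# Hypothetical records shaped like the output above.
sample_images = [
    {"title": "first", "image": "https://example.com/a.jpg"},
    {"title": "duplicate", "image": "https://example.com/a.jpg"},
    {"title": "missing image"},
]

output_path = os.path.join(tempfile.mkdtemp(), "imgUrls.txt")
written = save_image_urls(sample_images, output_path)
print(written)
```

With the full example above, the call would presumably be save_image_urls(brave_images, "imgUrls.txt").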