How to get the value of src attribute from Gibiru image-search via Selenium

Question:

I’m working on a web scraping program to collect the src links from every image search on https://gibiru.com/

driver.get("https://gibiru.com/")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').click()
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys("lfc")
driver.find_element_by_css_selector('.form-control.has-feedback.has-clear').send_keys(Keys.RETURN)
driver.find_element(By.XPATH, "/html/body/div[1]/main/div[1]/div/div/div/div[2]").click()
test = driver.find_element(By.XPATH, "//*[@id='___gcse_0']/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img")
print(str(test))

These are the paths to the image:

Element:

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMABCfx3q7rIc6AqY0WSu84w22-PUbEnxkEDqmPqTqNYLrqr0&amp;s" title="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" alt="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" class="gs-image gs-image-scalable">

outerHTML:

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRMABCfx3q7rIc6AqY0WSu84w22-PUbEnxkEDqmPqTqNYLrqr0&amp;s" title="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" alt="Diogo Jota vs Tekkz &amp; Stingrayjnr | 'LFC ePL All-Star Game' - YouTube" class="gs-image gs-image-scalable">

Selector:

#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img 

JS_path:

document.querySelector("#___gcse_0 > div > div > div > div.gsc-wrapper > div.gsc-resultsbox-visible > div.gsc-resultsRoot.gsc-tabData.gsc-tabdActive > div > div.gsc-expansionArea > div:nth-child(1) > div.gs-result.gs-imageResult.gs-imageResult-popup > div.gs-image-thumbnail-box > div > a > img")

Xpath:

//*[@id="___gcse_0"]/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img

Full_Xpath:

/html/body/div[1]/main/div[2]/div[2]/div/div[1]/div/div/div/div/div[5]/div[2]/div[2]/div/div[1]/div[1]/div[1]/div[1]/div/a/img

This is the tag whose src attribute I want to read. The error message says that the test element does not exist.


Asked By: AnxiousLuna


Answers:

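To print the value of the src attribute, wait for the element to become visible using WebDriverWait with visibility_of_element_located(). The results are injected by the Google Custom Search script after the page loads, so an immediate find_element() call can fail to locate the image. You can use either of the following locator strategies: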

  • Using CSS_SELECTOR:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "img.gs-image.gs-image-scalable[alt^='Diogo Jota vs Tekkz'][title*='YouTube']"))).get_attribute("src"))
    
  • Using XPATH:

    print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//img[@class='gs-image gs-image-scalable' and starts-with(@alt, 'Diogo Jota vs Tekkz')][contains(@title, 'YouTube')]"))).get_attribute("src"))
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

You can find a relevant discussion in Python Selenium – get href value
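
Since the question asks for the src of every image rather than one specific result, the same kind of wait can be applied to all thumbnails at once. The following is only a minimal sketch, assuming the results still carry the gs-image class shown in the question's HTML. The names unique_srcs and collect_image_srcs are hypothetical, and the Selenium imports sit inside the function only so that the helper can be reused without a browser:

```python
def unique_srcs(elements):
    """Return the non-empty src values of the elements, in order, without duplicates."""
    seen = set()
    srcs = []
    for element in elements:
        src = element.get_attribute("src")
        if src and src not in seen:
            seen.add(src)
            srcs.append(src)
    return srcs


def collect_image_srcs(driver, timeout=20):
    """Wait until at least one result thumbnail is present, then return all src values."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # "img.gs-image" is taken from the class attribute in the question's HTML;
    # it is an assumption that every result thumbnail carries this class.
    images = WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, "img.gs-image"))
    )
    return unique_srcs(images)
```

Calling collect_image_srcs(driver) once the image results are showing should then give a list of thumbnail URLs instead of a single element.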

Answered By: undetected Selenium

It can also be done with requests, because Gibiru pulls its results from Google:

import requests
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.60 Safari/537.17'}

r = requests.get('https://cse.google.com/cse/element/v1?rsz=20&num=20&hl=en&source=gcsc&gss=.com&cselibv=3e1664f444e6eb06&searchtype=image&cx=partner-pub-5956360965567042:9380749580&q=lfc&safe=off&cse_tok=AB1-RNUhB3siCjwzYPYzrx4PNVWU:1658589428907&exp=csqr,cc&callback=google.search.cse.api5143', headers=headers)
df = pd.DataFrame(json.loads(r.text.split('google.search.cse.api5143(')[1].rsplit(');', 1)[0])['results'])
#print(df)

Alternatively, the URLs can be extracted from the JSON object:

obj = json.loads(r.text.split('google.search.cse.api5143(')[1].rsplit(');', 1)[0].replace('\n', ''))['results']
for x in obj:
    print(x['url'])

This will print out full URLs:

https://i.ytimg.com/vi/5KlWCboXwLc/maxresdefault.jpg
https://m.media-amazon.com/images/I/71KhvpBQtIL._AC_SL1200_.jpg
https://i.ytimg.com/vi/P4w1-oVWb3U/maxresdefault.jpg
https://anfieldindex.com/wp-content/uploads/Fa-Cup-final-pr.png
https://i.ytimg.com/vi/7lsn7Wk5Xpg/maxresdefault.jpg
[....]
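
The split() calls above only work while the response is wrapped in exactly the callback name passed in the URL (google.search.cse.api5143 here). If that name changes, a more tolerant way to strip the wrapper may help. This is a rough sketch, not the only way to do it; unwrap_jsonp is a hypothetical helper name:

```python
import json
import re


def unwrap_jsonp(text):
    """Return the JSON payload of a JSONP response, whatever the callback name is."""
    # Find `callback_name(`, capture everything up to the last `)`,
    # and allow an optional trailing semicolon and whitespace.
    match = re.search(r"[\w.$]+\s*\((.*)\)\s*;?\s*$", text, re.DOTALL)
    if match is None:
        raise ValueError("response is not wrapped in a JSONP callback")
    return json.loads(match.group(1))
```

With this, the URL loop could read `for x in unwrap_jsonp(r.text)['results']:` instead of repeating the callback name.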

Inspect the URL and headers of the GET requests your browser sends when you switch to page 2, 3, etc. to see which parameters change, then adapt your code to scrape the rest of the pages as well.
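
Once you know which query parameter changes between pages, generating the page URLs is mostly string handling. The sketch below assumes, purely for illustration, that the page offset is sent as a start parameter in steps of 20; I have not confirmed this, so check it against what you see in the network tab. The helper name with_query_param and the shortened base URL are hypothetical as well:

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_query_param(url, name, value):
    """Return url with one query parameter set (added or overwritten)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params[name] = [str(value)]
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


# Shortened example URL: a real request also needs the cx and cse_tok
# parameters copied from the browser.
base_url = "https://cse.google.com/cse/element/v1?rsz=20&num=20&searchtype=image&q=lfc"

# Assumption: pages are selected with start=0, 20, 40, ...
page_urls = [with_query_param(base_url, "start", offset) for offset in (0, 20, 40)]
```

Each URL in page_urls could then be passed to requests.get() with the same headers as above.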