Python / Selenium. How to find relationship between data?

Question

I’m creating an Amazon web scraper that returns the name and price of every product in the search results. It will loop through a dictionary of strings (products) and collect the titles and prices for all results. I’m doing this to calculate the average (mean) price of a product and to find the highest and lowest prices for that product on Amazon.

So making the scraper was easy enough. Here’s a snippet so you understand the code I am using.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1")

# retrieving item titles
shoes = driver.find_elements(By.XPATH, '//span[@class="a-size-base-plus a-color-base a-text-normal"]')
shoes_list = [shoe.text for shoe in shoes]

# retrieving prices
price = driver.find_elements(By.XPATH, '//span[@class="a-price"]')
price_list = [p.text for p in price]

# prices are returned with a newline instead of a decimal point
# example: £9\n99 instead of £9.99
# so fixing that

price_list = [p.replace("\n", ".") for p in price_list]

So here’s the issue. Almost without fail, Amazon lists a handful of products with no price, which really messes things up once I’ve put the data into a dataframe:

import pandas as pd

title_and_price = list(zip(shoes_list, price_list))
df = pd.DataFrame(title_and_price, columns=['Product','Price'])

At some point the data gets mixed up and a price ends up next to the wrong product. I have left screenshots below for you to see.

[Screenshot: Missing prices on Amazon site]
[Screenshot: Incorrect data]

Unfortunately, when a product has no price, the scrape does not return a blank entry for it. If it did, I wouldn’t need to ask for help: I could just show a blank price next to the item and everything would stay in order.

Is there any way to alter the code so that it detects a non-displayed price and keeps all the data in order? The data stays in order right up until there’s a product with no price, which happens in every single Amazon search. I’d really appreciate any insight on this.

Asked By: cap1hunna

Answer 1

To make sure each price stays paired with its shoe name, locate the parent element that contains both, read the name and price from inside that element, and append them as a tuple to a list (which then becomes the dataframe), as in the example below:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")


webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

df_list = []
url = 'https://www.amazon.co.uk/s?k=nike+shoes&crid=25W2RSXZBGPX3&sprefix=nike+shoe%2Caps%2C105&ref=nb_sb_noss_1'
browser.get(url)

shoes = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".s-result-item")))
for shoe in shoes:
    # print(shoe.get_attribute('outerHTML'))  # uncomment to inspect each result's HTML
    try:
        shoe_title = shoe.find_element(By.CSS_SELECTOR, ".a-text-normal")
    except Exception:
        continue  # no title in this container (e.g. an ad or layout block): skip it
    try:
        shoe_price = shoe.find_element(By.CSS_SELECTOR, 'span[class="a-price"]')
    except Exception:
        continue  # no price shown for this product: skip it
    df_list.append((shoe_title.text.strip(), shoe_price.text.strip()))
df = pd.DataFrame(df_list, columns = ['Shoe', 'Price'])
print(df)

This would return (depending on Amazon’s appetite for serving ads in html tags similar to products):

    Shoe                                                Price
0   Nike NIKE AIR MAX MOTION 2, Men's Running Shoe...   £79\n99
1   Nike Air Max 270 React Se GS Running Trainers ...   £69\n99
2   NIKE Women's Air Max Low Sneakers                   £69\n99
3   NIKE Men's React Miler Running Shoes                £109\n99
4   NIKE Men's Revolution 5 Flyease Running Shoe        £38\n70
5   NIKE Women's W Revolution 6 Nn Running Shoe         £48\n00
6   NIKE Men's Downshifter 10 Running Shoe              £54\n99
7   NIKE Women's Court Vision Low Better Basketbal...   £30\n00
8   NIKE Team Hustle D 10 Gs Gymnastics Shoe, Blac...   £20\n72
9   NIKE Men's Air Max Wright Gs Running Shoe           £68\n51
10  NIKE Men's Air Max Sc Trainers                      £54\n99
11  NIKE Pegasus Trail 3 Gore-TEX Men's Waterproof...   £134\n95
12  NIKE Women's W Superrep Go 2 Sneakers               £54\n00
13  NIKE Boys Tanjun Running Shoes                      £35\n53
14  NIKE Women's Air Max Bella Tr 4 Gymnastics Sho...   £28\n00
15  NIKE Men's Defy All Day Gymnastics Shoe             £54\n95
16  NIKE Men's Venture Runner Sneaker                   £45\n90
17  Nike Nike Court Borough Low 2 (gs), Boy's Bask...   £24\n00
18  NIKE Men's Court Royale 2 Better Essential Tra...   £25\n81
19  NIKE Men's Quest 4 Running Shoe                     £38\n00
20  Women Trainers Running Shoes - Air Cushion Sne...   £35\n69
21  Men Women Walking Trainers Light Running Breat...   £42\n99
22  JSLEAP Mens Running Shoes Fashion Non Slip Ath...   £44\n99
[...]
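
Since the goal in the question is the average, lowest and highest price, the Price strings need converting to numbers first. Here is a minimal sketch, assuming every price looks like the ones above (a pound sign, the pounds, a newline, then the pence). parse_price and summarise are hypothetical helper names, not part of Selenium or pandas:

```python
from statistics import mean


def parse_price(raw):
    """Convert scraped price text to a float, or return None if it cannot be parsed."""
    # Selenium's .text puts a newline between pounds and pence;
    # treat it as the decimal point, then drop the currency sign and thousands commas.
    cleaned = raw.strip().replace("\n", ".").replace("£", "").replace(",", "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # empty string or an unexpected format


def summarise(raw_prices):
    """Return (mean, lowest, highest) over the prices that could be parsed."""
    prices = [p for p in map(parse_price, raw_prices) if p is not None]
    if not prices:
        return None
    return round(mean(prices), 2), min(prices), max(prices)


print(summarise(["£79\n99", "£69\n99", "", "£38\n70"]))
# prints (62.89, 38.7, 79.99)
```

Blank or unparseable prices are left out of the statistics rather than counted as zero, which would otherwise drag the average down.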

You should pay attention to a couple of things:

I wait for the elements to load in the page before trying to locate them; see the imports (WebDriverWait etc.).
Your results may vary, depending on your advertising profile.
You can select more details for each item and use different CSS/XPath/etc. selectors; this is only meant to give you a head start.
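
If you would rather keep products that show no price (with a blank in the Price column) instead of skipping them, one option is find_elements, which returns an empty list when nothing matches instead of raising. This is a rough sketch. title_and_price, collect_rows and FakeElement are made-up names for illustration, and FakeElement only stands in for Selenium's WebElement so the example runs without a browser:

```python
# By.CSS_SELECTOR in Selenium is simply the string "css selector"; it is spelled out
# here so the sketch runs without Selenium installed. In real code, use By.CSS_SELECTOR.
CSS = "css selector"


def title_and_price(item):
    """Return (title, price) for one result container, or None if it has no title.

    `item` is anything with a Selenium-style find_elements(by, value) method.
    """
    titles = item.find_elements(CSS, ".a-text-normal")
    if not titles:
        return None  # not a product (ad, heading, spacer)
    prices = item.find_elements(CSS, 'span[class="a-price"]')
    # An empty list means no price is displayed, so record a blank instead of skipping.
    price = prices[0].text.strip() if prices else ""
    return titles[0].text.strip(), price


def collect_rows(items):
    """Build the list of (title, price) tuples, dropping non-product containers."""
    rows = (title_and_price(item) for item in items)
    return [row for row in rows if row is not None]


class FakeElement:
    """Offline stand-in for a WebElement (illustration only)."""

    def __init__(self, text="", children=None):
        self.text = text
        self._children = children or {}

    def find_elements(self, by, value):
        return self._children.get(value, [])


items = [
    FakeElement(children={".a-text-normal": [FakeElement("Shoe A")],
                          'span[class="a-price"]': [FakeElement("£9\n99")]}),
    FakeElement(children={".a-text-normal": [FakeElement("Shoe B")]}),  # no price shown
    FakeElement(),  # no title: treated as a non-product
]
print(collect_rows(items))
# prints [('Shoe A', '£9\n99'), ('Shoe B', '')]
```

With a real browser you would pass the shoes list from the WebDriverWait call to collect_rows and hand the result to pd.DataFrame(rows, columns=['Shoe', 'Price']).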

Answered By: platipus_on_fire
