Python: Webscraper Pandas Dataframe Returning Multiple Empty Rows in Between Data

Question

So I’m building an eBay webscraper for work (I should note that I am incredibly new to programming in general, and am entirely self-taught using the internet), and I have made it functional. I am building this with Python 3.11, in a Jupyter Notebook within Azure Data Studio. However, the resulting CSV (and consequently the Excel sheet) contains multiple empty rows:

name,condition,price,options,shipping
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
['Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good'],['Good - Refurbished'],$149.00 to $199.00,['Buy It Now'],
,,,,
,,,,
,,,,
['Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good'],['Good - Refurbished'],$139.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
['Apple iPad 2nd 3rd 4th Generation 16GB 32GB 64GB 128GB PICK:GB - Color *Grade B*'],['Pre-Owned'],$64.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
etc. . .

Here is my code:

import time

import requests
import pandas
import lxml
import selenium
import html5lib

import pandas as pd
from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()
options.headless = True 
options.page_load_strategy = 'none'
chrome_path = ChromeDriverManager().install()

s = Service(chrome_path)
driver = Chrome(options=options, service=s) # headers=headers once I can get it working again

driver.implicitly_wait(5)
browser = webdriver.Chrome(service=s)

# searchkey = input() <-- this commented out portion is for when I have got it more functional so that I can do a more dynamic url
# url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'

data = []

browser.get(url)
time.sleep(10)

content = browser.find_element(By.CSS_SELECTOR, "div[class*='srp-river-results']")
item_contents = content.find_elements(By.TAG_NAME, "li")

def extract_data(content):
    name = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__title']>span")
    if name:
        name = [attr.text for attr in name]
    else:
        name = None
    
    condition = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__subtitle']>span")
    if condition:
        condition = [attr.text for attr in condition]
    else:
        condition = None

    price = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__price']")
    if price:
        price = price[0].text
    else:
        price = None

    purchase_options = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__purchaseOptionsWithIcon']")
    if purchase_options:
        purchase_options = [attr.text for attr in purchase_options]
    else:
        purchase_options = None

    shipping = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__logisticsCost']")
    if shipping:
        shipping = [attr.text for attr in shipping]
    else:
        shipping = None
    
    return {
        "name": name,
        "condition": condition,
        "price": price,
        "options": purchase_options,
        "shipping": shipping
    }

for content in item_contents:
    extracted_data = extract_data(content)
    data.append(extracted_data)
df = pd.DataFrame(data)
df.to_csv("frame.csv", index=False)

Now, looking into the HTML with the Inspect tool, I discovered what I think the problem is. As I am using just the "li" tag in the "item_contents" variable, it seems to be attempting to pull the data sets for the river/carousel at the top (which is in the same div class and is stored in a "li" element), and then within each item card there is a potential for a "Top Rated" status, whose element includes 3 additional "li" elements.

The problem is, I don’t actually know how to fix this? I attempted to adjust the tag selector to include the "data-viewport" bit, but that didn’t seem to work in either By.CSS_SELECTOR or By.TAG_NAME, like so:

item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport]")
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport*='trackableId']")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport]")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport*='trackableId']")

giving me entirely blank dataframes instead. I’ve tried searching how to better select my CSS elements, but I am struggling to get what I want, or at least the answers I’ve found seem to be geared towards different problems than mine. Using dropna works to just clear out those empty rows, but I feel like there must be a better way for me to select my tags or something so that I don’t end up with data like this? If there isn’t, though, I can just continue like that. Just wanting to learn how to better program, I suppose. Any assistance would be great! Thanks in advance!
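
For reference, the dropna workaround mentioned above could look roughly like the sketch below. The rows are made-up stand-ins for the scraper output, not real eBay data; passing how="all" is one way to drop only the rows where every column is empty, so a listing that merely lacks a shipping value would be kept.

```python
import pandas as pd

# Hypothetical rows mimicking the scraper output: <li> elements that are not
# listing cards produce dictionaries whose values are all None.
rows = [
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
    {"name": "Apple iPad 5", "condition": "Good - Refurbished", "price": "$149.00",
     "options": "Buy It Now", "shipping": None},
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
]

df = pd.DataFrame(rows)

# how="all" removes only rows in which every column is missing; the real
# listing survives even though its shipping value is empty.
cleaned = df.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

This cleans up the result after the fact; it does not stop the extra "li" elements from being scraped in the first place.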

Asked By: Flintzer0

Answer 1

Change your selection strategy so that only the listing cards (li.s-item) are selected, and build one dict per item instead of several lists:

for content in browser.find_elements(By.CSS_SELECTOR, ".srp-results li.s-item"):
    data.append({
        'name' : content.find_element(By.CSS_SELECTOR, "div.s-item__title > span").text,
        'condition' : content.find_element(By.CSS_SELECTOR, "div.s-item__subtitle > span").text,
        'price' : content.find_element(By.CSS_SELECTOR, "span.s-item__price").text,
        'purchase_options' : content.find_element(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon")) > 0 else None,
        'shipping' : content.find_element(By.CSS_SELECTOR, "span.s-item__logisticsCost").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__logisticsCost")) else None
    })

But you do not need the Selenium overhead here; simply use requests:

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Name the parser explicitly so BeautifulSoup does not guess one and emit a warning
soup = BeautifulSoup(requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text, 'html.parser')
data = []

for e in soup.select('.srp-results li.s-item'):
    data.append({
        'name' : e.select_one('div.s-item__title > span').text,
        'condition' : e.select_one('div.s-item__subtitle > span').text,
        'price' : e.select_one('span.s-item__price').text,
        'purchase_options' : e.select_one('span.s-item__purchaseOptionsWithIcon').text if e.select_one('span.s-item__purchaseOptionsWithIcon') else None,
        'shipping' : e.select_one('span.s-item__logisticsCost').text if e.select_one('span.s-item__logisticsCost') else None
    })

pd.DataFrame(data)

Output

	name	condition	price	purchase_options	shipping
0	Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB – Good	Good – Refurbished	$139.99 to $199.99	Buy It Now	+$19.40 shipping
1	Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB – Wi-Fi + Cellular – Good	Good – Refurbished	$149.00 to $199.00	Buy It Now	Shipping not specified
2	Apple iPad 5 – 5th Gen 2017 Model 9.7" – 32GB 128GB Wi-Fi – Cellular – Good	Good – Refurbished	$118.99	Buy It Now	+$19.09 shipping
3	Apple iPad Air 1st Gen A1474 32GB Wi-Fi 9.7in Tablet Space Gray iOS 12 – Good	Good – Refurbished	$89.99	Buy It Now	+$18.65 shipping
4	2021 Apple iPad 9th Gen 64/256GB WiFi 10.2"	Brand New	$335.00 to $485.00	Buy It Now	+$34.87 shipping estimate
…
250	2022 APPLE iPAD AIR 5TH GEN 10.9" 256GB STARLIGHT WI-FI TABLET MM9P3LL/A A2588	Brand New	$650.00	or Best Offer	+$21.45 shipping
251	Apple iPad 2 16GB, Wi-Fi, 9.7in – Black 7 pack	Pre-Owned	$17.50		+$48.63 shipping estimate
252	Apple iPad Air 4 (4th Gen) (10.9 inch) – 64GB – 256GB Wi-Fi + Cellular – Good	Good – Refurbished	$439.00 to $549.00	Buy It Now	+$40.14 shipping estimate
253	Apple iPad Air 2 A1567 (WiFi + Cellular Unlocked) 64GB Space Gray (Very Good)	Very Good – Refurbished	$149.99	Buy It Now	+$19.55 shipping
254	Apple iPad Pro, Bundle, 10.5-inch, 64GB, Space Gray, Wi-Fi Only, Original Box	Pre-Owned	$249.00	Buy It Now	+$29.72 shipping estimate
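
To illustrate why the li.s-item selector used in this answer avoids the empty rows, here is a small sketch on a hand-written HTML fragment. The markup is an assumption made up for the demonstration and only imitates the structure described in the question (a non-item "li" plus "li" elements nested inside an item card); it is not real eBay output, and has_title is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (not real eBay markup): one non-item <li>, and one
# listing card that itself contains nested <li> elements.
html = """
<div class="srp-river-results">
  <ul class="srp-results">
    <li>carousel placeholder</li>
    <li class="s-item">
      <div class="s-item__title"><span>Apple iPad 5</span></div>
      <span class="s-item__price">$149.00</span>
      <ul>
        <li>Top Rated</li>
        <li>Free returns</li>
      </ul>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <li> below the results container, nested ones included.
all_li = soup.select(".srp-river-results li")
# Only the <li> elements that carry the s-item class.
cards = soup.select(".srp-results li.s-item")


def has_title(node):
    """Return True if the node contains a listing title."""
    return node.select_one("div.s-item__title > span") is not None


print(len(all_li), sum(has_title(n) for n in all_li))
print(len(cards), sum(has_title(n) for n in cards))
```

In this fragment the broad selector matches four elements of which only one has a title, so three of them would become empty rows; the narrower selector matches just the card.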

Answered By: HedgeHog

Answer 2

Based on HedgeHog's answer.

I highly recommend using XPath with the lxml library to parse the HTML instead of BeautifulSoup, as it is much faster.

import requests
import pandas as pd
from lxml import etree

response_text = requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text

root = etree.HTML(response_text)

items = root.xpath(".//ul[@class='srp-results srp-list clearfix']/li[@class='s-item s-item__pl-on-bottom']")
data = []
for item in items:
    data.append({
        "name": item.xpath(".//div[@class='s-item__title']//text()")[0],
        "condition": item.xpath(".//div[@class='s-item__subtitle']/span/text()")[0],
        "price": "".join(item.xpath(".//span[@class='s-item__price']//text()")),
        "purchase_options": "".join(item.xpath(".//span[@class='s-item__dynamic s-item__purchaseOptionsWithIcon']//text()")),
        "shipping": "".join(item.xpath(".//span[@class='s-item__shipping s-item__logisticsCost']//text()"))
    })

df = pd.DataFrame(data)
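
Two caveats about the XPath version may be worth sketching. The [0] indexing raises an IndexError when an item has no title or subtitle, and an exact @class='...' comparison stops matching as soon as the class list gains or reorders an entry. The example below is a possible way to soften both, run on a hand-written fragment rather than real eBay markup; has_class and first_text are hypothetical helpers introduced only for this illustration.

```python
from lxml import etree

# Hand-written fragment (not real eBay markup); the second card carries an
# extra class and has no subtitle.
html = """
<ul class="srp-results srp-list clearfix">
  <li class="s-item s-item__pl-on-bottom">
    <div class="s-item__title"><span>Apple iPad 5</span></div>
    <div class="s-item__subtitle"><span>Good - Refurbished</span></div>
  </li>
  <li class="s-item s-item__pl-on-bottom s-item--extra">
    <div class="s-item__title"><span>Apple iPad Air 2</span></div>
  </li>
</ul>
"""


def has_class(name):
    """Build an XPath predicate matching one class token among several."""
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def first_text(node, xpath):
    """Return the first text match for xpath, or None when nothing matches."""
    matches = node.xpath(xpath)
    return matches[0].strip() if matches else None


root = etree.HTML(html)
items = root.xpath(f".//li[{has_class('s-item')}]")

data = [
    {
        "name": first_text(item, f".//div[{has_class('s-item__title')}]//text()"),
        "condition": first_text(item, f".//div[{has_class('s-item__subtitle')}]/span/text()"),
    }
    for item in items
]
print(data)
```

With this fragment the exact-match expression from the answer would find only the first card, while the token-based predicate finds both, and the missing subtitle becomes None instead of an exception.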

Comparison between

Answered By: puchal
