Python: Webscraper Pandas Dataframe Returning Multiple Empty Rows in Between Data
Question:
So I’m building an eBay webscraper for work (I should note that I am incredibly new to programming in general, and am entirely self-taught using the internet), and I have made it functional. I am building this with Python 3.11, in a Jupyter Notebook within Azure Data Studio. However, the CSV it produces (and consequently the Excel sheet) contains multiple empty rows:
name,condition,price,options,shipping
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
,,,,
['Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB - Wi-Fi + Cellular - Good'],['Good - Refurbished'],$149.00 to $199.00,['Buy It Now'],
,,,,
,,,,
,,,,
['Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB - Good'],['Good - Refurbished'],$139.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
['Apple iPad 2nd 3rd 4th Generation 16GB 32GB 64GB 128GB PICK:GB - Color *Grade B*'],['Pre-Owned'],$64.99 to $199.99,['Buy It Now'],['Free shipping']
,,,,
,,,,
,,,,
etc. . .
Here is my code:
import time
import requests
import pandas
import lxml
import selenium
import html5lib
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
options = webdriver.ChromeOptions()
options.headless = True
options.page_load_strategy = 'none'
chrome_path = ChromeDriverManager().install()
s = Service(chrome_path)
driver = Chrome(options=options, service=s) # headers=headers once I can get it working again
driver.implicitly_wait(5)
browser = webdriver.Chrome(service=s)
# searchkey = input() <-- this commented out portion is for when I have got it more functional so that I can do a more dynamic url
# url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'
data = []
browser.get(url)
time.sleep(10)
content = browser.find_element(By.CSS_SELECTOR, "div[class*='srp-river-results']")
item_contents = content.find_elements(By.TAG_NAME, "li")
def extract_data(content):
    name = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__title']>span")
    if name:
        name = [attr.text for attr in name]
    else:
        name = None
    condition = content.find_elements(By.CSS_SELECTOR, "div[class*='s-item__subtitle']>span")
    if condition:
        condition = [attr.text for attr in condition]
    else:
        condition = None
    price = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__price']")
    if price:
        price = price[0].text
    else:
        price = None
    purchase_options = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__purchaseOptionsWithIcon']")
    if purchase_options:
        purchase_options = [attr.text for attr in purchase_options]
    else:
        purchase_options = None
    shipping = content.find_elements(By.CSS_SELECTOR, "span[class*='s-item__logisticsCost']")
    if shipping:
        shipping = [attr.text for attr in shipping]
    else:
        shipping = None
    return {
        "name": name,
        "condition": condition,
        "price": price,
        "options": purchase_options,
        "shipping": shipping
    }
for content in item_contents:
    extracted_data = extract_data(content)
    data.append(extracted_data)
df = pd.DataFrame(data)
df.to_csv("frame.csv", index=False)
Now, looking into the HTML with the Inspect tool, I discovered what I think the problem is. As I am using just the "li" tag in the "item_contents" variable, it seems to be attempting to pull the data sets for the river/carousel at the top (which is in the same div class and is stored in a "li" element), and then within each item card there is a potential for a "Top Rated" status, whose element includes 3 additional "li" elements.
The problem is, I don’t actually know how to fix this? I attempted to adjust the tag selector to include the "data-viewport" bit, but that didn’t seem to work in either By.CSS_SELECTOR or By.TAG_NAME, like so:
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport]")
item_contents = content.find_elements(By.TAG_NAME, "li[data-viewport*='trackableId']")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport]")
item_contents = content.find_elements(By.CSS_SELECTOR, "li[data-viewport*='trackableId']")
giving me entirely blank dataframes instead. I’ve tried searching how to better select my CSS elements, but I am struggling to get what I want, or at least the answers I’ve found seem to be geared towards different problems than mine. Using dropna works to just clear out those empty rows, but I feel like there must be a better way for me to select my tags or something so that I don’t end up with data like this? If there isn’t, though, I can just continue like that. Just wanting to learn how to better program, I suppose. Any assistance would be great! Thanks in advance!
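For reference, the dropna workaround mentioned here can also be applied before the DataFrame is built, by skipping any row in which every field came back as None. The following is only a minimal sketch: the sample rows are made up, and has_data is a hypothetical helper name, not part of the original code. In pandas itself, df.dropna(how="all") drops rows where all values are missing.

```python
# Rows shaped like the dicts returned by extract_data(); the values are made up.
rows = [
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
    {"name": ["iPad A"], "condition": ["Pre-Owned"], "price": "$64.99",
     "options": ["Buy It Now"], "shipping": None},
    {"name": None, "condition": None, "price": None, "options": None, "shipping": None},
]


def has_data(row):
    """Return True if at least one field was actually scraped (hypothetical helper)."""
    return any(value is not None for value in row.values())


# Keep only rows that carry some data; rows that are entirely None are dropped.
kept = [row for row in rows if has_data(row)]
print(len(kept))
```

This only hides the symptom. The empty rows still come from "li" elements that are not item cards, so a narrower selector remains the cleaner fix.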
Answers:
Change your selection strategy and use a dict instead of several lists:
for content in browser.find_elements(By.CSS_SELECTOR, ".srp-results li.s-item"):
    data.append({
        'name' : content.find_element(By.CSS_SELECTOR, "div.s-item__title > span").text,
        'condition' : content.find_element(By.CSS_SELECTOR, "div.s-item__subtitle > span").text,
        'price' : content.find_element(By.CSS_SELECTOR, "span.s-item__price").text,
        'purchase_options' : content.find_element(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__purchaseOptionsWithIcon")) > 0 else None,
        'shipping' : content.find_element(By.CSS_SELECTOR, "span.s-item__logisticsCost").text if len(content.find_elements(By.CSS_SELECTOR, "span.s-item__logisticsCost")) else None
    })
But you do not need the Selenium overhead here; simply use requests:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text)
data = []
for e in soup.select('.srp-results li.s-item'):
    data.append({
        'name' : e.select_one('div.s-item__title > span').text,
        'condition' : e.select_one('div.s-item__subtitle > span').text,
        'price' : e.select_one('span.s-item__price').text,
        'purchase_options' : e.select_one('span.s-item__purchaseOptionsWithIcon').text if e.select_one('span.s-item__purchaseOptionsWithIcon') else None,
        'shipping' : e.select_one('span.s-item__logisticsCost').text if e.select_one('span.s-item__logisticsCost') else None
    })
pd.DataFrame(data)
Output
| | name | condition | price | purchase_options | shipping |
|---|---|---|---|---|---|
| 0 | Apple iPad Air 2 2nd WiFi + Cellular Unlocked 16GB 32GB 64GB 128GB – Good | Good – Refurbished | $139.99 to $199.99 | Buy It Now | +$19.40 shipping |
| 1 | Apple iPad 5 (5th Gen -2017 Model) -32GB -128GB – Wi-Fi + Cellular – Good | Good – Refurbished | $149.00 to $199.00 | Buy It Now | Shipping not specified |
| 2 | Apple iPad 5 – 5th Gen 2017 Model 9.7" – 32GB 128GB Wi-Fi – Cellular – Good | Good – Refurbished | $118.99 | Buy It Now | +$19.09 shipping |
| 3 | Apple iPad Air 1st Gen A1474 32GB Wi-Fi 9.7in Tablet Space Gray iOS 12 – Good | Good – Refurbished | $89.99 | Buy It Now | +$18.65 shipping |
| 4 | 2021 Apple iPad 9th Gen 64/256GB WiFi 10.2" | Brand New | $335.00 to $485.00 | Buy It Now | +$34.87 shipping estimate |
| … | | | | | |
| 250 | 2022 APPLE iPAD AIR 5TH GEN 10.9" 256GB STARLIGHT WI-FI TABLET MM9P3LL/A A2588 | Brand New | $650.00 | or Best Offer | +$21.45 shipping |
| 251 | Apple iPad 2 16GB, Wi-Fi, 9.7in – Black 7 pack | Pre-Owned | $17.50 | | +$48.63 shipping estimate |
| 252 | Apple iPad Air 4 (4th Gen) (10.9 inch) – 64GB – 256GB Wi-Fi + Cellular – Good | Good – Refurbished | $439.00 to $549.00 | Buy It Now | +$40.14 shipping estimate |
| 253 | Apple iPad Air 2 A1567 (WiFi + Cellular Unlocked) 64GB Space Gray (Very Good) | Very Good – Refurbished | $149.99 | Buy It Now | +$19.55 shipping |
| 254 | Apple iPad Pro, Bundle, 10.5-inch, 64GB, Space Gray, Wi-Fi Only, Original Box | Pre-Owned | $249.00 | Buy It Now | +$29.72 shipping estimate |
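To see why the narrower selector removes the empty rows, here is a small offline sketch using BeautifulSoup. The HTML is a hypothetical, heavily simplified stand-in for the eBay results page (the "carousel" and "badges" class names are invented for illustration), and text_or_none is a hypothetical helper that avoids looking up each optional field twice.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the real results markup.
html = """
<div class="srp-river-results">
  <ul class="carousel"><li>Promoted thing</li></ul>
  <ul class="srp-results">
    <li class="s-item">
      <div class="s-item__title"><span>iPad A</span></div>
      <span class="s-item__price">$64.99</span>
      <ul class="badges"><li>Top Rated</li><li>Fast shipping</li></ul>
    </li>
    <li class="s-item">
      <div class="s-item__title"><span>iPad B</span></div>
      <span class="s-item__price">$89.99</span>
      <span class="s-item__logisticsCost">Free shipping</span>
    </li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag, selector):
    """Text of the first match inside tag, or None if nothing matches."""
    found = tag.select_one(selector)
    return found.get_text(strip=True) if found else None


# Every <li> under the container, including carousel and nested badge items.
all_li = soup.select(".srp-river-results li")
# Only the item cards: <li> elements that carry the s-item class.
cards = soup.select(".srp-results li.s-item")

rows = [
    {
        "name": text_or_none(card, "div.s-item__title > span"),
        "price": text_or_none(card, "span.s-item__price"),
        "shipping": text_or_none(card, "span.s-item__logisticsCost"),
    }
    for card in cards
]
print(len(all_li), len(cards))
```

In this snippet the bare "li" query matches five elements while the class-based query matches only the two item cards, which is the same effect that produced the runs of empty rows in the question. The real page markup may differ and can change over time, so the selectors should be checked against the live HTML.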
Based on HedgeHog's answer.
What I can highly recommend is using XPath and the lxml library to parse the HTML instead of BeautifulSoup, as it is typically much faster.
import requests
import pandas as pd
from lxml import etree
response_text = requests.get('https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240').text
root = etree.HTML(response_text)
items = root.xpath(".//ul[@class='srp-results srp-list clearfix']/li[@class='s-item s-item__pl-on-bottom']")
data = []
for item in items:
    data.append({
        "name": item.xpath(".//div[@class='s-item__title']//text()")[0],
        "condition": item.xpath(".//div[@class='s-item__subtitle']/span/text()")[0],
        "price": "".join(item.xpath(".//span[@class='s-item__price']//text()")),
        "purchase_options": "".join(item.xpath(".//span[@class='s-item__dynamic s-item__purchaseOptionsWithIcon']//text()")),
        "shipping": "".join(item.xpath(".//span[@class='s-item__shipping s-item__logisticsCost']//text()"))
    })
df = pd.DataFrame(data)
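One caveat with the XPath above: @class='...' compares the whole attribute string, so an item whose class list has an extra or reordered class is silently skipped. A common XPath 1.0 idiom matches a single class token instead. The sketch below runs on a made-up HTML snippet; the extra class "s-item--large" is invented purely to show the difference and is not taken from the real page.

```python
from lxml import etree

# Hypothetical snippet: the second item carries an extra, made-up class.
html = """
<ul class="srp-results srp-list clearfix">
  <li class="s-item s-item__pl-on-bottom">
    <div class="s-item__title"><span>iPad A</span></div>
  </li>
  <li class="s-item s-item__pl-on-bottom s-item--large">
    <div class="s-item__title"><span>iPad B</span></div>
  </li>
</ul>
"""
root = etree.HTML(html)

# Exact string comparison: only matches when the class attribute is identical.
exact = root.xpath(".//li[@class='s-item s-item__pl-on-bottom']")

# Token match: pad with spaces so 's-item' is matched as a whole class name.
token = root.xpath(
    ".//li[contains(concat(' ', normalize-space(@class), ' '), ' s-item ')]"
)

# string(...) returns the text of the first matching node, or '' if none.
names = [li.xpath("string(.//div[@class='s-item__title'])").strip() for li in token]
print(len(exact), len(token), names)
```

Using string(...) rather than indexing with [0] also avoids an IndexError when a field is missing from a card.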
Comparison between