Python WebScraper w/ BeautifulSoup: Not Scraping All Pages

Question:

I’m a brand new coder who was tasked (by my company) with making a web scraper for eBay, to assist the CFO in finding inventory items when we need them. I’ve got it set up to scrape multiple pages, but when the Pandas DataFrame loads, the number of results doesn’t match the number of pages it’s supposed to be scraping. Here is the code (I’m searching for iPads just for the sheer volume and degree of variance in the results):

import time
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

data = []

# searchkey = input()
# base_url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

# Loop over the first 10 result pages and scrape every listing on each page
for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    time.sleep(10)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        # Fetch the individual listing page and pull the details from it
        soup2 = BeautifulSoup(requests.get(item_url).text, 'html.parser')
        for content in soup2.select('.lsp-c'):
            data.append({
                'item_name' : content.select_one('h1.x-item-title__mainTitle > span').text,
                'name' : 'Click Here to see Webpage',
                'url' : item_url,
                'hot' : "Hot!" if content.select_one('div.d-urgency') else "",
                'condition' : content.select_one('span.clipped').text,
                'price' : content.select_one('div.x-price-primary > span').text,
                'make offer' : 'Make Offer' if content.select_one('div.x-offer-action') else 'Contact Seller'
            })

df = pd.DataFrame(data)

df['link'] = df['name'] + '#' + df['url']

# Split the combined "name#url" string back apart and render it as an HTML anchor
def make_clickable_both(val):
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'

df2 = df.drop(columns=['name', 'url'])
df2.style.format({'link': make_clickable_both})

The results appear like so:

item_name hot condition price make offer link
0 Apple iPad Air 2 2nd WiFi + Ce… Hot! Good – Refurbished US $169.99 Contact Seller Click Here to see Webpage
1 Apple iPad 2nd 3rd 4th Generat… Hot! Used US $64.99 Contact Seller Click Here to see Webpage
2 Apple iPad 6th 9.7" 2018 Wifi … Very Good – Refurbished US $189.85 Contact Seller Click Here to see Webpage
3 Apple iPad Air 1st 2nd Generat… Hot! Used US $54.89/ea Contact Seller Click Here to see Webpage
4 Apple 10.2" iPad 9th Generatio… Hot! Open box US $269.00 Contact Seller Click Here to see Webpage
300 Apple iPad 8th 10.2" Wifi or… Good – Refurbished US $229.85 Contact Seller Click Here to see Webpage

Which is great! That last column is even a clickable link, just as the function defines, and it works properly. However, based on my URL parameters, that’s only about half the data I should have received.

So, in the URL, the two key pieces are page_url = base_url + '&_pgn=' + str(page), which sets the page number for each URL I pull the list of links from, and &_ipg=60, which determines how many items are loaded per page (eBay offers three options: 60, 120, 240). With my current settings (10 pages at 60 items per page) I should be seeing roughly 600 results, but instead I got 300. I added the timer to see whether pausing between pages would help me get all the results, but I’ve had no such luck. Anyone got ideas about what I did wrong, or what I can do to improve? Any bit of info is appreciated!
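
To make the math concrete, here is roughly how the page URLs get built and the total I’d expect (same parameters as above):

base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

# 10 pages at 60 items per page should give roughly 600 listings
for page in range(1, 11):
    print(base_url + '&_pgn=' + str(page))

print('expected:', 10 * 60, 'items')  # ~600, but the DataFrame only has ~300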

Asked By: Flintzer0


Answers:

Starting with page 5, the pages seem to be rendered differently and soup.select('.srp-results li.s-item') always returns an empty list (of URLs).

That is why the length of data stays stuck at 300, even though there are more results.

So, there is nothing wrong with your code and there is no need to pause for 10 seconds.
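
You can see this for yourself with a quick check (a minimal sketch reusing the base_url and selector from the question; it just counts how many listing links each results page returns):

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

# Count the listing links found on each of the 10 results pages;
# past a certain page the selector comes back empty.
for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    print(page, len(soup.select('.srp-results li.s-item')))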

Leaving the code otherwise unchanged, your best option is to set &_ipg to 240; you get more, if not all, of the results (it just takes a while). The only change is the &_ipg value in the URL, shown after the output below:

print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   item_name   1020 non-null   object
 1   name        1020 non-null   object
 2   url         1020 non-null   object
 3   hot         1020 non-null   object
 4   condition   1020 non-null   object
 5   price       1020 non-null   object
 6   make offer  1020 non-null   object
dtypes: object(7)
memory usage: 55.9+ KB
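
For reference, the only change this approach needs is the items-per-page value in the search URL from the question:

# Same search URL as in the question, but asking for 240 items per page
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=240'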
Answered By: Laurent

I actually dug more into what popped up when parsing the HTML, and discovered it was because of eBay denying bots access past 5 pages of results! So, changing my code to add:

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
soup = BeautifulSoup(requests.get(page_url, headers=headers).text, 'html.parser')

it actually fixes the issue! Should have known.
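
In case it helps, here is a sketch of how the header can be wired into the whole loop; I’m using a requests.Session (my own variation, not required) so the same user-agent goes out with both the results-page requests and the individual item requests:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/53.0.2785.143 Safari/537.36'}

# A Session attaches the custom user-agent to every request in the loop
session = requests.Session()
session.headers.update(headers)

base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    soup = BeautifulSoup(session.get(page_url).text, 'html.parser')
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        soup2 = BeautifulSoup(session.get(item_url).text, 'html.parser')
        # ... same per-item parsing as in the question ...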

Answered By: Flintzer0