Python WebScraper w/ BeautifulSoup: Not Scraping All Pages
Question:
I’m a brand new coder who was tasked (by my company) with making a web scraper for eBay, to assist the CFO in finding inventory items when we need them. I’ve got it developed to scrape from multiple pages, but when the Pandas DataFrame loads, the number of results does not match how many pages it’s supposed to be scraping. Here is the code (I am using iPads just for the sheer volume and degree of variance in the results):
import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
# searchkey = input()
# base_url = 'https://www.ebay.com/sch/i.html?_nkw=' + searchkey + '&_sacat=0&_ipg=240'
base_url = 'https://www.ebay.com/sch/i.html?_nkw=ipads&_sacat=0&_ipg=60'

for page in range(1, 11):
    page_url = base_url + '&_pgn=' + str(page)
    time.sleep(10)
    soup = BeautifulSoup(requests.get(page_url).text, 'html.parser')
    for links in soup.select('.srp-results li.s-item'):
        item_url = links.a['href']
        soup2 = BeautifulSoup(requests.get(item_url).text, 'html.parser')
        for content in soup2.select('.lsp-c'):
            data.append({
                'item_name': content.select_one('h1.x-item-title__mainTitle > span').text,
                'name': 'Click Here to see Webpage',
                'url': str(item_url),
                'hot': "Hot!" if content.select_one('div.d-urgency') else "",
                'condition': content.select_one('span.clipped').text,
                'price': content.select_one('div.x-price-primary > span').text,
                'make offer': 'Make Offer' if content.select_one('div.x-offer-action') else 'Contact Seller'
            })
df = pd.DataFrame(data)
df['link'] = df['name'] + '#' + df['url']

def make_clickable_both(val):
    name, url = val.split('#')
    return f'<a href="{url}">{name}</a>'

df2 = df.drop(columns=['name', 'url'])
df2.style.format({'link': make_clickable_both})
The results of these appear like so:
| | item_name | hot | condition | price | make offer | link |
|---|---|---|---|---|---|---|
| 0 | Apple iPad Air 2 2nd WiFi + Ce… | Hot! | Good – Refurbished | US $169.99 | Contact Seller | Click Here to see Webpage |
| 1 | Apple iPad 2nd 3rd 4th Generat… | Hot! | Used | US $64.99 | Contact Seller | Click Here to see Webpage |
| 2 | Apple iPad 6th 9.7" 2018 Wifi … | | Very Good – Refurbished | US $189.85 | Contact Seller | Click Here to see Webpage |
| 3 | Apple iPad Air 1st 2nd Generat… | Hot! | Used | US $54.89/ea | Contact Seller | Click Here to see Webpage |
| 4 | Apple 10.2" iPad 9th Generatio… | Hot! | Open box | US $269.00 | Contact Seller | Click Here to see Webpage |
| … | | | | | | |
| 300 | Apple iPad 8th 10.2" Wifi or… | | Good – Refurbished | US $229.85 | Contact Seller | Click Here to see Webpage |
Which is great! That last column is even a clickable link, just as the function defines, and it works properly. However, based on my URL, that is only about half the data I should have received.
In the URL, the two key pieces are page_url = base_url + '&_pgn=' + str(page), which sets the page number for each URL I pull the list of links from, and &_ipg=60, which determines how many items are loaded per page (eBay offers 3 options for this: 60, 120, 240). So with my current settings (10 pages at 60 items each), I should be seeing roughly 600 results, but instead I got 300. I added the timer to see whether waiting a little between pages would help me get all the results, but I've had no such luck. Anyone have ideas about what I did wrong, or what I can do to improve? Any bit of info is appreciated!
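For reference, the pagination described above can be sketched like this (the parameter values mirror my URL and are just what I'm passing, not official eBay documentation):

```python
from urllib.parse import urlencode

# Sketch of the pagination: same base URL, one _pgn value per page.
base = 'https://www.ebay.com/sch/i.html'
params = {'_nkw': 'ipads', '_sacat': '0', '_ipg': '60'}  # 60 items per page
page_urls = [base + '?' + urlencode({**params, '_pgn': str(p)})
             for p in range(1, 11)]

expected = len(page_urls) * 60  # 10 pages x 60 items -> about 600 listings
```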
Answers:
Starting at page 5, the pages seem to be rendered differently, and soup.select('.srp-results li.s-item') always returns an empty list (of URLs). That is why the length of data remains stuck at 300, even though there are more results.
So there is nothing wrong with your code, and there is no need to pause for 10 seconds.
Leaving the code otherwise unchanged, your best option is to set &_ipg to 240; you then get more, if not all, of the results (after a certain time):
print(df.info())
# Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020 entries, 0 to 1019
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 item_name 1020 non-null object
1 name 1020 non-null object
2 url 1020 non-null object
3 hot 1020 non-null object
4 condition 1020 non-null object
5 price 1020 non-null object
6 make offer 1020 non-null object
dtypes: object(7)
memory usage: 55.9+ KB
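To see for yourself where the listings stop coming back, you could count how many links your selector finds on each page. A small sketch of that check, using the question's selector on toy HTML (the sample markup here is invented for illustration):

```python
from bs4 import BeautifulSoup

# Count how many listing links a results page yields with the question's
# selector; pages that return 0 are being served a different layout.
def count_items(html):
    soup = BeautifulSoup(html, 'html.parser')
    return len(soup.select('.srp-results li.s-item'))

# Toy page mimicking the structure the selector expects, with two listings.
sample = ('<ul class="srp-results">'
          '<li class="s-item"><a href="#"></a></li>'
          '<li class="s-item"><a href="#"></a></li>'
          '</ul>')
```

Applied to each requests.get(page_url).text inside the loop, a count of 0 from page 5 onward would confirm the behavior described above.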
I actually dug more into what popped up when parsing the HTML, and discovered it was because eBay denies bots access past 5 pages of results! So changing my code to send a browser User-Agent on the request inside the loop:
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
soup = BeautifulSoup(requests.get(page_url, headers=headers).text, 'html.parser')
actually fixes the issue! Should have known.
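A minimal sketch of the same fix using a requests.Session, so the header is sent on every request (search pages and item pages alike) without repeating it; the User-Agent string is the one from the answer above:

```python
import requests

# A Session attaches its headers to every request it makes, so each
# paginated fetch identifies itself with the same browser-like User-Agent.
session = requests.Session()
session.headers.update({
    'user-agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/53.0.2785.143 Safari/537.36')
})
# html = session.get(page_url).text  # header applied automatically
```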