How To Use FindAll While Web Scraping

Question:

I want to scrape https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw=xbox&_pgn=2&_skc=50&rt=nc and get the titles (Microsoft Xbox 360 E 250 GB Black Console, Microsoft Xbox One S 1TB Console White with 2 Wireless Controllers, etc.). In due course I want to feed the Python script different eBay URLs, but for the sake of this question I just want to focus on one specific eBay URL.

I then want to add those titles to a data frame, which I would write to Excel. I think I can do this part myself.

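For reference, a minimal sketch of that last step, assuming the titles end up in a plain list of strings (the titles name and the output filename are illustrative):

import pandas as pd

titles = ["Microsoft Xbox 360 E 250 GB Black Console"]  # placeholder list of scraped titles
df = pd.DataFrame({"Title": titles})
df.to_excel("xbox_titles.xlsx", index=False)  # writing .xlsx requires openpyxl
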
Did not work –

for post in soup.findAll('a',id='ListViewInner'):
    print (post.get('href'))

Did not work –

for post in soup.findAll('a',id='body'):
    print (post.get('href'))

Did not work –

h1 = soup.find("a",{"class":"lvtitle"})
print(h1)

Did not work –

for post in soup.findAll('a',attrs={"class":"left-center"}):
    print (post.get('href'))

Did not work –

for post in soup.findAll('a',{'id':'ListViewInner'}):
    print (post.get('href'))

This gave me links for the wrong parts of the web page. I know href gives hyperlinks and not titles, but I figured that if the code below had worked, I could amend it for titles –

for post in soup.findAll('a'):
    print (post.get('href'))

Here is all my code –

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import urllib.request
from bs4 import BeautifulSoup

#BaseURL, Syntax1 and Syntax2 should be standard across all
#Ebay URLs, whereas Request and PageNumber can change 

BaseURL = "https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw="

Syntax1 = "&_skc=50&rt=nc"

Request = "xbox"

Syntax2  = "&_pgn="

PageNumber ="2"

URL = BaseURL + Request + Syntax2 + PageNumber + Syntax1


print (URL)
HTML = urllib.request.urlopen(URL).read()

#print(HTML)

soup = BeautifulSoup(HTML, "html.parser")

#print (soup)

for post in soup.findAll('a'):
    print (post.get('href'))
Asked By: Ross Symonds

Answers:

Use a CSS selector, which is faster. Your findAll attempts failed because ListViewInner is the id of a container element, not of the <a> tags themselves; the selector #ListViewInner a matches every anchor inside that container.

import requests
from bs4 import BeautifulSoup

url = 'https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw=xbox&_pgn=2&_skc=50&rt=nc'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
for post in soup.select("#ListViewInner a"):
    print(post.get('href'))

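For comparison, here is the equivalent using find() and find_all(), a sketch assuming the page still serves the ListViewInner container:

container = soup.find(id='ListViewInner')  # the container element, not an <a> tag itself
if container:
    for post in container.find_all('a'):
        print(post.get('href'))
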
Use the format() function instead of string concatenation.

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import urllib.request
from bs4 import BeautifulSoup

BaseURL = "https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw={}&_pgn={}&_skc={}&rt={}"

skc = "50"
rt = "nc"
Request = "xbox"
PageNumber = "2"

URL = BaseURL.format(Request,PageNumber,skc,rt)
print(URL)
HTML = urllib.request.urlopen(URL).read()
soup = BeautifulSoup(HTML,"html.parser")
for post in soup.select('#ListViewInner a'):
    print(post.get('href'))
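
On Python 3.6+ an f-string does the same job and is arguably more readable. A sketch of the same URL:

URL = f"https://www.ebay.co.uk/sch/i.html?_from=R40&_sacat=0&_nkw={Request}&_pgn={PageNumber}&_skc={skc}&rt={rt}"
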
Answered By: KunduK

I see you set the second page of search results in the parameters, but you can also extract data from all pages using non-token-based pagination.

Using CSS selectors to locate the required elements is quicker than digging through the raw HTML by hand.

The SelectorGadget Chrome extension can help you pick those selectors without using the browser dev tools, although it does not always work perfectly if the page relies heavily on JavaScript (in this case it works).

Also, if you need to extract data from other eBay domains, it is enough to replace the domain with the one you need; the rest of the code remains unchanged.

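For example, a sketch that swaps the domain in the requests.get call from the code below (ebay.de is purely illustrative):

page = requests.get('https://www.ebay.de/sch/i.html', params=params, headers=headers, timeout=30)
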
Check code in the online IDE.

from bs4 import BeautifulSoup
import requests, json, lxml


# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
}
   
params = {
    '_nkw': 'xbox',         # search query 
    '_pgn': 1               # page number
}

data = []
limit = 5                   # page limit (if needed)
while True:
    page = requests.get('https://www.ebay.co.uk/sch/i.html', params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(page.text, 'lxml')
    
    print(f"Extracting page: {params['_pgn']}")

    print("-" * 10)
    
    for products in soup.select(".s-item__info"):
        link = products.select_one(".s-item__link")["href"]
        
        data.append({
          "link": link 
        })

    if params['_pgn'] == limit:
        break
    if soup.select_one(".pagination__next"):
        params['_pgn'] += 1
    else:
        break

print(json.dumps(data, indent=2, ensure_ascii=False))

Example output:

[
  {
    "link": "https://www.ebay.com/itm/313995892137?hash=item491b9d15a9:g:J2MAAOSw~bRifpzj&amdata=enc%3AAQAHAAAA4DpMGtQHvolyljxCOQKcSquVbHSQsUxFYTzdIfd%2BFkL7uu22tZCTlfjunr%2FpezpiuyX53Lx%2Famof4B%2Bayuz%2FvGnW2cRNM7nT18fBpwJqEwLw8ly5DWvtiJvCWKSCN8%2F1YhUQFELpWSnR%2BRsQT2C5AoZ8Azjj4etSBU4vF%2BQlFFFGpf%2BPRMLP5cQJqd%2F2KSCollWb7yq%2BX3IURDrB1tXtbF6P5Xd%2FhzUNIzVv02HkTCkwv3ojjl9KV2MLpfA9w8%2Be6xnnx5AfhWiGSZZ7V9Okvipr%2B3HWcszN4uTgloALBrns%7Ctkp%3ABk9SR7rH1qnZYQ"
  },
  {
    "link": "https://www.ebay.com/itm/304770170066?hash=item46f5b7bcd2:g:ebcAAOSwT9djwh95&amdata=enc%3AAQAHAAAAwJcxtd62nuajIk56MLeCXB7AXEbzKmkt99dXdzIej2bd63pct6Ncbt85ws%2F%2B1OQCe51pLeY9ZytskHeS5lNeOvj31CGOA4q6N3dPyHowj1vnOqqUo5piWWNzCYbFapsH6FOAZ09aFfiUCuxt0yAjTDSeJJD3t36walFXnDme7W7mOEcFBGCs5JLVNlx9ZETOh4VXNzYiKKov1lZs%2BjA%2BRY4oaBDDAxR6FB33NCO02j5OzPlfw8KAUxnpJPnWfA8ZNQ%3D%3D%7Ctkp%3ABk9SR7rH1qnZYQ"
  },
  {
    "link": "https://www.ebay.com/itm/384820310840?hash=item599913f338:g:cEQAAOSw~QNiTeGO:sc:ShippingMethodStandard!29405!US!-1&amdata=enc%3AAQAHAAAA4D4Ig10eel0xwkapJj05fqHi76GUNC0DZPJXHh7MahTM2nf6K9f26IQ0tlXAW3zwb6JBqA%2Fy3pbU%2Bx%2BidkkQzhXQWUeBY3ybe1DE%2F3jDwFcnh%2FL6bmbtT265oHpegLadvV92ZfGyfexeyqQRCzLxXO5PgOCyXvWt470Q7RdGJ2iVsStKQK9e85x%2FJzpe2nyNZQZvo%2BvaVREej%2F4LN9UmO7bhDJpF%2Bm%2BL%2BtkTuao4YkVLFR%2F6Lqqv2kPVdwLg880w9mct5r%2BmPxclXYBaDexsGLTCNY6qdOf6RJo5zaPombCD%7Ctkp%3ABFBMusfWqdlh"
  },
  other results ...
]

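Since the question is ultimately about titles rather than links, the same loop can collect both. A sketch, assuming eBay's current markup exposes the title under the s-item__title class:

for product in soup.select(".s-item__info"):
    title = product.select_one(".s-item__title")           # may be missing on some tiles
    link = product.select_one(".s-item__link")["href"]
    data.append({
        "title": title.get_text(strip=True) if title else None,
        "link": link
    })
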
As an alternative, you can use Ebay Organic Results API from SerpApi. It’s a paid API with a free plan that handles blocks and parsing on their backend.

Example code with pagination:

from serpapi import EbaySearch
import json

params = {
    "api_key": "...",                 # serpapi key, https://serpapi.com/manage-api-key   
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.co.uk",      # ebay domain
    "_nkw": "xbox",                   # search query
    "LH_Sold": "1",                   # shows sold items
    "_pgn": 1                         # page number
}

search = EbaySearch(params)           # where data extraction happens

limit = 5
page_num = 0
data = []

while True:
    results = search.get_dict()       # JSON -> Python dict

    if "error" in results:
        print(results["error"])
        break
    
    for organic_result in results.get("organic_results", []):
        data.append({"Link": organic_result.get("link")})
                    
    page_num += 1
    print(page_num)

    if params['_pgn'] == limit:
        break
    if "next" in results.get("pagination", {}):
        params['_pgn'] += 1
    else:
        break
      
print(json.dumps(data, indent=2, ensure_ascii=False))

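The organic_results entries also carry the listing title, so titles can be collected in the same loop. A sketch, assuming the field is named title in SerpApi's response:

for organic_result in results.get("organic_results", []):
    data.append({
        "Title": organic_result.get("title"),
        "Link": organic_result.get("link")
    })
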
Output: the same as the bs4 solution.

If you want to know more about website scraping, there is a "13 ways to scrape any public data from any website" blog post.

Answered By: Denis Skopa