I can't get values page by page with a for-in loop

Question:

As the title says, I can get the values on the first page only; I can't get values page by page with a for-in loop.
I've checked my code, but I'm still confused by it. How can I get those values on every page?

# Imports Required
!pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import requests
from bs4 import BeautifulSoup

browser = webdriver.Chrome(executable_path='./chromedriver.exe')
wait = WebDriverWait(browser,5)
output = list()
for i in range(1,2): 
    browser.get("https://www.rakuten.com.tw/shop/watsons/product/?l-id=tw_shop_inshop_cat&p={}".format(i))
    
    # Wait Until the product appear
    wait.until(EC.presence_of_element_located((By.XPATH,"//div[@class='b-content b-fix-2lines']")))

    # Get the products link
    product_links = browser.find_elements(By.XPATH,"//div[@class='b-content b-fix-2lines']/b/a")
    
    # Iterate over 'product_links' to get all the 'href' values
  
    for link in (product_links):
        print(link.get_attribute('href'))
        browser.get(link.get_attribute('href'))
        soup = BeautifulSoup(browser.page_source, "html.parser")
        products = []
        product = {}
        product['商品名稱'] = soup.find('div', class_="b-subarea b-layout-right shop-item ng-scope").h1.text.replace('\n', '')
        product['價錢'] = soup.find('strong', class_="b-text-xlarge qa-product-actualPrice").text.replace('\n', '')
        all_data=soup.find_all("div",class_="b-container-child")[2]
        main_data=all_data.find_all("span")[-1]
        product['購買次數'] = main_data.text
        products.append(product)
        print(products)

Asked By: 鄭鼎彥


Answers:

    product_links = browser.find_elements(By.XPATH,"//div[@class='b-content b-fix-2lines']/b/a")
    
    # Iterate over 'product_links' to get all the 'href' values
  
    for link in (product_links):
        print(link.get_attribute('href'))
        browser.get(link.get_attribute('href'))

The problem is that when you call browser.get(), the elements in product_links become stale: they refer to nodes in a page that is no longer loaded, so the next loop iteration raises a StaleElementReferenceException. Instead, collect all of the 'href' attributes into a list of plain strings first. One way is with a list comprehension:

links = [link.get_attribute('href') for link in product_links]

Now you can loop over the strings in links to load new pages.
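For example, the inner loop from the question could be restructured like this (a minimal sketch; the XPath and parsing step are carried over from the question):

product_links = browser.find_elements(By.XPATH, "//div[@class='b-content b-fix-2lines']/b/a")

# Grab the href strings while the listing page is still loaded.
links = [link.get_attribute('href') for link in product_links]

# Iterating over plain strings is safe: navigating away cannot invalidate them.
for href in links:
    browser.get(href)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    # ... parse the product page as before ...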

With that said, you should look at the Scrapy library, which can do a lot of this heavy lifting for you.
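For illustration, a listing-plus-pagination crawl like this one might look as follows in Scrapy. This is only a sketch: the spider name, start URL, and selectors are assumptions reused from the question, not tested against the live site.

import scrapy

class WatsonsSpider(scrapy.Spider):
    name = "watsons"
    start_urls = ["https://www.rakuten.com.tw/shop/watsons/product/?p=1"]

    def parse(self, response):
        # Follow every product link on the listing page; Scrapy schedules
        # the requests and handles retries and throttling for you.
        for href in response.xpath("//div[@class='b-content b-fix-2lines']/b/a/@href").getall():
            yield response.follow(href, callback=self.parse_product)

    def parse_product(self, response):
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
        }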

Answered By: Code-Apprentice

You can scrape this website with the BeautifulSoup web scraping library alone, without Selenium; it will be much faster than launching a whole browser.

Problems can arise when requesting the site because it may decide the request comes from a bot. To prevent this, send headers that contain a user-agent with the request; the site will then treat you as a regular user and return the page.

In particular, the request might be blocked if you use the requests library with its default user-agent, which is python-requests.

An additional step could be to rotate the user-agent, for example switching between PC, mobile, and tablet strings, as well as between browsers such as Chrome, Firefox, Safari, and Edge.
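A minimal sketch of that idea, assuming a small hand-maintained pool of user-agent strings (the strings below are just examples):

import random
import requests

# A small pool of desktop and mobile user-agent strings to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148",
]

def fetch(url):
    # Each request goes out with a randomly chosen user-agent.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=30)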

The code below extracts data from all pages without hard-coding page numbers:

from bs4 import BeautifulSoup
import requests
import json  # the "lxml" parser used below also requires lxml to be installed


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}

data = []
page_num = 1

while True:
    html = requests.get(f"https://www.rakuten.com.tw/shop/watsons/product/?p={page_num}", headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    print(f"Extracting page: {page_num}")

    print("-" * 10)

    for result in soup.select(".b-item"):
        title = result.select_one(".product-name").text.strip()
        price = result.select_one(".b-underline").text.strip()

        data.append({
          "title" : title,
          "price" : price
        })

    # Dump what has been collected so far after each page, so the
    # final page's items are printed too.
    print(json.dumps(data, indent=2, ensure_ascii=False))

    # Keep paginating while a "next page" arrow is present.
    if soup.select_one(".arrow-right-icon"):
        page_num += 1
    else:
        break

Example output

Extracting page: 1
----------
[
  {
    "title": "桂格無糖養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "DR.WU杏仁酸溫和煥膚精華15ML",
    "price": "800 元"
  },
  {
    "title": "桂格養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "天地合補高單位葡萄糖胺飲60mlx18入",
    "price": "939 元"
  },
  {
    "title": "幫寶適超薄乾爽XL號紙尿褲尿布136片裝(68片/包)",
    "price": "1,189 元"
  },
  {
    "title": "耶歐雙氧保養液360ml*3網路獨家品",
    "price": "699 元"
  },
  {
    "title": "得意抽取式花紋衛生紙100抽10包7串(箱)",
    "price": "689 元"
  },
  {
    "title": "老協珍熬雞精14入",
    "price": "1,588 元"
  },
  {
    "title": "桂格活靈芝盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "善存葉黃素20mg 60錠",
    "price": "689 元"
  },
  {
    "title": "桂格養氣人蔘雞精-雙效滋補盒裝18瓶",
    "price": "799 元"
  },
  {
    "title": "天地合補含鐵玫瑰四物飲12入",
    "price": "585 元"
  },
  {
    "title": "好立善葉黃素軟膠囊30粒",
    "price": "199 元"
  },
  {
    "title": "全久榮75度防疫酒精350ml",
    "price": "45 元"
  },
  {
    "title": "白蘭氏雙認證雞精12入",
    "price": "699 元"
  },
  {
    "title": "保麗淨-假牙黏著劑  無味70g",
    "price": "296 元"
  },
  {
    "title": "義美生醫常順軍益生菌-30入",
    "price": "680 元"
  },
  {
    "title": "克補+鋅加強錠-禮盒(60+30錠) 2入組合",
    "price": "1,249 元"
  },
  {
    "title": "康乃馨寶寶潔膚濕巾超厚型80片2包(屈臣氏獨家)",
    "price": "69 元"
  },
  {
    "title": "天地合補青木瓜四物飲120ml*12瓶入",
    "price": "579 元"
  }
]
Extracting page: 2
----------
[
  {
    "title": "桂格無糖養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "DR.WU杏仁酸溫和煥膚精華15ML",
    "price": "800 元"
  },
  {
    "title": "桂格養氣人蔘盒裝19瓶",
    "price": "989 元"
  },
  {
    "title": "天地合補高單位葡萄糖胺飲60mlx18入",
    "price": "939 元"
  },
  {
    "title": "幫寶適超薄乾爽XL號紙尿褲尿布136片裝(68片/包)",
    "price": "1,189 元"
  },
      # ...
]
Answered By: Denis Skopa