Scrapy Returning Data Outside of Specified Elements

Question:

I am trying to scrape the names of players from this page: https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard

To do that I first get the tables containing the batting scorecards:

batting_scorecard = response.xpath("//table[@class='ds-w-full ds-table ds-table-md ds-table-auto  ci-scorecard-table']")

Then I try to get the player names:

batting_scorecard.xpath("//a[contains(@href,'/player/')]/span/span/text()").getall()

This returns a list that contains all the player names (as well as some rubbish to be parsed) but it also contains names of players/umpires/referees who are not in the specified tables.

In the list below ‘Luke Wood’ (last occurrence), ‘Aleem Dar’, ‘Asif Yaqoob’, ‘Ahsan Raza’, ‘Rashid Riaz’, ‘Muhammad Javed’ should not be returned as they are in a different table. The batting_scorecard tables have class "ds-w-full ds-table ds-table-md ds-table-auto ci-scorecard-table" whereas this data is in a table with class "ds-w-full ds-table ds-table-sm ds-table-auto ".

Can anyone see what the problem is?

['Mohammad Rizwan',
 '\xa0',
 'Babar Azam',
 '\xa0',
 'Haider Ali',
 '\xa0',
 'Shan Masood',
 '\xa0',
 'Iftikhar Ahmed',
 '\xa0',
 'Mohammad Nawaz',
 '\xa0',
 'Khushdil Shah',
 '\xa0',
 'Naseem Shah',
 '\xa0',
 'Usman Qadir',
 '\xa0',
 'Haris Rauf',
 ',',
 '\xa0',
 'Shahnawaz Dahani',
 '\xa0',
 'Phil Salt',
 '\xa0',
 'Alex Hales',
 '\xa0',
 'Dawid Malan',
 '\xa0',
 'Ben Duckett',
 '\xa0',
 'Harry Brook',
 '\xa0',
 'Moeen Ali',
 '\xa0',
 'Sam Curran',
 ',',
 '\xa0',
 'David Willey',
 ',',
 '\xa0',
 'Adil Rashid',
 ',',
 '\xa0',
 'Luke Wood',
 ',',
 '\xa0',
 'Richard Gleeson',
 '\xa0',
 'Luke Wood',
 'Aleem Dar',
 'Asif Yaqoob',
 'Ahsan Raza',
 'Rashid Riaz',
 'Muhammad Javed',
 'Mohammad Rizwan',
 '\xa0',
 'Babar Azam',
 '\xa0',
 'Haider Ali',
 '\xa0',
 'Shan Masood',
 '\xa0',
 'Iftikhar Ahmed',
 '\xa0',
 'Mohammad Nawaz',
 '\xa0',
 'Khushdil Shah',
 '\xa0',
 'Naseem Shah',
 '\xa0',
 'Usman Qadir',
 '\xa0',
 'Haris Rauf',
 ',',
 '\xa0',
 'Shahnawaz Dahani',
 '\xa0',
 'Phil Salt',
 '\xa0',
 'Alex Hales',
 '\xa0',
 'Dawid Malan',
 '\xa0',
 'Ben Duckett',
 '\xa0',
 'Harry Brook',
 '\xa0',
 'Moeen Ali',
 '\xa0',
 'Sam Curran',
 ',',
 '\xa0',
 'David Willey',
 ',',
 '\xa0',
 'Adil Rashid',
 ',',
 '\xa0',
 'Luke Wood',
 ',',
 '\xa0',
 'Richard Gleeson',
 '\xa0',
 'Luke Wood',
 'Aleem Dar',
 'Asif Yaqoob',
 'Ahsan Raza',
 'Rashid Riaz',
 'Muhammad Javed']
Asked By: Andy


Answers:

Change your selector to:

batting_scorecard.xpath(".//a[contains(@href,'/player/')]/span/span/text()").getall()

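The leading dot makes the XPath relative, so it searches only inside the element(s) already selected in batting_scorecard. An expression that starts with // is always evaluated from the document root, even when it is called on a sub-selector, which is why names from other tables were returned.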

Answered By: Barry the Platipus
import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard',
            callback=self.parse,
        )

    def parse(self, response):
        # Select rows of the batting scorecard tables only; [::2] keeps every second row.
        for player in response.xpath('//table[@class="ds-w-full ds-table ds-table-md ds-table-auto  ci-scorecard-table"]//tbody//tr')[::2]:
            yield {
                # The leading dot keeps the search inside the current row.
                # '\xa0' is the non-breaking space that follows each name.
                'Name': ''.join(player.xpath('.//td[1]/a//text()').getall()).replace('\xa0', '')
            }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()

Output:

{'Name': 'Mohammad Rizwan†'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Babar Azam(c)'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Haider Ali'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Shan Masood'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Iftikhar Ahmed'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Mohammad Nawaz'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Khushdil Shah'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Phil Salt†'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Alex Hales'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Dawid Malan'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Ben Duckett'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Harry Brook'}
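
The output above still carries the '†' and '(c)' markers and a few empty names. A possible clean-up step, sketched here with a hypothetical helper clean_name and made-up sample strings modelled on that output, could look like this. It is an assumption about what a tidy result should be, not part of the spider above.

```python
import re


def clean_name(raw: str) -> str:
    """Remove non-breaking spaces, commas and the '(c)' / '†' markers."""
    text = raw.replace("\xa0", " ")
    text = re.sub(r"\(c\)|†|,", "", text)
    # Collapse any leftover whitespace and trim the ends.
    return " ".join(text.split())


# Sample values shaped like the scraped strings (assumed, not fetched live).
raw_names = [
    "Mohammad Rizwan\xa0†",
    "Babar Azam\xa0(c)",
    "Haider Ali\xa0",
    "",
    "Sam Curran,\xa0",
]

# Clean every entry and drop the ones that end up empty.
names = [name for name in (clean_name(raw) for raw in raw_names) if name]
print(names)
```

In the spider, the same helper could be applied to the joined text before yielding, skipping the row when the cleaned name is empty.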
Answered By: F.Hoque