Scrapy Returning Data Outside of Specified Elements
Question:
I am trying to scrape the names of players from this page: https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard
To do that I first get the tables containing the batting scorecards:
batting_scorecard = response.xpath("//table[@class='ds-w-full ds-table ds-table-md ds-table-auto ci-scorecard-table']")
Then I try to get the player names:
batting_scorecard.xpath("//a[contains(@href,'/player/')]/span/span/text()").getall()
This returns a list that contains all the player names (as well as some rubbish to be parsed) but it also contains names of players/umpires/referees who are not in the specified tables.
In the list below, ‘Luke Wood’ (last occurrence), ‘Aleem Dar’, ‘Asif Yaqoob’, ‘Ahsan Raza’, ‘Rashid Riaz’ and ‘Muhammad Javed’ should not be returned, as they are in a different table. The batting_scorecard tables have class "ds-w-full ds-table ds-table-md ds-table-auto ci-scorecard-table", whereas this data is in a table with class "ds-w-full ds-table ds-table-sm ds-table-auto ".
Can anyone see what the problem is?
['Mohammad Rizwan',
'\xa0',
'Babar Azam',
'\xa0',
'Haider Ali',
'\xa0',
'Shan Masood',
'\xa0',
'Iftikhar Ahmed',
'\xa0',
'Mohammad Nawaz',
'\xa0',
'Khushdil Shah',
'\xa0',
'Naseem Shah',
'\xa0',
'Usman Qadir',
'\xa0',
'Haris Rauf',
',',
'\xa0',
'Shahnawaz Dahani',
'\xa0',
'Phil Salt',
'\xa0',
'Alex Hales',
'\xa0',
'Dawid Malan',
'\xa0',
'Ben Duckett',
'\xa0',
'Harry Brook',
'\xa0',
'Moeen Ali',
'\xa0',
'Sam Curran',
',',
'\xa0',
'David Willey',
',',
'\xa0',
'Adil Rashid',
',',
'\xa0',
'Luke Wood',
',',
'\xa0',
'Richard Gleeson',
'\xa0',
'Luke Wood',
'Aleem Dar',
'Asif Yaqoob',
'Ahsan Raza',
'Rashid Riaz',
'Muhammad Javed',
'Mohammad Rizwan',
'\xa0',
'Babar Azam',
'\xa0',
'Haider Ali',
'\xa0',
'Shan Masood',
'\xa0',
'Iftikhar Ahmed',
'\xa0',
'Mohammad Nawaz',
'\xa0',
'Khushdil Shah',
'\xa0',
'Naseem Shah',
'\xa0',
'Usman Qadir',
'\xa0',
'Haris Rauf',
',',
'\xa0',
'Shahnawaz Dahani',
'\xa0',
'Phil Salt',
'\xa0',
'Alex Hales',
'\xa0',
'Dawid Malan',
'\xa0',
'Ben Duckett',
'\xa0',
'Harry Brook',
'\xa0',
'Moeen Ali',
'\xa0',
'Sam Curran',
',',
'\xa0',
'David Willey',
',',
'\xa0',
'Adil Rashid',
',',
'\xa0',
'Luke Wood',
',',
'\xa0',
'Richard Gleeson',
'\xa0',
'Luke Wood',
'Aleem Dar',
'Asif Yaqoob',
'Ahsan Raza',
'Rashid Riaz',
'Muhammad Javed']
Answers:
Change your selector to:
batting_scorecard.xpath(".//a[contains(@href,'/player/')]/span/span/text()").getall()
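The leading dot makes the XPath relative to the current element, so the search is limited to the descendants of each selected table. An expression that starts with // is absolute: it searches the whole document, even when you call it on a sub-selection such as batting_scorecard.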
import scrapy
from scrapy.crawler import CrawlerProcess


class TestSpider(scrapy.Spider):
    name = 'test'
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36',
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard',
            callback=self.parse,
        )

    def parse(self, response):
        # Rows of the batting scorecard tables only; [::2] keeps every other row.
        for player in response.xpath('//table[@class="ds-w-full ds-table ds-table-md ds-table-auto ci-scorecard-table"]//tbody//tr')[::2]:
            yield {
                # The leading dot keeps the query relative to the current row.
                'Name': ''.join(player.xpath('.//td[1]/a//text()').getall()).replace('\xa0', ''),
            }


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(TestSpider)
    process.start()
Output:
{'Name': 'Mohammad Rizwan†'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Babar Azam(c)'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Haider Ali'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Shan Masood'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Iftikhar Ahmed'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Mohammad Nawaz'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Khushdil Shah'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': ''}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Phil Salt†'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Alex Hales'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Dawid Malan'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Ben Duckett'}
2022-09-27 15:38:34 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.espncricinfo.com/series/england-in-pakistan-2022-1327226/pakistan-vs-england-1st-t20i-1327228/full-scorecard>
{'Name': 'Harry Brook'}
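The output above still contains empty names and the (c) and † markers. If you only want the bare names, a small post-processing step can handle that. This is a sketch: clean_name is a hypothetical helper, and the sample list is copied by hand from the output above.

```python
def clean_name(raw: str) -> str:
    """Remove non-breaking spaces, trailing commas and the (c) / † markers."""
    name = raw.replace("\xa0", " ")
    for marker in ("(c)", "†"):
        name = name.replace(marker, "")
    return name.strip().rstrip(",").strip()


# Sample values taken from the scraped output, including an empty row.
scraped = ["Mohammad Rizwan†", "Babar Azam(c)", "", "Haider Ali"]

# Clean every value, then drop the rows that had no player link.
names = [cleaned for cleaned in map(clean_name, scraped) if cleaned]

print(names)  # ['Mohammad Rizwan', 'Babar Azam', 'Haider Ali']
```

The same function could be called inside parse before yielding, with rows that clean to an empty string skipped.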