Scrapy Crawl (referer: None) ['partial']
Question:
I am new to Scrapy and Python. I am trying to scrape data from www.freepatentsonline.com. Here is my code:
class FreePatentSpider(scrapy.Spider):
    name = 'freepatent'
    allowed_domains = ['freepatentsonline.com']
    search_value = 'laptop'
    start_urls = [f'https://www.freepatentsonline.com/result.html?sort=relevance&srch=top&query_txt={search_value}&submit=&patents_us=on']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def request_header(self):
        yield scrapy.Request(url=self.start_urls, callback=self.parse, headers={'User-Agent':self.user_agent})

    def parse(self, response):
        for data in response.xpath("//table[@class='listing_table']/tbody/tr/td/a"):
            title = data.xpath(".//text()").get()
            related_link = data.xpath(".//@href").get()
            absolute_url = f"https://www.freepatentsonline.com{related_link}"
            yield {
                'title': title,
                'related_link': related_link,
                'absolute_url': absolute_url
            }
I am getting:
2023-01-17 20:00:41 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-01-17 20:00:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.freepatentsonline.com/result.html?sort=relevance&srch=top&query_txt=laptop&submit=&patents_us=on> (referer: None) ['partial']
2023-01-17 20:00:42 [scrapy.core.engine] INFO: Closing spider (finished)
The debug log shows a crawl status of 200, but I don't know why no data is being scraped.
Can you please help me?
Answers:
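The request_header method isn't doing anything, so you can remove it: Scrapy never calls a method with that name, so the request it builds is never sent, and a user_agent attribute set on the spider class is already picked up by Scrapy's default user-agent middleware. It also looks like the table you are trying to scrape doesn't have a <tbody> element in the HTML that the server returns, which is why your XPath matches nothing and you get no results. Browsers often insert <tbody> when they render a table, so a path copied from the developer tools can differ from the raw response.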
Try this:
import scrapy


class FreePatentSpider(scrapy.Spider):
    name = 'freepatent'
    allowed_domains = ['freepatentsonline.com']
    search_value = 'laptop'
    start_urls = [f'https://www.freepatentsonline.com/result.html?sort=relevance&srch=top&query_txt={search_value}&submit=&patents_us=on']
    # Scrapy uses this class attribute as the User-Agent header for the spider's requests.
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'

    def parse(self, response):
        # No /tbody/ step: // matches the links at any depth below the table.
        for data in response.xpath("//table[@class='listing_table']//td//a"):
            title = data.xpath(".//text()").get()
            related_link = data.xpath(".//@href").get()
            absolute_url = f"https://www.freepatentsonline.com{related_link}"
            yield {
                'title': title,
                'related_link': related_link,
                'absolute_url': absolute_url
            }
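To see why the /tbody/ step matters, here is a minimal sketch that runs the two paths against a small, made-up table. It uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, only so that it runs without a crawl; the markup, link targets, and titles are hypothetical and are not taken from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: a listing table with no <tbody>,
# standing in for the raw HTML the server returns.
RAW_HTML = """
<html><body>
<table class="listing_table">
  <tr><td>1</td><td><a href="/1234567.html">Laptop stand</a></td></tr>
  <tr><td>2</td><td><a href="/7654321.html">Laptop hinge</a></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# The question's path requires a <tbody> step, so it matches nothing here.
strict = root.findall(".//table[@class='listing_table']/tbody/tr/td/a")

# The answer's path skips intermediate levels with //, so it still matches.
relaxed = root.findall(".//table[@class='listing_table']//td//a")

print(len(strict), len(relaxed))
for link in relaxed:
    print(link.text, link.get("href"))
```

In a real project, running the same two expressions with response.xpath(...) inside scrapy shell against the live URL is the quickest way to confirm which one matches the response Scrapy actually receives.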