scrapy | py4u

Scrapy Playwright Page Method: Prevent timeout error if selector cannot be located

Question: My question concerns Scrapy Playwright: how can I prevent a Spider's Page from crashing if a specific selector cannot be located while a PageMethod is being applied? Below is a Scrapy Spider that uses Playwright to interact …

Total answers: 2

Scrapy : ValueError: XPath error: Invalid expression

Question: I am trying to learn Scrapy for a project. I receive the error ValueError: XPath error: Invalid expression, but I don't understand what is wrong in my script. Here is my script: def parse(self, response): yield { 'user_agent': str(response.request.headers['User-Agent']), 'links': response.xpath('//a[@class="sc-996f251d-0 leAMGT"]/@href').getall() } next = response.xpath('//a[@title="Page suivante"]/') …

Total answers: 1

Scrapy download Images and rename the image as md5 hash

Question: I have a Scrapy spider that scrapes correctly, but I am having issues downloading the images. I want to download the images and rename them to their MD5 hash, for example: c69/96d/f0d/c6996df0d9d852f1f39fcb7074ace625.jpg. I'd also like to add the …

Total answers: 1

What's the difference between scrapy.cmdline.execute and executing a shell command, when running a scrapy spider in a python script?

Question: When I want to run a Scrapy spider, I could do it by calling either scrapy.cmdline.execute(['scrapy', 'crawl', 'myspider']) or os.system('scrapy crawl myspider') or subprocess.run(['scrapy', 'crawl', 'myspider']). My question is: Why would I prefer to use …

Total answers: 1

Scraping h5 header text in between div tags

Question: I am trying to scrape product prices from this website. How would I go about getting the text value inside an h5 heading between div tags? HTML: <div class="product-item"> <a href="/product-catalogue?pid=6963"> <div class="list-item-image"> <img src="https://app.digitalconcept.mn/upload/media/product/0001/05/thumb_4760_product_thumb.png" alt="Кофе Bestcup rich creamy 3NI1 1ш"> </div> <h5>Кофе Bestcup …

Total answers: 1

BS Extract all text between two specified keyword

Question: With Python and BeautifulSoup, I need to extract all the text contained between two specified words: blabla text i need blibli. I have managed to extract text inside a DIV or a tag, but not between specific, differing keywords. Thank you for your help. Asked By: steve figueras || Source …

Total answers: 4

Unable to scrape items using scrapy [solved]

Question: I am trying to scrape the name, price, and description of products listed on an online shop. The website link is https://eshop.nomin.mn/n-foods.html. When I look through the HTML code of the page, I find the relevant div class containers, but when I reference them in my code …

Total answers: 2

No impact of payload on scrappy request

Question: I am facing a strange issue. url = ["https://nr.aws-achat.info/_extranet/index.cfm?fuseaction=mEnt.lister"] payload = { 'rechInputCPV': '03000000-1', 'rechInputMetier': '', 'texte': '', 'btnSub': 'Afficher' } yield scrapy.Request(url[0], method='POST', body=json.dumps(payload), callback=self.parse) With the Scrapy request above, the response is the same as if I passed a blank dict as the payload. Expectation: if I pass 'rechInputCPV': '03000000-1', I should get 60 rows of …

Total answers: 1

CrawlerProcess – run from manager and get stats from Spider

Question: I'm trying to create a manager for my spiders and record the stats from each crawl job to a SQLite db. Unfortunately, I can't manage to run the crawlers with CrawlerProcess from a separate Python script. I've been looking for possible answers but there …

Total answers: 1

Multiple span tag under one parent DIV id always returns first record

Question: I have multiple span tags with the same class name under one parent div id, but the BeautifulSoup item loop always returns only the first attribute; the rest of the attributes are not printed. Note: All of my span class names are same as …

Total answers: 1