web-crawler

How can I crawl the product items from shopee website?

Question: I am trying to use Python to get product information such as name and price, but this time it doesn’t work. I even checked the HTML in the browser’s developer mode to find the class name and tried to use that name to get anything …

Total answers: 1

Scrapy CrawlerRunner: Output missing

Question: I have been using the method described on Stack Overflow (https://stackoverflow.com/a/43661172/5037146) to run Scrapy from a script with CrawlerRunner, so that the process can be restarted. However, I don’t get any console logs when running the process through CrawlerRunner, whereas CrawlerProcess outputs the status and progress. …

Total answers: 2

Crawling IMDB for movie trailers?

Question: I want to crawl IMDB and download the trailers of movies (either from YouTube or IMDB) that fit some criteria (e.g., released this year, with a rating above 2). I want to do this in Python – I saw that there were packages for crawling IMDB and downloading YouTube …

Total answers: 3

Scrapy – Reactor not Restartable

Question: With:

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerProcess

I’ve always run this process successfully:

    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()

but since I’ve moved this code into a web_crawler(self) function, like so:

    def web_crawler(self):
        # set up a …

Total answers: 6

getting Forbidden by robots.txt: scrapy

Question: While crawling a website such as https://www.netflix.com, I get "Forbidden by robots.txt: <GET https://www.netflix.com/>" followed by "ERROR: No response downloaded for: https://www.netflix.com/". Asked by: deepak kumar. Answers: The first thing to ensure is that you change the user agent in the request; otherwise the default user agent will be blocked for sure. Answered …

Total answers: 3
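
The "Forbidden by robots.txt" message above means the crawler consulted the site's robots.txt and found the URL disallowed for its user agent (in Scrapy this check is controlled by the `ROBOTSTXT_OBEY` setting). As a minimal sketch of how such a rule set is evaluated, the standard-library `urllib.robotparser` can be fed a robots.txt offline. The rules, the `FriendlyBot` agent name, and the `example.com` URL below are made up for illustration and are not Netflix's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: everything is disallowed by default,
# but one named agent is allowed everywhere.
sample_rules = """\
User-agent: *
Disallow: /

User-agent: FriendlyBot
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)  # parse in-memory lines, no network access

# An agent with no dedicated entry falls back to the "*" rules.
default_allowed = parser.can_fetch("Scrapy", "https://example.com/")
# The named agent matches its own entry.
friendly_allowed = parser.can_fetch("FriendlyBot", "https://example.com/")

print("default agent allowed:", default_allowed)
print("FriendlyBot allowed:", friendly_allowed)
```

Whether changing the user agent helps on a real site depends on that site's rules, so a check like this is only a way to see what the file says for a given agent.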

Scraping javascript rendered HTML page in python

Question: I am scraping a website using Python, but the website is rendered with JavaScript and all the links come from JavaScript. So when I use request.get(url), it only gives me the source code, not the other links that are generated with JavaScript. Is there any way …

Total answers: 2

Why do I get a "Connection aborted" error when trying to crawl a specific website?

Question: I wrote a web crawler in Python 2.7, but a specific site cannot be downloaded although it can be viewed in a browser. My code is as follows:

    # -*- coding: utf-8 -*-
    import requests

    # OK
    url = 'http://blog.ithome.com.tw/'
    …

Total answers: 2

Passing arguments to process.crawl in Scrapy python

Question: I would like to get the same result as this command line:

    scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

    import scrapy
    from linkedin_anonymous_spider import LinkedInAnonymousSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    spider = LinkedInAnonymousSpider(None, "James", "Bond")
    …

Total answers: 3

how to extract asin from an amazon product page

Question: I have the following webpage (Product page) and I’m trying to get the ASIN from it (in this case ASIN=B014MHZ90M), but I don’t have a clue how to get it from the page. I’m using Python 3.4, Scrapy, and the following code: hxs = …

Total answers: 5
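
One possible way to approach the ASIN question, sketched here under assumptions rather than taken from the answers: if the product URL contains a `/dp/` or `/gp/product/` segment followed by a 10-character code, a regular expression on the URL may be enough. That URL layout, the `extract_asin` helper, and the example URLs are all assumptions for illustration; only the ASIN value B014MHZ90M comes from the question.

```python
import re

# Assumed URL layout: ".../dp/<ASIN>" or ".../gp/product/<ASIN>", where the
# ASIN is 10 uppercase letters or digits followed by "/", "?" or end of string.
ASIN_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url):
    """Return the ASIN found in the URL, or None if the pattern does not match."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


# Hypothetical URLs built around the ASIN mentioned in the question.
print(extract_asin("https://www.amazon.com/dp/B014MHZ90M"))
print(extract_asin("https://www.amazon.com/some-title/dp/B014MHZ90M/ref=sr_1_1?x=1"))
print(extract_asin("https://www.amazon.com/gp/help/customer"))
```

If the URL does not carry the ASIN, the value would have to be read from the page body instead, which this sketch does not cover.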

TypeError: can't use a string pattern on a bytes-like object in re.findall()

Question: I am trying to learn how to automatically fetch URLs from a page. In the following code I am trying to get the title of the webpage:

    import urllib.request
    import re

    url = "http://www.google.com"
    regex = r'<title>(,+?)</title>'
    pattern = re.compile(regex)
    with urllib.request.urlopen(url) …

Total answers: 4
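
The error in the title comes from applying a `str` pattern to `bytes`, which is what `urlopen(...).read()` returns in Python 3. A minimal sketch of the usual remedy, decoding the bytes before matching, is shown below. The sample HTML stands in for a downloaded page so that no network access is needed, the `find_titles` helper is hypothetical, and the pattern uses `.+?`, which is an assumption about what the asker intended rather than what the excerpt shows.

```python
import re

# Stand-in for the bytes that urllib.request.urlopen(url).read() would return.
html_bytes = b"<html><head><title>Example Page</title></head></html>"

# A str pattern; it can only be applied to str input.
title_pattern = re.compile(r"<title>(.+?)</title>")


def find_titles(raw, encoding="utf-8"):
    """Decode raw bytes to str, then apply the str pattern."""
    text = raw.decode(encoding, errors="replace")
    return title_pattern.findall(text)


# Applying the str pattern directly to bytes reproduces the TypeError.
error_message = None
try:
    title_pattern.findall(html_bytes)
except TypeError as exc:
    error_message = str(exc)

titles = find_titles(html_bytes)
print("error without decoding:", error_message)
print("titles after decoding:", titles)
```

An alternative with the same effect is to compile a bytes pattern (for example `rb"<title>(.+?)</title>"`) and keep the data as bytes; the matches are then bytes as well.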