python web crawler with thread support

Question:

These days I'm writing a web crawler script, but one problem is that my internet connection is very slow.
So I was wondering whether it's possible to make the crawler multithreaded, using mechanize or urllib or something similar.
If anyone has experience with this, any info you can share is much appreciated.
I looked around on Google but didn't find much useful information.
Thanks in advance

Asked By: paul


Answers:

There’s a good, simple example on this Stack Overflow thread.
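The linked example isn't reproduced here, but a typical threaded fetcher along those lines looks roughly like this: a fixed pool of worker threads pulling URLs from a queue and fetching them with urllib (a sketch, not the exact code from that thread; the URL list and thread count are arbitrary).

```python
import threading
import queue
import urllib.request

NUM_WORKERS = 4
url_queue = queue.Queue()


def worker():
    while True:
        url = url_queue.get()
        if url is None:               # sentinel: no more work for this thread
            url_queue.task_done()
            break
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                data = resp.read()
            print(f"{url}: {len(data)} bytes")
        except Exception as exc:
            print(f"{url}: failed ({exc})")
        finally:
            url_queue.task_done()


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for url in ["http://example.com/", "http://example.org/"]:
    url_queue.put(url)
for _ in threads:
    url_queue.put(None)               # one sentinel per worker

url_queue.join()
for t in threads:
    t.join()
```

The queue lets the workers share one list of pending URLs safely, and the sentinel values give each thread a clean way to shut down once the queue is drained.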

Answered By: Alex Martelli

Making multiple requests to many websites at the same time will certainly speed up your crawl, since you don’t have to wait for one response to arrive before sending the next request.

However, threading is only one way to do that (and a poor one, I might add). Don’t use threading for this. Just don’t wait for the response before sending another request! No threads are needed for that.
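One way to send requests without waiting, and without threads, is non-blocking I/O. Here is a minimal sketch in modern Python using the standard asyncio streams to issue raw HTTP GET requests (not the answerer's original code; the URLs are placeholders, and a real crawler would also need redirects, HTTPS, politeness delays, and so on):

```python
import asyncio
from urllib.parse import urlsplit


async def fetch(url):
    parts = urlsplit(url)
    host = parts.hostname
    path = parts.path or "/"
    # Plain TCP connection to port 80 (no TLS in this sketch).
    reader, writer = await asyncio.open_connection(host, 80)
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n\r\n"
    )
    writer.write(request.encode("ascii"))
    await writer.drain()
    body = await reader.read()          # read until the server closes
    writer.close()
    await writer.wait_closed()
    return url, len(body)


async def main(urls):
    # All requests are in flight at once; we never block waiting for
    # one response before sending the next request.
    results = await asyncio.gather(*(fetch(u) for u in urls))
    for url, size in results:
        print(f"{url}: {size} bytes")


if __name__ == "__main__":
    asyncio.run(main(["http://example.com/", "http://example.org/"]))
```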

A good idea is to use Scrapy. It is a fast, high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It is written in Python and can open many concurrent connections to fetch data at the same time (without using threads to do so). You can also study it to see how it is implemented.
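For a sense of what that looks like, here is a minimal Scrapy spider sketch (it assumes Scrapy is installed; the start URL is a placeholder). It yields the page title and follows any links it finds, and Scrapy schedules those follow-up requests concurrently with no explicit threading:

```python
import scrapy


class TitleSpider(scrapy.Spider):
    name = "title_spider"
    start_urls = ["http://example.com/"]      # placeholder start page

    def parse(self, response):
        # Extract structured data from the page.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow links found on the page; Scrapy handles the
        # concurrent fetching for us.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

It can be run with `scrapy runspider title_spider.py`.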

Answered By: nosklo