Difference between BeautifulSoup and Scrapy crawler?

Question:

I want to make a website that compares Amazon and eBay product prices.
Which of these will work better, and why? I am somewhat familiar with BeautifulSoup but not so much with Scrapy.

Asked By: Nishant Bhakta

||

Answers:

I think both are good. I'm working on a project right now that uses both. First I scrape all the pages with Scrapy and save them to a MongoDB collection using its pipelines, also downloading the images that exist on each page.
After that I use BeautifulSoup4 for post-processing, where I need to change attribute values and extract some special tags.

If you don't know in advance which product pages you want, Scrapy is a good tool, since its crawlers can follow links across the whole Amazon/eBay site looking for products without you writing an explicit for loop.

Take a look at the Scrapy documentation; it's very simple to use.

Answered By: rdenadai

Scrapy is a web-spider (web scraper) framework. You give Scrapy a root URL to start crawling from, and you can then specify constraints such as how many URLs to crawl and fetch. It is a complete framework for web scraping and crawling.

By contrast, BeautifulSoup is a parsing library. It does not fetch pages itself: you download the content of a URL with an HTTP client such as requests or urllib, and BeautifulSoup lets you pull out the parts you want without any hassle. It handles only the page you give it and then stops. It does not crawl unless you put it inside a loop of your own with some stopping criteria.

In simple words, with Beautiful Soup you can build something similar to Scrapy.
Beautiful Soup is a library while Scrapy is a complete framework.
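
To make the "library" side concrete, here is a small sketch of Beautiful Soup parsing markup that is already in memory. The HTML snippet and its class names are invented for the example; in a real project the string would come from an HTTP client.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for a downloaded product listing page.
html = """
<html><body>
  <div class="product">
    <h2>Kettle</h2>
    <span class="price">$19.99</span>
  </div>
  <div class="product">
    <h2>Toaster</h2>
    <span class="price">$24.50</span>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# One dict per product block: Beautiful Soup only parses, it never fetches.
products = [
    {
        "title": div.h2.get_text(strip=True),
        "price": div.select_one("span.price").get_text(strip=True),
    }
    for div in soup.find_all("div", class_="product")
]
print(products)
```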

Source

Answered By: Medeiros

The way I do it is to use the eBay/Amazon APIs rather than Scrapy, and then parse the results using BeautifulSoup.

The APIs give you an official way of getting the same data that you would have got from a Scrapy crawler, with no need to worry about hiding your identity, messing about with proxies, etc.

Answered By: baldnbad

Both are used to parse data.

Scrapy:

  • Scrapy is a fast high-level web crawling and web scraping framework,
    used to crawl websites and extract structured data from their pages.
  • But it has some limitations when data comes from JavaScript or
    loads dynamically. We can overcome this by using packages like Splash,
    Selenium, etc.

BeautifulSoup:

  • Beautiful Soup is a Python library for pulling data out of HTML and
    XML files.

  • On its own it cannot run JavaScript, but we can use this package to
    parse JavaScript-generated or dynamically loaded pages once a browser
    tool such as Selenium has rendered them.

Scrapy with BeautifulSoup is one of the best combos we can work with for scraping static and dynamic content.
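
One caveat on dynamic pages is worth showing: Beautiful Soup parses the markup it is handed and does not execute JavaScript. In the invented page below, the price is filled in by a script, so the parsed `div` is still empty. A rendering step (Selenium, Splash, or similar) would have to run first, and its output could then be parsed the same way.

```python
from bs4 import BeautifulSoup

# Invented page: the price only appears after the script runs in a browser.
html = """
<html><body>
  <div id="price"></div>
  <script>
    document.getElementById("price").textContent = "$19.99";
  </script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# The script is never executed, so the div has no text content.
price_text = soup.find(id="price").get_text(strip=True)
print(repr(price_text))
```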

Answered By: Arun Augustine

Using Scrapy you can save tons of code and start with structured programming. If you don't like any of Scrapy's pre-written methods, BeautifulSoup can be used in place of the Scrapy method.
Big projects take advantage of both.

Answered By: ethirajit

The differences are many, and the choice of any tool/technology depends on individual needs.

A few major differences are:

  1. BeautifulSoup is comparatively easier to learn than Scrapy.
  2. The extensions, support, and community are larger for Scrapy than for BeautifulSoup.
  3. Scrapy should be considered a spider, while BeautifulSoup is a parser.

Answered By: krish___na

Scrapy
It is a web scraping framework that comes with tons of goodies which make scraping easier, so that we can focus on the crawling logic only. Some of my favourite things Scrapy takes care of for us are below.

  • Feed exports: It basically allows us to save data in various formats like CSV, JSON, JSON Lines and XML.
  • Asynchronous scraping: Scrapy uses the Twisted framework, which gives us the power to visit multiple URLs at once, with each request processed in a non-blocking way (basically, we don't have to wait for a request to finish before sending another request).
  • Selectors: This is where we can compare Scrapy with Beautiful Soup. Selectors are what allow us to select particular data from the webpage (a heading, a certain div with a class name, etc.). Scrapy uses lxml for parsing, which is much faster than Beautiful Soup.
  • Setting proxy, user agent, headers, etc.: Scrapy allows us to set and rotate proxies and other headers dynamically.
  • Item pipelines: Pipelines enable us to process data after extraction. For example, we can configure a pipeline to push data to our MySQL server.
  • Cookies: Scrapy automatically handles cookies for us.

etc.
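
The item pipeline idea from the list above can be sketched roughly like this. It is only an illustration: `sqlite3` from the standard library stands in for the MySQL server, and the class name, table layout, and item fields (`site`, `title`, `price`) are assumptions. The `open_spider` / `close_spider` / `process_item` methods follow Scrapy's classic pipeline interface.

```python
import sqlite3


class SQLitePricePipeline:
    """Hypothetical pipeline; sqlite3 is a stand-in for a MySQL server."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the crawl starts: open the database.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS prices ("
            "site TEXT, title TEXT, price REAL, PRIMARY KEY (site, title))"
        )

    def close_spider(self, spider):
        # Called once when the crawl ends: flush and close.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Turn a scraped string such as "$1,019.50" into a number, then
        # insert it, replacing any older price for the same product.
        price = float(item["price"].replace("$", "").replace(",", ""))
        self.conn.execute(
            "INSERT OR REPLACE INTO prices (site, title, price) VALUES (?, ?, ?)",
            (item["site"], item["title"], price),
        )
        return item
```

In a Scrapy project a class like this would be switched on through the `ITEM_PIPELINES` setting, for example `{"myproject.pipelines.SQLitePricePipeline": 300}`, where the module path is a placeholder.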

TL;DR: Scrapy is a framework that provides everything one might
need to build large-scale crawls. It provides various features that
hide the complexity of crawling the web. One can simply start writing web
crawlers without worrying about the setup burden.

Beautiful Soup
Beautiful Soup is a Python package for parsing HTML and XML documents. So with Beautiful Soup you can parse a webpage that has already been downloaded. BS4 is very popular and has been around for a long time. Unlike Scrapy, you cannot use Beautiful Soup alone to make crawlers. You will need other libraries like requests or urllib to make crawlers with bs4. Again, this means you would need to manage the list of URLs already crawled and still to be crawled, handle cookies, manage proxies, handle errors, and create your own functions to push data to CSV, JSON, XML, etc. If you want to speed things up, you will have to use other libraries like multiprocessing.
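
To give a feel for the bookkeeping described above, here is a minimal sketch of a hand-rolled crawl loop around bs4. The function name `crawl`, the `fetch` parameter, and the fake site are all invented for the example. The fake `fetch` keeps the sketch runnable offline; in real use it could be something like `lambda url: requests.get(url, timeout=10).text`.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) returns HTML text, or None on failure."""
    allowed_host = urlparse(start_url).netloc
    to_visit = deque([start_url])   # URLs still to be crawled
    seen = {start_url}              # URLs already queued or crawled
    titles = {}

    while to_visit and len(titles) < max_pages:
        url = to_visit.popleft()
        html = fetch(url)
        if html is None:            # crude error handling: skip the page
            continue
        soup = BeautifulSoup(html, "html.parser")
        titles[url] = soup.title.get_text(strip=True) if soup.title else ""
        # Queue same-site links that have not been seen yet.
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == allowed_host and target not in seen:
                seen.add(target)
                to_visit.append(target)
    return titles


# Fake two-page site so the example needs no network access.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a>'
                            '<a href="https://other.example/x">X</a>',
    "https://example.com/a": '<title>Page A</title><a href="/">home</a>',
}
pages = crawl("https://example.com/", FAKE_SITE.get)
print(pages)
```

Cookies, proxies, retries, and concurrency would all still have to be added by hand, which is the part Scrapy provides out of the box.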

To sum up:

  • Scrapy is a rich framework that you can use to start writing crawlers
    without any hassle.

  • Beautiful Soup is a library that you can use to parse a webpage. It
    cannot be used alone to scrape the web.

You should definitely use Scrapy for your Amazon and eBay product price comparison website. You could build a database of URLs, run the crawler every day (cron jobs or Celery for scheduling crawls), and update the prices in your database. This way your website will always pull from the database, and the crawler and database will act as independent components.

Answered By: Amit

BeautifulSoup is a library that lets you extract information from a web page.

Scrapy, on the other hand, is a framework that does the above and many more things you will probably need in your scraping project, like pipelines for saving data.

You can check this blog to get started with Scrapy:
https://www.inkoop.io/blog/web-scraping-using-python-and-scrapy/

Answered By: Jaskaran Singh

BeautifulSoup is a small web scraping library. It does the job, but sometimes it does not satisfy your needs. I mean, if you scrape websites with a large amount of data, BeautifulSoup falls short.

In that case you should use Scrapy, which is a complete scraping framework that will do the job.
Scrapy can also write to databases (of all kinds) through its item pipelines, so that is a huge
advantage of Scrapy over other web scraping libraries.

Answered By: Danish Khan

Long story short:

Scrapy is a multitool. BS4 is a penknife.

Now a list of peculiarities for each one from personal experience:

Scrapy:

  1. heavy
  2. issues with installing dependencies might occur
  3. takes time to master
  4. well supported and documented, always up to date with a large and active community
  5. fast to extract
  6. good for large jobs
  7. has a native cloud (can deploy code into the cloud and forget about it until done)
  8. has a native API
  9. variety of settings and add-ons (middlewares), allows you to fine-tune your code to the slightest details.
  10. highly structured code
  11. easily integrates with residential proxies, and offers middleware for IP rotation.
  12. handy, informative (cloud) interface that is useful for debugging

bs4:

  1. lightweight
  2. fast to install
  3. fast to learn
  4. fast and dirty to code
  5. is suitable for simple tasks
  6. is suitable for testing sites and hypotheses
  7. you can copy a request as curl from Chrome dev tools, convert the curl command to requests, and use the result directly in your code for cookie-dependent sites or complex POST requests.
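
As a sketch of the last point, a request copied from the browser as curl might translate to requests roughly like this. The URL, cookie name, and form field are placeholders, and the request is only built here, not sent, so the example runs without network access.

```python
import requests

# Hypothetical command copied from the browser dev tools:
#   curl 'https://example.com/search' -X POST \
#        -H 'User-Agent: Mozilla/5.0' -b 'sessionid=PLACEHOLDER' \
#        --data 'q=laptop'
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
session.cookies.set("sessionid", "PLACEHOLDER")

request = requests.Request(
    "POST", "https://example.com/search", data={"q": "laptop"}
)
prepared = session.prepare_request(request)  # built, but not sent

print(prepared.method, prepared.url)
print(prepared.headers["Cookie"])
print(prepared.body)

# To send it and parse the reply (needs network access):
# from bs4 import BeautifulSoup
# response = session.send(prepared, timeout=10)
# soup = BeautifulSoup(response.text, "html.parser")
```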

Summary:
Use bs4 if you are just starting out or only scrape once in a while for small projects.

Use Scrapy if you are a professional web scraper who has to deal with large-scale data collection and has to run the scraper for a long time.

Answered By: Nagasta Begamba