How to use the Scrapy package with Jupyter Notebook

Question:

I’m trying to learn web scraping/crawling and tried to run the code below in a Jupyter Notebook, but it didn’t show any output. Can anyone help me and guide me on how to use the Scrapy package in a Jupyter notebook?

The code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BooksCrawlSpider(CrawlSpider):
    name = 'books_crawl'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['https://books.toscrape.com/catalogue/category/books/sequential-art_5/page-1.html']

    le_book_details = LinkExtractor(restrict_css='h3 > a')
    le_next = LinkExtractor(restrict_css='.next > a')  # next_button
    le_cats = LinkExtractor(restrict_css='.side_categories > ul > li > ul > li a')  # Categories

    rule_book_details = Rule(le_book_details, callback='parse_item', follow=False)
    rule_next = Rule(le_next, follow=True)
    rule_cats = Rule(le_cats, follow=True)

    rules = (
        rule_book_details,
        rule_next,
        rule_cats
    )

    def parse_item(self, response):
        yield {
            'Title': response.css('h1 ::text').get(),
            'Category': response.xpath('//ul[@class="breadcrumb"]/li[last()-1]/a/text()').get(),
            'Link': response.url
        }

The cell runs to completion but shows no output.

Asked By: Mahmoud Badr


Answers:

Defining the spider class by itself doesn’t start a crawl. To run your spider from the notebook, add the following snippet in a new cell:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()        # starts a Twisted reactor behind the scenes
process.crawl(BooksCrawlSpider)   # schedule the spider defined above
process.start()                   # blocks until the crawl finishes
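
Note that the Twisted reactor Scrapy runs on can only be started once per process, so re-running this cell in the same kernel raises ReactorNotRestartable; restart the kernel before launching another crawl. If you would rather collect the items in memory instead of a file, here is a minimal sketch using Scrapy’s item_scraped signal (the items list and collect_item callback are illustrative names, not part of the answer above):

from scrapy import signals
from scrapy.crawler import CrawlerProcess

items = []  # will hold the dicts yielded by parse_item

def collect_item(item, response, spider):
    # called once for every item the spider yields
    items.append(item)

process = CrawlerProcess(settings={"LOG_LEVEL": "WARNING"})  # quieter notebook output
crawler = process.create_crawler(BooksCrawlSpider)
crawler.signals.connect(collect_item, signal=signals.item_scraped)
process.crawl(crawler)
process.start()  # blocks until the crawl finishes; items is populated afterwards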

More details in the Scrapy docs under “Run Scrapy from a script”.


Edit:

A way to create a dataframe from the extracted items is to first export the output to a file (e.g., CSV) by passing the settings parameter to CrawlerProcess:

process = CrawlerProcess(settings={
    "FEEDS": {
        # export every scraped item to items.csv in CSV format
        "items.csv": {"format": "csv"},
    },
})
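
One caveat: the local-file feed storage appends to an existing file by default, so re-running the crawl can duplicate rows in items.csv. On Scrapy 2.4 or later (an assumption about your version), you can ask for the file to be replaced instead:

process = CrawlerProcess(settings={
    "FEEDS": {
        # "overwrite": True replaces items.csv on each run instead of appending
        # (the option was added in Scrapy 2.4)
        "items.csv": {"format": "csv", "overwrite": True},
    },
})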

Then open it with pandas:

import pandas as pd

df = pd.read_csv("items.csv")

Answered By: Thiago Curvelo