Access django models inside of Scrapy

Question:

Is it possible to access my django models inside of a Scrapy pipeline, so that I can save my scraped data straight to my model?

I’ve seen this, but I don’t really understand how to set it up.

Asked By: imns

Answers:

Set the DJANGO_SETTINGS_MODULE environment variable in your scrapy project’s settings.py:

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'your_django_project.settings'

Now you can use DjangoItem in your scrapy project.

Edit:
You have to make sure that your_django_project’s settings.py is available on your PYTHONPATH.
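
Putting a directory on PYTHONPATH is equivalent to adding it to sys.path at runtime, which is handy when you can’t control the shell environment. A quick stdlib check (the project path is a placeholder):

```python
import sys

# Hypothetical project directory -- substitute your own
project_dir = '/path/to/your_django_project'

# PYTHONPATH entries end up in sys.path at interpreter startup;
# inserting at runtime has the same effect for subsequent imports
if project_dir not in sys.path:
    sys.path.insert(0, project_dir)

print(project_dir in sys.path)  # True
```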

Answered By: Jet Guo

If anyone else is having the same problem, this is how I solved it.

I added this to my scrapy settings.py file:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    # Locate and load the settings module in the given directory
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure Django to use the loaded settings (deprecated in later versions)
    setup_environ(project)

setup_django_env('/path/to/django/project/')

Note: the path above is to your django project folder, not the settings.py file.

Now you will have full access to your django models inside of your scrapy project.

Answered By: imns

The opposite solution (set up Scrapy inside a Django management command):

# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py 

from __future__ import absolute_import
from django.core.management.base import BaseCommand

class Command(BaseCommand):

    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])
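
The argv hand-off can be traced without Django or Scrapy installed. A toy stand-in (FakeCommand is purely illustrative) shows what ends up reaching scrapy.cmdline.execute:

```python
# Toy stand-in for the command above -- no Django/Scrapy required.
class FakeCommand:
    def run_from_argv(self, argv):
        self._argv = argv
        self.execute()

    def execute(self):
        # Django's execute() performs setup, then calls handle()
        self.handle()

    def handle(self):
        # The real command calls scrapy.cmdline.execute(self._argv[1:]) here;
        # argv[1:] drops 'manage.py' so 'scrapy' becomes Scrapy's own argv[0]
        self.forwarded = self._argv[1:]

cmd = FakeCommand()
cmd.run_from_argv(['./manage.py', 'scrapy', 'crawl', 'myspider'])
print(cmd.forwarded)  # ['scrapy', 'crawl', 'myspider']
```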

and in django’s settings.py:

import os
os.environ['SCRAPY_SETTINGS_MODULE'] = 'scrapy_project.settings'

Then, instead of scrapy foo, run ./manage.py scrapy foo.

UPD: fixed the code to bypass django’s options parsing.

Answered By: Mikhail Korobov

For Django 1.4, the project layout has changed. Instead of /myproject/settings.py, the settings module is in /myproject/myproject/settings.py.

I also added path’s parent directory (/myproject) to sys.path to make it work correctly.

def setup_django_env(path):
    import imp, os, sys
    from django.core.management import setup_environ

    # Locate and load the settings module in the given directory
    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)

    # Configure Django to use the loaded settings (deprecated in later versions)
    setup_environ(project)

    # Add path's parent directory to sys.path so 'myproject.*' imports resolve
    sys.path.append(os.path.abspath(os.path.join(path, os.path.pardir)))

setup_django_env('/path/to/django/myproject/myproject/')
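
The sys.path line above resolves the parent of the inner myproject directory; the computation can be verified with plain os.path calls (the path is a placeholder):

```python
import os

# Placeholder path matching the example above
path = '/path/to/django/myproject/myproject/'

# Same expression as in setup_django_env: resolve the parent directory
parent = os.path.abspath(os.path.join(path, os.path.pardir))
print(parent)  # /path/to/django/myproject
```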
Answered By: samwize

Check out django-dynamic-scraper, it integrates a Scrapy spider manager into a Django site.

https://github.com/holgerd77/django-dynamic-scraper

Answered By: Sectio Aurea

Why not create an __init__.py file in the scrapy project folder and hook it up in INSTALLED_APPS? That worked for me; I was able to simply use:

pipelines.py

from my_app.models import MyModel

Hope that helps.

Answered By: Özer

setup_environ is deprecated. For newer versions of Django (1.4+), you may need to do the following in scrapy’s settings file:

def setup_django_env():
    import sys, os, django

    sys.path.append('/path/to/django/myapp')
    os.environ['DJANGO_SETTINGS_MODULE'] = 'myapp.settings'
    django.setup()

setup_django_env()
Answered By: Brayoni

Minor update to solve a KeyError (Python 3 / Django 1.10 / Scrapy 1.2.0):

from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = 'Scrapy commands. Accessible from: "Django manage.py". '

    def __init__(self, stdout=None, stderr=None, no_color=False):
        # Forward the stream/color options to BaseCommand
        super().__init__(stdout=stdout, stderr=stderr, no_color=no_color)

        # Will hold the CLI arguments captured in run_from_argv
        self._argv = None

    def run_from_argv(self, argv):
        self._argv = argv
        # Passing no_color explicitly avoids the KeyError in execute()
        self.execute(stdout=None, stderr=None, no_color=False)

    def handle(self, *args, **options):
        from scrapy.cmdline import execute
        execute(self._argv[1:])

The SCRAPY_SETTINGS_MODULE declaration is still required.

os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')
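
Note that os.environ.setdefault only writes the variable when it is absent, so a value exported in the shell takes precedence over the in-code default. A quick check:

```python
import os

# Simulate a value already exported in the environment
os.environ['SCRAPY_SETTINGS_MODULE'] = 'exported.settings'

# setdefault leaves an existing value untouched
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')
print(os.environ['SCRAPY_SETTINGS_MODULE'])  # exported.settings

# With the variable absent, the default is applied
del os.environ['SCRAPY_SETTINGS_MODULE']
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'scrapy_project.settings')
print(os.environ['SCRAPY_SETTINGS_MODULE'])  # scrapy_project.settings
```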
Answered By: Siggy