Access Django models with scrapy: defining path to Django project

Question:

I’m very new to Python and Django. I’m currently exploring using Scrapy to scrape sites and save data to the Django database. My goal is to run a spider based on a domain given by a user.

I’ve written a spider that extracts the data I need and stores it correctly in a JSON file when calling

scrapy crawl spider -o items.json -t json

as described in the Scrapy tutorial.

My goal is now to get the spider to successfully save data to the Django database, and then work on getting the spider to run based on user input.

I’m aware that various posts exist on this subject, such as these:
link 1
link 2
link 3

But having spent more than 8 hours trying to get this to work, I’m assuming I’m not the only one still facing issues with it. I’ll therefore try to gather all the knowledge I’ve collected so far in this post, and hopefully post a working solution at a later point. Because of this, this post is rather long.

It appears to me that there are two different solutions for saving data to the Django database from Scrapy. One is to use DjangoItem; the other is to import the models directly (as done here).

I’m not completely aware of the advantages and disadvantages of the two, but it seems like the difference is simply that using DjangoItem is more convenient and shorter.
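
As far as I understand it, the two approaches would look roughly like this (an untested sketch on my part, using a hypothetical Person model):

# Approach 1: DjangoItem -- the item's fields mirror the Django model.
from scrapy.contrib.djangoitem import DjangoItem
from myapp.models import Person  # hypothetical model

class PersonItem(DjangoItem):
    django_model = Person
# A pipeline can then persist the item with a single item.save() call.

# Approach 2: a plain Item; the model is imported only in the pipeline.
from scrapy.item import Item, Field

class PersonPlainItem(Item):
    name = Field()
# A pipeline then builds the model object manually:
#     Person(name=item['name']).save()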

What i’ve done:

I’ve added:

def setup_django_env(path):
    import imp, os
    from django.core.management import setup_environ

    f, filename, desc = imp.find_module('settings', [path])
    project = imp.load_module('settings', f, filename, desc)       

    setup_environ(project)

setup_django_env('/Users/Anders/DjangoTraining/wsgi/')

The error I’m getting is:

ImportError: No module named settings

I’m thinking I’m defining the path to my Django project in the wrong way?

I’ve also tried the following:

setup_django_env('../../') 

How do I define the path to my Django project correctly? (if that is the issue)
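
For reference, my understanding is that imp.find_module('settings', [path]) only looks for settings.py directly inside the given directory, so the path passed to setup_django_env would have to be the folder that contains settings.py itself:

import imp

try:
    # Succeeds only if /Users/Anders/DjangoTraining/wsgi/settings.py exists;
    # find_module does not search subdirectories of the given path.
    imp.find_module('settings', ['/Users/Anders/DjangoTraining/wsgi/'])
    print('settings.py found')
except ImportError:
    print('settings.py is not directly inside that directory')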

Asked By: Splurk


Answers:

I think the main misconception is the package path vs. the settings module path. In order to use Django’s models from an external script you need to set the DJANGO_SETTINGS_MODULE environment variable. Then, that module has to be importable (i.e. if the settings module path is myproject.settings, then the statement from myproject import settings should work in a Python shell).

As most Django projects are created in a path outside the default PYTHONPATH, you must add the project’s path to the PYTHONPATH environment variable.
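
A quick way to check both requirements from a plain Python shell (a sketch; the paths match the walkthrough below, so adjust them to your own project):

# Sketch: verify both requirements before involving Scrapy at all.
import os
import sys

# 1. The directory *containing* the project package must be on sys.path.
sys.path.insert(0, '/home/rolando/projects/myweb')

# 2. Django reads the settings module path from this environment variable.
os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'

# If this import fails, Django will fail with the same ImportError.
from myweb import settings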

Here is a step-by-step guide to create a fully working (and minimal) Django models integration into a Scrapy project:

Note: These instructions work as of the date of the last edit. If they don’t work for you, please add a comment describing your issue and your Scrapy/Django versions.

  1. The projects will be created within /home/rolando/projects directory.

  2. Start the django project.

    $ cd ~/projects
    $ django-admin startproject myweb
    $ cd myweb
    $ ./manage.py startapp myapp
    
  3. Create a model in myapp/models.py.

    from django.db import models
    
    
    class Person(models.Model):
        name = models.CharField(max_length=32)
    
  4. Add myapp to INSTALLED_APPS in myweb/settings.py.

    # at the end of settings.py
    INSTALLED_APPS += ('myapp',)
    
  5. Set my db settings in myweb/settings.py.

    # at the end of settings.py
    DATABASES['default']['ENGINE'] = 'django.db.backends.sqlite3'
    DATABASES['default']['NAME'] = '/tmp/myweb.db'
    
  6. Create the database.

    $ ./manage.py syncdb --noinput
    Creating tables ...
    Installing custom SQL ...
    Installing indexes ...
    Installed 0 object(s) from 0 fixture(s)
    
  7. Create the scrapy project.

    $ cd ~/projects
    $ scrapy startproject mybot
    $ cd mybot
    
  8. Create an item in mybot/items.py.

Note: In newer versions of Scrapy, you need to install scrapy_djangoitem and use from scrapy_djangoitem import DjangoItem.

    from scrapy.contrib.djangoitem import DjangoItem
    from scrapy.item import Field

    from myapp.models import Person


    class PersonItem(DjangoItem):
        # fields for this item are automatically created from the django model
        django_model = Person
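
As the note above says, on newer Scrapy releases (where scrapy.contrib was removed) the same item would be written against the external package instead. A sketch, assuming pip install scrapy-djangoitem:

    from scrapy_djangoitem import DjangoItem

    from myapp.models import Person


    class PersonItem(DjangoItem):
        # fields are still created automatically from the django model
        django_model = Person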

The final directory structure is this:

/home/rolando/projects
├── mybot
│   ├── mybot
│   │   ├── __init__.py
│   │   ├── items.py
│   │   ├── pipelines.py
│   │   ├── settings.py
│   │   └── spiders
│   │       └── __init__.py
│   └── scrapy.cfg
└── myweb
    ├── manage.py
    ├── myapp
    │   ├── __init__.py
    │   ├── models.py
    │   ├── tests.py
    │   └── views.py
    └── myweb
        ├── __init__.py
        ├── settings.py
        ├── urls.py
        └── wsgi.py

From here, we are basically done with the code required to use the Django models in a Scrapy project. We can test it right away using the scrapy shell command, but be aware of the required environment variables:

$ cd ~/projects/mybot
$ PYTHONPATH=~/projects/myweb DJANGO_SETTINGS_MODULE=myweb.settings scrapy shell

# ... scrapy banner, debug messages, python banner, etc.

In [1]: from mybot.items import PersonItem

In [2]: i = PersonItem(name='rolando')

In [3]: i.save()
Out[3]: <Person: Person object>

In [4]: PersonItem.django_model.objects.get(name='rolando')
Out[4]: <Person: Person object>

So, it is working as intended.

Finally, you might not want to have to set the environment variables each time you run your bot. There are many alternatives to address this issue, although the best one is to have the projects’ packages actually installed in a path on PYTHONPATH.

This is one of the simplest solutions: add these lines to your mybot/settings.py file to set up the environment variables.

# Setting up django's project full path.
import sys
sys.path.insert(0, '/home/rolando/projects/myweb')

# Setting up django's settings module name.
# This module is located at /home/rolando/projects/myweb/myweb/settings.py.
import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'myweb.settings'

# Since Django 1.7, setup() call is required to populate the apps registry.
import django; django.setup()

Note: A better approach than the path hacking is to have setuptools-based setup.py files in both projects and run python setup.py develop, which will link your project path into Python’s path (I’m assuming you use virtualenv).
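
For illustration, a minimal setuptools file for the Django project might look like this (a sketch; name and version are placeholders, and the Scrapy project’s setup.py would be analogous):

# Sketch: minimal setup.py at ~/projects/myweb/setup.py, so that
# `python setup.py develop` links the package into the active environment.
from setuptools import setup, find_packages

setup(
    name='myweb',              # placeholder project name
    version='0.1',             # placeholder version
    packages=find_packages(),
)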

That is enough. For completeness, here is a basic spider and pipeline for a fully working project:

  1. Create the spider.

    $ cd ~/projects/mybot
    $ scrapy genspider -t basic example example.com
    

    The spider code:

    # file: mybot/spiders/example.py
    from scrapy.spider import BaseSpider
    from mybot.items import PersonItem
    
    
    class ExampleSpider(BaseSpider):
        name = "example"
        allowed_domains = ["example.com"]
        start_urls = ['http://www.example.com/']
    
        def parse(self, response):
            # do stuff
            return PersonItem(name='rolando')
    
  2. Create a pipeline in mybot/pipelines.py to save the item.

    class MybotPipeline(object):
        def process_item(self, item, spider):
            item.save()
            return item
    

    Here you can either use item.save() if you are using the DjangoItem class, or import the Django model directly and create the object manually. Either way, the main issue is defining the environment variables so that the Django models are usable.
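
    For comparison, the direct-model variant of the same pipeline might look like this (a sketch, assuming the plain-Item route with a name field):

    from myapp.models import Person

    class MybotPipeline(object):
        def process_item(self, item, spider):
            # Build and save the model object manually instead of item.save().
            Person(name=item['name']).save()
            return item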

  3. Add the pipeline setting to your mybot/settings.py file.

    ITEM_PIPELINES = {
        'mybot.pipelines.MybotPipeline': 1000,
    }
    
  4. Run the spider.

    $ scrapy crawl example
    
Answered By: R. Max

Even though Rho’s answer seems very good, I thought I’d share how I got Scrapy working with Django models (aka the Django ORM) without a full-blown Django project, since the question only mentions the use of a “Django database”. Also, I do not use DjangoItem.

The following works with Scrapy 0.18.2 and Django 1.5.2. My Scrapy project is called scrapping throughout.

  1. Add the following to your scrapy settings.py file

    from django.conf import settings as d_settings
    d_settings.configure(
        DATABASES={
            'default': {
                'ENGINE': 'django.db.backends.postgresql_psycopg2',
                'NAME': 'db_name',
                'USER': 'db_user',
                'PASSWORD': 'my_password',
                'HOST': 'localhost',  
                'PORT': '',
            }},
        INSTALLED_APPS=(
            'scrapping',
        )
    )
    
  2. Create a manage.py file in the same folder as your scrapy.cfg:
    This file is not needed when you run the spider itself but is super convenient for setting up the database. So here we go:

    #!/usr/bin/env python
    import os
    import sys
    
    if __name__ == "__main__":
        os.environ.setdefault("DJANGO_SETTINGS_MODULE", "scrapping.settings")
    
        from django.core.management import execute_from_command_line
    
        execute_from_command_line(sys.argv)
    

    That’s the entire content of manage.py, and it is pretty much exactly the stock manage.py file you get after running django-admin startproject myweb, except that the DJANGO_SETTINGS_MODULE line points to your Scrapy settings file.
    Admittedly, using DJANGO_SETTINGS_MODULE together with settings.configure seems a bit odd, but it works for the one manage.py command I need: $ python ./manage.py syncdb.

  3. Your models.py
    Your models.py should be placed in your Scrapy project folder (i.e. importable as scrapping.models).
    After creating that file you should be able to run
    $ python ./manage.py syncdb. It may look like this:

    from django.db import models
    
    class MyModel(models.Model):
        title = models.CharField(max_length=255)
        description = models.TextField()
        url = models.URLField(max_length=255, unique=True)
    
  4. Your items.py and pipelines.py:
    I used to use DjangoItem as described in Rho’s answer, but I ran into trouble with it when running many crawls in parallel with scrapyd and using PostgreSQL. The exception max_locks_per_transaction was thrown at some point, breaking all the running crawls. Furthermore, I did not figure out how to properly roll back a failed item.save() in the pipeline. Long story short, I ended up not using DjangoItem at all, which solved all my problems. Here is how:
    items.py:

    from scrapy.item import Item, Field
    
    class MyItem(Item):
        title = Field()
        description = Field()
        url = Field()
    

    Note that the fields need to have the same name as in the model if you want to unpack them conveniently as in the next step!
    pipelines.py:

    from django.db import transaction
    from models import MyModel

    class Django_pipeline(object):
        def process_item(self, item, spider):
            # commit_on_success() commits on success and rolls back on error.
            with transaction.commit_on_success():
                scraps = MyModel(**item)
                scraps.save()
            return item
    

    As mentioned above, if you named all your item fields like the fields in your models.py file, you can use **item to unpack all of them when creating your MyModel object.
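
    One caveat: transaction.commit_on_success() was deprecated in Django 1.6 and removed in Django 1.8. On modern Django the same pipeline would use transaction.atomic() instead (a sketch; the absolute import assumes the scrapping package layout above):

    from django.db import transaction
    from scrapping.models import MyModel

    class Django_pipeline(object):
        def process_item(self, item, spider):
            # atomic() replaces commit_on_success() on Django >= 1.6.
            with transaction.atomic():
                MyModel(**item).save()
            return item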

That’s it!

Answered By: Chris