Django + Celery + Scrapy: Twisted reactor (ReactorNotRestartable) and database (SSL error) errors

Question:

I have a Django 2.0, Celery 4 and Scrapy 1.5 setup in which a spider runs inside a custom Django management command. I need to run this command at regular intervals, so I use Celery to call it. Each run scrapes a web page and saves some results to the database. Here are my files:

get_data.py

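from django.core.management.base import BaseCommand
from scrapy.crawler import CrawlerProcess
from scrapy.settings import Settings

# crawler_settings and DataLogSpider are imported from project modules (omitted here).


class Command(BaseCommand):
    help = 'Crawl for new data'

    def handle(self, *args, **options):
        # Build the Scrapy settings from the project's settings module.
        settings = Settings()
        settings.setmodule(crawler_settings)
        process = CrawlerProcess(settings=settings)
        args = {some needed args}
        process.crawl(DataLogSpider, kwargs=args)
        # Starts the Twisted reactor and blocks until the crawl finishes.
        process.start()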

celery.py

import os

from celery import Celery

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'config.settings.local')

app = Celery('config')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()

@app.task(bind=True)
def debug_task(self):
    print('Request: {0!r}'.format(self.request))

tasks.py

@task()
def collect_data_information():
    call_command('get_data')

(Django) settings.py

from celery.schedules import crontab

CELERY_BROKER_URL = 'redis://localhost:6379/0'
CELERY_BEAT_SCHEDULE = {
    'task-get-logs': {
        'task': 'core.tasks.collect_data_information',
        'schedule': crontab(minute='*/15')  # every 15 minutes
    },
}

I’ve removed some imports and reduced the code for simplicity. The problem is that when I run my Celery task, the spider only executes the first time; the second time I get a ReactorNotRestartable error. I understand that this happens because the Twisted reactor cannot be restarted once it has been stopped within the same process. I’ve already looked into questions 1, 2 and 3 and many others involving the same error, but none of them considers the concurrency problem that arises when using Django to save to the database.

When I tried applying their solutions to my problem, I received a django.db.utils.OperationalError: SSL error: decryption failed or bad record mac. I’ve looked that up as well: it is caused by multiple processes opening a database connection, which is exactly what happens with their solutions.
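As far as I understand it, this kind of error typically shows up when a forked child keeps using a connection it inherited from its parent, so two processes end up writing to the same encrypted stream. The following is only a standard-library sketch of that sharing, not Django or PostgreSQL code: `shared_stream_demo` is a hypothetical name, `socket.socketpair()` stands in for the database connection, and `os.fork()` assumes a POSIX system.

```python
import os
import socket


def shared_stream_demo():
    """Return the bytes the peer receives when parent and child share one socket."""
    # socketpair() stands in for the database connection (assumption: POSIX).
    client, server = socket.socketpair()

    pid = os.fork()
    if pid == 0:
        # Child: writes on the connection object it inherited from the parent.
        client.sendall(b"[child]")
        os._exit(0)

    os.waitpid(pid, 0)           # wait so the order of the writes is deterministic
    client.sendall(b"[parent]")  # parent writes on the very same connection
    client.close()

    # Read until every copy of the client end has been closed.
    received = b""
    while True:
        chunk = server.recv(1024)
        if not chunk:
            break
        received += chunk
    server.close()
    return received


if __name__ == "__main__":
    print(shared_stream_demo())  # b'[child][parent]'
```

Both processes feed the same stream, and the peer cannot tell them apart. With an SSL connection, each side also keeps its own copy of the encryption state, which is a plausible reason the server ends up rejecting the records.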

So my question boils down to: is there a way to run Celery + Scrapy + Django without running into problems caused by the Twisted reactor being started and stopped multiple times?

Asked By: Levi Moreira


Answers:

I’ve found a solution myself. I had to add the following to my Celery settings file:

app.conf.update(
    worker_max_tasks_per_child=1,
    broker_pool_limit=None
)

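This tells Celery to replace each worker process after it has executed a single task, so every task starts in a fresh process with a clean slate and the ReactorNotRestartable problem doesn’t occur. Setting broker_pool_limit=None disables the broker connection pool, so broker connections are opened and closed for every use.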
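To see why one task per process helps, here is a rough analogy that uses only the standard library rather than Celery itself: `multiprocessing.Pool` accepts a `maxtasksperchild` argument that plays a similar role to `worker_max_tasks_per_child`. The names `_reactor_started`, `run_crawl` and `run_jobs` are hypothetical, the module-level flag merely simulates a reactor that can be started once per process, and the `fork` start method assumes a POSIX system.

```python
import multiprocessing as mp
import os

# Stand-in for the Twisted reactor: state that can be "started" once per process.
_reactor_started = False


def run_crawl(job_id):
    """Hypothetical task body: fails if this process has already run a crawl."""
    global _reactor_started
    if _reactor_started:
        raise RuntimeError("ReactorNotRestartable (simulated)")
    _reactor_started = True
    return job_id, os.getpid()


def run_jobs(max_tasks_per_child, jobs=3):
    """Run the jobs one after another on a single-worker pool."""
    ctx = mp.get_context("fork")  # assumption: POSIX system
    results = []
    with ctx.Pool(processes=1, maxtasksperchild=max_tasks_per_child) as pool:
        for job_id in range(jobs):
            try:
                results.append(pool.apply(run_crawl, (job_id,)))
            except RuntimeError as exc:
                # The worker's exception is re-raised in the parent process.
                results.append((job_id, str(exc)))
    return results
```

Calling `run_jobs(1)` returns three results with three different PIDs, because the worker is replaced after every job. Calling `run_jobs(None)` reuses the same worker, so only the first job succeeds and the remaining ones report the simulated error.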

Answered By: Levi Moreira