Update multiple objects at once in Django?

Question:

I am using Django 1.9. I have a Django table that represents the value of a particular measure, by organisation by month, with raw values and percentiles:

class MeasureValue(models.Model):
    org = models.ForeignKey(Org, null=True, blank=True)
    month = models.DateField()
    calc_value = models.FloatField(null=True, blank=True)
    percentile = models.FloatField(null=True, blank=True)

There are typically 10,000 or so per month. My question is about whether I can speed up the process of setting values on the models.

Currently, I calculate percentiles by retrieving all the measurevalues for a month using a Django filter query, converting it to a pandas dataframe, and then using scipy’s rankdata to set ranks and percentiles. I do this because pandas and rankdata are efficient, able to ignore null values, and able to handle repeated values in the way that I want, so I’m happy with this method:

records = MeasureValue.objects.filter(month=month).values()
df = pd.DataFrame.from_records(records)
// use calc_value to set percentile on each row, using scipy's rankdata

However, I then need to retrieve each percentile value from the dataframe, and set it back onto the model instances. Right now I do this by iterating over the dataframe’s rows, and updating each instance:

for i, row in df.iterrows():
    mv = MeasureValue.objects.get(org=row.org, month=month)
    if (row.percentile is None) or np.isnan(row.percentile):
        row.percentile = None
    mv.percentile = row.percentile
    mv.save()

This is unsurprisingly quite slow. Is there any efficient Django way to speed it up, by making a single database write rather than tens of thousands? I have checked the documentation, but can’t see one.

Asked By: Richard

||

Answers:

Atomic transactions can reduce the time spent in the loop:

from django.db import transaction

with transaction.atomic():
    for i, row in df.iterrows():
        mv = MeasureValue.objects.get(org=row.org, month=month)

        if (row.percentile is None) or np.isnan(row.percentile): 
            # if it's already None, why set it to None?
            row.percentile = None

        mv.percentile = row.percentile
        mv.save()

Django’s default behavior is to run in autocommit mode. Each query is immediately committed to the database, unless a transaction is actives.

By using with transaction.atomic() all the inserts are grouped into a single transaction. The time needed to commit the transaction is amortized over all the enclosed insert statements and so the time per insert statement is greatly reduced.

Answered By: ahmed

As of Django 2.2, you can use the bulk_update() queryset method to efficiently update the given fields on the provided model instances, generally with one query:

objs = [
    Entry.objects.create(headline='Entry 1'),
    Entry.objects.create(headline='Entry 2'),
]
objs[0].headline = 'This is entry 1'
objs[1].headline = 'This is entry 2'
Entry.objects.bulk_update(objs, ['headline'])

In older versions of Django you could use update() with Case/When, e.g.:

from django.db.models import Case, When

Entry.objects.filter(
    pk__in=headlines  # `headlines` is a pk -> headline mapping
).update(
    headline=Case(*[When(pk=entry_pk, then=headline)
                    for entry_pk, headline in headlines.items()]))
Answered By: Eugene Yarmash

Actually, attempting @Eugene Yarmash ‘s answer I found I got this error:

FieldError: Joined field references are not permitted in this query

But I believe iterating update is still quicker than multiple saves, and I expect using a transaction should also expedite.

So, for versions of Django that don’t offer bulk_update, assuming the same data used in Eugene’s answer, where headlines is a pk -> headline mapping:

from django.db import transaction

with transaction.atomic():
    for entry_pk, headline in headlines.items():
        Entry.objects.filter(pk=entry_pk).update(headline=headline)
Answered By: NeilG
Categories: questions Tags: ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.