Aggregating a windowed queryset in Django

Question:

Background

Suppose we have a set of questions, and a set of students that answered these questions.
The answers have been reviewed, and scores have been assigned, on some unknown range.

Now, we need to normalize the scores with respect to the extreme values within each question.
For example, if question 1 has a minimum score of 4 and a maximum score of 12, those scores would be normalized to 0 and 1 respectively. Scores in between are interpolated linearly (as described e.g. in Normalization to bring in the range of [0,1]).

Then, for each student, we would like to know the mean of the normalized scores for all questions combined.

Minimal example

Here’s a very naive minimal implementation, just to illustrate what we would like to achieve:

class Question(models.Model):
    pass


class Student(models.Model):
    def mean_normalized_score(self):
        normalized_scores = []
        for score in self.score_set.all():
            normalized_scores.append(score.normalized_value())
        return mean(normalized_scores) if normalized_scores else None


class Score(models.Model):
    student = models.ForeignKey(to=Student, on_delete=models.CASCADE)
    question = models.ForeignKey(to=Question, on_delete=models.CASCADE)
    value = models.FloatField()

    def normalized_value(self):
        limits = Score.objects.filter(question=self.question).aggregate(
            min=models.Min('value'), max=models.Max('value'))
        return (self.value - limits['min']) / (limits['max'] - limits['min'])

This works well, but it is quite inefficient in terms of database queries, etc.

Goal

Instead of the implementation above, I would prefer to offload the number-crunching on to the database.

What I’ve tried

Consider, for example, these two use cases:

  1. list the normalized_value for all Score objects
  2. list the mean_normalized_score for all Student objects

The first use case can be covered using window functions in a query, something like this:

w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])
annotated_scores = Score.objects.annotate(
    normalized_value=(F('value') - w_min) / (w_max - w_min))

This works nicely, so the Score.normalized_value() method from the example is no longer needed.

Now, I would like to do something similar for the second use case, to replace the Student.mean_normalized_score() method by a single database query.

The raw SQL could look something like this (for sqlite):

SELECT id, student_id, AVG(normalized_value) AS mean_normalized_score
FROM (
    SELECT
        myapp_score.*,
        ((myapp_score.value - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)) / (MAX(myapp_score.value) OVER (PARTITION BY myapp_score.question_id) - MIN(myapp_score.value) OVER (PARTITION BY myapp_score.question_id)))
        AS normalized_value
    FROM myapp_score
    ) 
GROUP BY student_id

I can make this work as a raw Django query, but I have not yet been able to reproduce this query using Django’s ORM.

I’ve tried building on the annotated_scores queryset described above, using Django’s Subquery, annotate(), aggregate(), Prefetch, and combinations of those, but I must be making a mistake somewhere.

Probably the closest I’ve gotten is this:

subquery = Subquery(annotated_scores.values('normalized_value'))
Score.objects.values('student_id').annotate(mean=Avg(subquery))

But this is incorrect.

Could someone point me in the right direction, without resorting to raw queries?

Asked By: djvg

||

Answers:

I may have found a way to do this using subqueries. The main thing is at least from django, we cannot use the window functions on aggregates, so that’s what is blocking the calculation of the mean of the normalized values. I’ve added comments on the lines to explain what I’m trying to do:

# Get the minimum score per question
min_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(min=Min('value'))

# Get the maximum score per question
max_subquery = Score.objects.filter(question=OuterRef('question')).values('question').annotate(max=Max('value'))

# Calculate the normalized value per score, then get the average by grouping by students
mean_subquery = Score.objects.filter(student=OuterRef('pk')).annotate(
    min=Subquery(min_subquery.values('min')[:1]), 
    max=Subquery(max_subquery.values('max')[:1]), 
    normalized=ExpressionWrapper((F('value') - F('min'))/(F('max') - F('min')), output_field=FloatField())
).values('student').annotate(mean=Avg('normalized'))

# Get the calculated mean per student
Student.objects.annotate(mean=Subquery(mean_subquery.values('mean')[:1]))

The resulting SQL is:

SELECT 
  "student"."id", 
  "student"."name", 
  (
    SELECT 
      AVG(
        (
          (
            V0."value" - (
              SELECT 
                MIN(U0."value") AS "min" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            )
          ) / (
            (
              SELECT 
                MAX(U0."value") AS "max" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            ) - (
              SELECT 
                MIN(U0."value") AS "min" 
              FROM 
                "score" U0 
              WHERE 
                U0."question_id" = (V0."question_id") 
              GROUP BY 
                U0."question_id" 
              LIMIT 
                1
            )
          )
        )
      ) AS "mean" 
    FROM 
      "score" V0 
    WHERE 
      V0."student_id" = ("student"."id") 
    GROUP BY 
      V0."student_id" 
    LIMIT 
      1
  ) AS "mean" 
FROM 
  "student"
Answered By: Brian Destura

As mentioned by @bdbd, and judging from this Django issue, it appears that annotating a windowed queryset is not yet possible (using Django 3.2).

As a temporary workaround, I refactored @bdbd’s excellent Subquery solution as follows.

class ScoreQuerySet(models.QuerySet):
    def annotate_normalized(self):
        w_min = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            min=Min('value')).values('min')[:1])
        w_max = Subquery(self.filter(
            question=OuterRef('question')).values('question').annotate(
            max=Max('value')).values('max')[:1])
        return self.annotate(normalized=(F('value') - w_min) / (w_max - w_min))

    def aggregate_student_mean(self):
        return self.annotate_normalized().values('student_id').annotate(
            mean=Avg('normalized'))


class Score(models.Model):
    objects = ScoreQuerySet.as_manager()
    ...

Note: If necessary, we can add more Student lookups to the values() in aggregate_student_mean(), e.g. student__name. As long as we take care not to mess up the grouping.

Now, if it ever becomes possible to annotate windowed querysets, we can simply replace the Subquery lines by the much simpler Window implementation:

w_min = Window(expression=Min('value'), partition_by=[F('question')])
w_max = Window(expression=Max('value'), partition_by=[F('question')])

EDIT: Django 4.2 now supports filtering windowed querysets, but still does not allow annotations.

Answered By: djvg
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.