Incorrect results with `annotate` + `values` + `union` in Django

Question:

Jump to edit to see more real-life code example, that doesn’t work after changing the query order

Here are my models:

class ModelA(models.Model):
    field_1a = models.CharField(max_length=32)
    field_2a = models.CharField(max_length=32)


class ModelB(models.Model):
    field_1b = models.CharField(max_length=32)
    field_2b = models.CharField(max_length=32)

Now, create 2 instances each:

ModelA.objects.create(field_1a="1a1", field_2a="1a2")
ModelA.objects.create(field_1a="2a1", field_2a="2a2")
ModelB.objects.create(field_1b="1b1", field_2b="1b2")
ModelB.objects.create(field_1b="2b1", field_2b="2b2")

If I’ll query for only one model with annotations, I get something like that:

>>> ModelA.objects.all().annotate(field1=F("field_1a"), field2=F("field_2a")).values("field1", "field2")
[{"field1": "1a1", "field2": "1a2"}, {"field1": "2a1", "field2": "2a2"}]

This is correct behavior. The problem starts, when I want to get union of those two models:

# model A first, with annotate
query = ModelA.objects.all().annotate(field1=F("field_1a"), field2=F("field_2a"))
# now union with model B, also annotated
query = query.union(ModelB.objects.all().annotate(field1=F("field_1b"), field2=F("field_2b")))
# get only field1 and field2
query = query.values("field1", "field2")

# the results are skewed:
assert list(query) == [
    {"field1": 1, "field2": "1a1"},
    {"field1": 1, "field2": "1b1"},
    {"field1": 2, "field2": "2a1"},
    {"field1": 2, "field2": "2b1"},
]

The assert passes correctly, which means that the results are wrong. It seems like the values() didn’t match the variable name, it just iterated over the object as on a tuple. The value of field1 is actually the object’s ID, and field2 is field1.

This is pretty easy to fix in such simple models, but my real models are quite complex, and they have a different number of fields. How do I union them correctly?

EDIT

Below you can find an extended example that fails regardless of the order of union() and values() – the models are slightly bigger now, and it seems that the different fields count somehow confuses Django:

# models

class ModelA(models.Model):
    field_1a = models.CharField(max_length=32)
    field_1aa = models.CharField(max_length=32, null=True)
    field_1aaa = models.CharField(max_length=32, null=True)
    field_2a = models.CharField(max_length=32)
    extra_a = models.CharField(max_length=32)


class ModelB(models.Model):
    extra = models.CharField(max_length=32)
    field_1b = models.CharField(max_length=32)
    field_2b = models.CharField(max_length=32)
# test

ModelA.objects.create(field_1a="1a1", field_2a="1a2", extra_a="1extra")
    ModelA.objects.create(field_1a="2a1", field_2a="2a2", extra_a="2extra")
    ModelB.objects.create(field_1b="1b1", field_2b="1b2", extra="3extra")
    ModelB.objects.create(field_1b="2b1", field_2b="2b2", extra="4extra")

    values = ("field1", "field2", "extra")

    query = (
        ModelA.objects.all()
        .annotate(
            field1=F("field_1a"), field2=F("field_2a"), extra=F("extra_a")
        )
        .values(*values)
    )
    query = query.union(
        ModelB.objects.all()
        .annotate(field1=F("field_1b"), field2=F("field_2b"))
        .values(*values)
    )
# outcome

assert list(query) == [
        {"field1": "1a1", "field2": "1a2", "extra": "1extra"},
        {"field1": "2a1", "field2": "2a2", "extra": "2extra"},
        {"field1": "3extra", "field2": "1b1", "extra": "1b2"},
        {"field1": "4extra", "field2": "2b1", "extra": "2b2"},
    ]
Asked By: Djent

||

Answers:

I dug through the docs and must admit I did not fully understand why your approach did not work (according to my understanding it should). What I believe is that applying union to querysets with different field names seems to be producing strange effects.

Anyway, applying values before doing the union seems to produce correct results:

query = ModelA.objects.all().annotate(field1=F("field_1a"), field2=F("field_2a")).values('field1', 'field2')
query = query.union(ModelB.objects.all().annotate(field1=F("field_1b"), field2=F("field_2b")).values('field1', 'field2'))

Which results in this queryset

[
    {'field1': '1a1', 'field2': '1a2'}, 
    {'field1': '1b1', 'field2': '1b2'}, 
    {'field1': '2a1', 'field2': '2a2'}, 
    {'field1': '2b1', 'field2': '2b2'}
]
Answered By: Chris

After some debugging and going through the source code, I have an idea why this is happening. What I am going to do is try to explain that why doing annotate + values results in displaying the id and what is the difference between the two cases above.

To keep things simple, I will write also write the possible resulting sql query for each statement.

1. annotate first but get values on union query

qs1 = ModelA.objects.all().annotate(field1=F("field_1a"), field2=F("field_2a"))

When writing something like this, django will get all the fields + annotated fields, so the resulting sql query looks like:

select id, field_1a, field_2a, field_1a as field1, field_2a as field2 from ModelA

So, if we have a query which is the result of:

qs = qs1.union(qs2)

the resulting sql for django looks like:

(select id, field_1a, field_2a, field_1a as field1, field_2a as field2 from ModelA)
UNION
(select id, field_1b, field_2b, field_1b as field1, field_2b as field2 from ModelB)

Let’s go deeper into how this sql is generated. When we do a union, a combinator and combined_queries is set on the qs.query and the resulting sql is generated by combining the sql of individual queries. So, in summary:

qs.sql == qs1.sql UNION qs2.sql # in abstract sense

When, we do qs.values('field1', 'field2'), the col_count in compiler is set to 2 which is the number of fields. As you can see that the union query above returns 5 columns but in the final return from compiler each row in the results is sliced using col_count. Now, this results with only 2 columns is passed back to ValuesIterable where it maps each name in the selected fields with the resulting columns. That is how it leads to the incorrect results.

2. annotate + values on individual queries and then perform union

Now, let’s see what happens when annotate is used with values directly

qs1 = ModelA.objects.all().annotate(field1=F("field_1a"), field2=F("field_2a")).values('field1', 'field2')

The resulting sql is:

select field_1a as field1, field_2a as field2 from ModelA

Now, when we do the union:

qs = qs1.union(qs2)

the sql is:

(select field_1a as field1, field_2a as field2 from ModelA)
UNION
(select field_1b as field1, field_2b as field2 from ModelB)

Now, when qs.values('field1', 'field2') executes, the number of columns returned from union query has 2 columns which is same as the col_count which is 2 and each field is matched with the individual columns producing the expected result.


3. Different field annotation count and ordering of fields

In the OP, there is a scenario when even using .values before union doesn’t produce correct results. The reason for that is that in the ModelB, there is no annotation for extra field.

So, let’s look at the queries generated for each model:

ModelA.objects.all()
        .annotate(
            field1=F("field_1a"), field2=F("field_2a"), extra=F("extra_a")
        )
        .values(*values)

The SQL becomes:

select field_1a as field1, field_2a as field2, extra_a as extra from ModelA

For ModelB:

ModelB.objects.all()
        .annotate(field1=F("field_1b"), field2=F("field_2b"))
        .values(*values)

SQL:

select extra, field_1b as field1, field_2b as field2 from ModelB

and the union is:

(select field_1a as field1, field_2a as field2, extra_a as extra from ModelA)
UNION
(select extra, field_1b as field1, field_2b as field2 from ModelB)

Because annotated fields are listed after the real db fields, the extra of ModelB is mixed with field1 of ModelB. TO make sure that you get correct results, please make sure that the ordering of fields in generated SQL is always correct – with or without annotation. In this case, I will suggest to annotate extra on ModelB as well.

Answered By: AKS

One answer I ask ChatGPT is:
In this answer, just like above, we need to annotate the whole queryset,1 and change the value in union operation every time we want.

To make the story short, you must union separated querysets (the querysets, should not have anything in common!!).

Note: all annotation operation, must be before union in DJANGO

reason why this is happening (ChatGPT): When you perform a union operation in Django, the resulting queryset will have the annotations from the first queryset. In order to annotate different values for each union operation, you need to perform the annotations on the individual querysets before performing the union.

EXAMPLE WRONG:
the union is along with the annotation in each iteration.

    q = Text.objects.none()


    for page in range(start_page, end_page):
        sura_aya = json.loads(page_data[f"{page}"])
        sura_aya_next = json.loads(page_data[f"{page + 1}"])
        q = q | Text.objects.filter(sura=sura_aya[0]).annotate(page=Value(page))

EXAMPLE CORRECT:
the union is completely after the annotation operation.


querysets_to_union = []
    for page in range(start_page, end_page):
        sura_aya = json.loads(page_data[f"{page}"])
        sura_aya_next = json.loads(page_data[f"{page + 1}"])
        qs = Text.objects.filter(sura=sura_aya[0], aya__gte=sura_aya[1], aya__lte=sura_aya_next[1])
        qs = qs.annotate(page=Value(page))
        querysets_to_union.append(qs)

for qs in querysets_to_union[1:]:
    final_qs = final_qs.union(qs)



enter image description here

Answered By: zamoosh