mongoengine query for duplicates in embedded documentlist

Question:

I’m making a Python app with mongoengine where I have a MongoDB database of n users, and each user holds n daily records. I have a list of n new records per user that I want to add to my db.

I want to check whether a record for a certain date already exists for a user before adding a new record to that user.

What I found in the docs is to iterate through every embedded document in the list to check for duplicate fields, but that’s an O(n^2) algorithm and took a solid 5 seconds for 300 records, which is far too long. Below is an abbreviated version of the code.

There’s gotta be a better way to query, right? I tried accessing something like user.records.date but that throws a not-found error.

import mongoengine

# snippet here is abbreviated and does not run
# zone of interest is in conditional_insert()

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required = True)
    #contents = ...

class User(mongoengine.Document):
    #meta{}
    #account details
    records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)


def conditional_insert(user, new_record):
    # the docs tell me to iterate through every record in the user
    # there has to be a better way
    for r in user.records:
        if str(new_record.date) == str(r.date):
            # compared as strings because python kept converting
            # the datetime obj to str
            return
    # if no record with a duplicate date is found, insert the new record
    save_record(user, new_record)

def save_record(user, new_record): pass  # abbreviated


if __name__ == "__main__":
    lst_to_insert = [] # list of (user, record_to_insert)

    for user, record in lst_to_insert: #O(n)
        conditional_insert(user, record) #O(n)

    # and I have n lst_to_insert, so in reality I'm currently at O(n^3)
Asked By: rubhemaju


Answers:

Mongo cannot conveniently offer you suitable indexes, very sad.

You frequently iterate over user.records.
If you can afford to allocate the memory for 300 users,
just iterate once and throw them into a set, which

  1. offers O(1) constant time lookup, and
  2. offers RAM speed rather than network latency

When you save a user, also make note of it with cache.add((user_id, str(new_record.date))).
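A minimal sketch of that cache, assuming `save_record` persists the embedded record as in the question (here it is passed in as a callable so the sketch stays self-contained; the stand-in user/record objects in the usage below are plain classes, not mongoengine documents):

```python
def build_cache(users):
    """One pass over every user's records: O(total records) once,
    then O(1) membership checks afterwards."""
    cache = set()
    for user in users:
        for record in user.records:
            cache.add((user.id, str(record.date)))
    return cache

def conditional_insert(cache, user, new_record, save_record):
    key = (user.id, str(new_record.date))
    if key in cache:
        return False            # duplicate date for this user, skip
    save_record(user, new_record)
    cache.add(key)              # keep the cache in sync with the DB
    return True
```

Build the cache once before the insert loop, not once per record, or you are back to the original quadratic behavior.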

EDIT

If you can’t afford the memory for all those (user_id, date) tuples,
then sure, a relational database JOIN is fair,
it’s just an out-of-core merge of sorted records.

I have had good results with using sqlalchemy to
hit local sqlite (memory or file-backed), or
heavier databases like Postgres or MariaDB.
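A sketch of that idea using the stdlib sqlite3 module directly, for brevity (sqlalchemy would wrap the same thing); the table and function names are mine, and a file path instead of ":memory:" gives you the file-backed, out-of-core variant:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # or a file path for out-of-core storage
con.execute(
    "CREATE TABLE seen (user_id INTEGER, date TEXT, UNIQUE(user_id, date))"
)

def is_new(user_id, date):
    """True the first time a (user_id, date) pair is seen, False after.
    The UNIQUE constraint does the duplicate detection for us."""
    try:
        with con:  # commit on success, roll back on error
            con.execute("INSERT INTO seen VALUES (?, ?)", (user_id, str(date)))
        return True
    except sqlite3.IntegrityError:
        return False
```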

Bear in mind that relational databases offer lovely ACID guarantees,
but you’re paying for those guarantees. In this application
it doesn’t sound like you need such properties.
Something as simple as /usr/bin/sort could do an out-of-core
ordering operation that puts all of a user’s current
records right next to his historic records,
letting you filter them appropriately.
Sleepycat is not an RDBMS, but its B-tree does offer
external sorting, sufficient for the problem at hand.
(And yes, one can do transactions with sleepycat,
but again this problem just needs some pretty pedestrian reporting.)

Bench it and see.
Without profiling data for a specific workload,
it’s pretty hard to tell if any extra complexity
would be worth it.
Identify the true memory or CPU bottleneck,
and focus just on that.


You don’t necessarily need ordering, as hashing would suffice,
given enough core.
Send those tuples to a redis cache, and make it his problem
to store them somewhere.

Answered By: J_H

Hi everyone (and future me, who will probably search for the same question 10 years from now).

I optimized the code using the idea of a search tree. Instead of putting all records in a single list on User, I broke it down by year and month:

class EmbeddedRecord(mongoengine.EmbeddedDocument):
    date = mongoengine.DateField(required = True)
    #contents = ...

class Month(mongoengine.EmbeddedDocument):
    daily_records = mongoengine.EmbeddedDocumentListField(EmbeddedRecord)

class Year(mongoengine.EmbeddedDocument):
    monthly_records = mongoengine.EmbeddedDocumentListField(Month)

class User(mongoengine.Document):
    #meta{}
    #account details
    yearly_records = mongoengine.EmbeddedDocumentListField(Year)

Because it’s MongoDB, I can later partition by decades, heck even centuries, but by that point I don’t think this code will still be relevant.

I then group the data to insert by month into separate pandas dataframes and feed each dataframe separately. The data flow thus looks like:

0) get the monthly df
1) loop through years until we get the right one (let's say 10 steps; I don't think my program will live that long)
2) loop through months until we get the right one (12 steps)
3) for each record in the df, loop through each daily record in the month to check for duplicates

The insert-with-check algorithm is still O(n^2), but since there are at most 31 records at the last step, the code is much faster. I tested 2000 duplicate records and it ran in under a second (I didn't actually time it, but as long as it feels instant it won't matter much for my use case).
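The lookup described above can be sketched roughly like this, assuming Year and Month also carry `year` / `month` fields that the abbreviated schema doesn't show (the objects in the usage below are plain stand-ins, not mongoengine documents):

```python
def find_month(user, year, month):
    """Walk yearly_records then monthly_records: at most ~10 + 12 steps."""
    for y in user.yearly_records:
        if y.year == year:
            for m in y.monthly_records:
                if m.month == month:
                    return m
    return None

def insert_if_new(month_doc, new_record):
    """Final duplicate check scans at most 31 daily records."""
    if any(str(r.date) == str(new_record.date) for r in month_doc.daily_records):
        return False
    month_doc.daily_records.append(new_record)
    return True
```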

Answered By: rubhemaju