Count number of occurrences of dictionary values as nested list based on a query

Question:

I built this inverted index:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...}, 

    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}

}

For example, the term experiment appears in document 1 one time at index zero, in document 30 two times at indices 12 and 40, etc. I am wondering how I could count the number of occurrences of each term in the dictionary based on a dictionary of queries that looks like this:

{
    'q1'  : ['similar', 'law', ..., 'speed', 'aircraft'],
    'q2'  : ['structur', 'aeroelast', ..., 'speed', 'aircraft'], 
    ...
    'q225': ['design', 'factor', ..., 'number', '5']
}

The desired output would look something like this:

{
    'q1'  : ['d51', 'd874', ..., 'd717'], 
    'q2'  : ['d51', 'd1147', ..., 'd14'],
    ...,
    'q225': ['d1313', 'd996', ..., 'd193']
}

With keys representing the query and values representing the documents that the query appeared in, and the list would be sorted in descending order of total term frequencies

Asked By: Julien

||

Answers:

Map queries to document vectors

A document vector is a dict with items (document, word_count). These vectors can be added together by summing the word count for matching document keys with a default word_count of 0.

CONVERT INDEX TO DOC VECTORS

full_index = {
    'experiment': {'d1': [1, [0]],  'd30': [2, [12, 40]],  'd123': [3, [11, 45, 67]] } ,
    'study': {'d1': [1, [1]], 'd2': [2, [0, 36]],  'd207': [3, [19, 44, 59]]}
}

def count_only(docs):
    return {d: occurences[0] for d, occurences in docs.items()}

doc_vector_index = {w: count_only(docs) for w, docs in full_index.items()}

MAP LIST OF QUERY WORDS TO LIST OF DOC VECTORS

for q, words in queries.items():
    vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]

SUM DOC VECTORS AND SORT

def doc_vector_add(ldoc, rdoc):
    res = ldoc.copy()
    for doc, count in rdoc.items():
        res[doc] = ldoc.get(doc,0) + count
    return res

for q, words in queries.items():
    vectors = [doc_vector_index[word] for word in words if word in doc_vector_index.keys()]
    total_vector = dict(sorted(functools.reduce(doc_vector_add, vectors, {}).items(), 
        key=lambda item: item[1], 
        reverse=True))
    output[q] = list(total_vector.keys())

The summation of doc vectors is handled using reduce functools.reduce(doc_vector_add, vectors, {}). This produces the doc vector that is the sum of the individual vectors for each word in the query. sorted is used to sort the keys of the vector.

LIMIT TO TOP N DOCUMENTS

max_doc_limit = 10
output[q] = list(total_vector.keys())[:max_doc_limit]

Limiting the documents can be handled by slicing before assigning to the output.

ORDER BY COUNT DESC, DOC_ID ASC

sorted(...,key=lambda item: (item[1], -1*int(item[0][1:]),...)

We can change the sorting order of the output by changing the key function passed to sorted. We use a trick of multiplying the second element in the tuple by -1 to reverse the order from descending to ascending.

Answered By: SargeATM