Create an inverted index from a dictionary with document IDs as keys and lists of terms as values

Question:

I have created the following dictionary from the Cranfield Collection:

{
    'd1'   : ['experiment', 'studi', ..., 'configur', 'experi', '.'], 
    'd2'   : ['studi', 'high-spe', ..., 'steadi', 'flow', '.'],
    ..., 
    'd1400': ['report', 'extens', ..., 'graphic', 'form', '.']
}

Each key-value pair represents a single document: the key is the document ID and the value is a list of tokenized, stemmed words with stopwords removed. I need to create an inverted index from this dictionary in the following format:

{
    'experiment': {'d1': [1, [0]], ..., 'd30': [2, [12, 40]], ..., 'd123': [3, [11, 45, 67]], ...}, 

    'studi': {'d1': [1, [1]], 'd2': [2, [0, 36]], ..., 'd207': [3, [19, 44, 59]], ...}

    ...
}

Here the key is the term, and the value is a dictionary that maps each document the term appears in to the number of occurrences and the positions (list indices) within that document where the term is found. I am not sure how to approach this conversion, so I am just looking for some starter pointers on how to think about this problem. Thank you.

Asked By: Julien

||

Answers:

I hope I’ve understood your question well:

dct = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
    "d1400": ["report", "extens", "graphic", "form", "."],
}

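# Pass 1: map each term to {doc_id: [positions]}.
out = {}
for doc_id, terms in dct.items():
    for idx, word in enumerate(terms):
        out.setdefault(word, {}).setdefault(doc_id, []).append(idx)

# Pass 2: rewrite each positions list in place as [count, [positions]].
for postings in out.values():
    for positions in postings.values():
        positions[:] = [len(positions), list(positions)]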

print(out)

Prints (formatted here for readability):

{
    "experiment": {"d1": [1, [0]]},
    "studi": {"d1": [1, [1]], "d2": [1, [0]]},
    "configur": {"d1": [1, [2]]},
    "experi": {"d1": [1, [3]]},
    ".": {"d1": [1, [4]], "d2": [1, [5]], "d1400": [1, [4]]},
    "high-spe": {"d2": [1, [1]]},
    "steadi": {"d2": [1, [2]]},
    "flow": {"d2": [2, [3, 4]]},
    "report": {"d1400": [1, [0]]},
    "extens": {"d1400": [1, [1]]},
    "graphic": {"d1400": [1, [2]]},
    "form": {"d1400": [1, [3]]},
}
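
If a single pass is preferred, the same structure can be built by keeping the count and the positions together from the start. This is only a minimal sketch under the same assumptions about the input (a dict of document ID to list of terms); the function name build_inverted_index is hypothetical and chosen for illustration:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [count, [positions]]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for pos, term in enumerate(terms):
            # The first occurrence of a term in a document starts a fresh entry.
            entry = index[term].setdefault(doc_id, [0, []])
            entry[0] += 1         # number of occurrences in this document
            entry[1].append(pos)  # position of this occurrence
    return dict(index)            # plain dict, so missing terms raise KeyError


docs = {
    "d1": ["experiment", "studi", "configur", "experi", "."],
    "d2": ["studi", "high-spe", "steadi", "flow", "flow", "."],
}
index = build_inverted_index(docs)
print(index["flow"])   # {'d2': [2, [3, 4]]}
print(index["studi"])  # {'d1': [1, [1]], 'd2': [1, [0]]}
```

Looking up a term then gives, for each document, the count at position 0 of the entry and the list of indices at position 1.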
Answered By: Andrej Kesely