Is there anything faster than dict()?

Question:

I need a faster way to store and access around 3GB of k:v pairs, where k is a string or an integer and v is an np.array() that can be of different shapes.

Is there any object that is faster than the standard python dict in storing and accessing such a table? For example, a pandas.DataFrame?

As far as I have understood, the Python dict is a quite fast implementation of a hash table. Is there anything better than that for my specific case?

Asked By: alec_djinn


Answers:

No, there is nothing faster than a dictionary for this task, because the complexity of its indexing (getting and setting items) and even membership checking is O(1) on average. (Check the complexity of the rest of its operations on the Python wiki: https://wiki.python.org/moin/TimeComplexity )

Once you have saved your items in a dictionary, you can access them in constant time, which means it is unlikely that your performance problem has anything to do with dictionary indexing. That being said, you might still be able to make this process slightly faster by making some changes to your objects and their types that allow some optimizations in the under-the-hood operations.

For example, if your strings (keys) are not very large, you can intern the lookup key and your dictionary’s keys. Interning means caching the objects in memory (in Python, in a table of "interned" strings) rather than creating each of them as a separate object.

Python provides an intern() function in the sys module that you can use for this.

Enter string in the table of “interned” strings and return the interned string – which is string itself or a copy. Interning strings is useful to gain a little performance on dictionary lookup

also …

If the keys in a dictionary are interned and the lookup key is interned, the key comparisons (after hashing) can be done by a pointer comparison instead of comparing the string values themselves, which in turn reduces the access time.

Here is an example:

In [48]: import sys

In [49]: d = {'mystr{}'.format(i): i for i in range(30)}

In [50]: %timeit d['mystr25']
10000000 loops, best of 3: 46.9 ns per loop

In [51]: d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}

In [52]: %timeit d['mystr25']
10000000 loops, best of 3: 38.8 ns per loop
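
Note that the literal 'mystr25' used in the benchmark above is typically interned by CPython anyway, since it looks like an identifier. When the lookup key is built at runtime (read from a file, concatenated, etc.), it usually is not, so it can pay to intern it explicitly as well. A minimal sketch (the names are just for illustration):

import sys

d = {sys.intern('mystr{}'.format(i)): i for i in range(30)}

# A lookup key constructed at runtime is generally not interned
# automatically -- intern it once and reuse it for repeated lookups.
key = sys.intern('mystr{}'.format(25))
print(d[key])  # 25; the key comparison can succeed on identity alone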
Answered By: Mazdak

You could consider storing them in a data structure like a trie, given that your keys are strings. Even to store and retrieve from a trie you need O(N), where N is the maximum length of a key. The same happens for the hash calculation that computes the hash of the key; the hash is then used to find and store the entry in the hash table. We often don’t account for this hashing time.

You may give a trie a shot. It should offer roughly comparable performance, and may even be a little faster if the hash value is computed with something like

HASH[i] = (HASH[i-1] + key[i-1]*256^i % BUCKET_SIZE ) % BUCKET_SIZE 

or something similar (the 256^i factor is needed to reduce collisions).

You can try storing them in a trie and see how it performs.
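
For illustration, here is a minimal pure-Python trie sketch built from nested dicts (the class and method names are just for illustration). In CPython, every character costs one dict lookup, so a pure-Python trie like this will usually be slower than a single flat dict lookup; it only shows the structure described above.

# A minimal trie sketch: every node is a plain dict, and the reserved
# key None marks the end of a stored key and holds its value.
class Trie:
    def __init__(self):
        self.root = {}

    def set(self, key, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[None] = value

    def get(self, key):
        node = self.root
        for ch in key:
            node = node[ch]      # raises KeyError for unknown keys
        return node[None]

trie = Trie()
trie.set("alpha", 1)
trie.set("alphabet", 2)
print(trie.get("alpha"), trie.get("alphabet"))  # 1 2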

Answered By: sonus21

No, I don’t think there is anything faster than dict. The time complexity of its item lookup is O(1) on average.

-------------------------------------------------------
Operation    |  Average Case  | Amortized Worst Case  |
-------------------------------------------------------
Copy[2]      |    O(n)        |       O(n)            | 
Get Item     |    O(1)        |       O(n)            | 
Set Item[1]  |    O(1)        |       O(n)            | 
Delete Item  |    O(1)        |       O(n)            | 
Iteration[2] |    O(n)        |       O(n)            | 
-------------------------------------------------------

PS https://wiki.python.org/moin/TimeComplexity

Answered By: akash karothiya

A numpy array vs. simple dict = {} comparison:

import numpy
from timeit import default_timer as timer

my_array = numpy.ones([400,400])

def read_out_array_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_array[i,j]


start = timer()
read_out_array_values()
end = timer()
print("Time for array calculations:" + str(end - start))


my_dict = {}
for i in range(400):
    for j in range(400):
        my_dict[i,j] = 1

def read_out_dict_values():
    cumsum = 0
    for i in range(400):
        for j in range(400):
            cumsum += my_dict[i,j]
    
start = timer()
read_out_dict_values()
end = timer()
print("Time for dict calculations:" + str(end - start))

Prints:

Time for array calculations: 0.07558204099999999
Time for dict calculations: 0.046898419999999996

Time for array calculations: 0.07849989000000002
Time for dict calculations: 0.047769446000000104
Answered By: poetyi

One would think that array indexing is faster than hash lookup.

So if we could store this data in a numpy array, and assume the keys are not strings but numbers, would that be faster than a Python dictionary?

Unfortunately not, because NumPy is optimized for vector operations, not for looking up individual values.
Pandas fares even worse.
See the experiment here: https://nbviewer.jupyter.org/github/annotation/text-fabric/blob/master/test/pandas/pandas.ipynb
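
You can reproduce the per-element effect yourself with a quick sketch like the following (absolute numbers will vary by machine):

import timeit
import numpy as np

n = 100_000
arr = np.arange(n)              # integer "keys" 0..n-1 stored in a NumPy array
dct = {i: i for i in range(n)}  # the same data in a plain dict

# Single-element access: the dict lookup is typically faster here, because
# NumPy has to box a scalar object on every individual access.
print("dict :", timeit.timeit(lambda: dct[50_000], number=1_000_000))
print("numpy:", timeit.timeit(lambda: arr[50_000], number=1_000_000))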

The other candidate could be the Python array, in the array module. But that is not usable for variable-size values.
And in order to make this work, you would probably need to wrap it in some pure Python code, which would negate the performance gains that the array offers.

So, even if the requirements of the OP are relaxed, there still does not seem to be a faster option than dictionaries.

Answered By: Dirk Roorda