Is "add" operation in set() or "insert" in dict() in Python actually O(n) where n is the length of the key string?

Question:

There seems to be a contradiction about whether the insert operation on a dict() or the add operation on a set() is O(n) or O(1), where n is the length of the string key.

Suppose we have strings n_1, n_2, ..., n_x which vary in length. Then the time complexity of performing the following:

s = set()
d = dict()
for x in N:  # where N = [n_1, n_2, ..., n_x]
    s.add(x)
    d[x] = 1

is O(Z), where Z = len(n_1) + len(n_2) + ... + len(n_x).
If we assume that add or insert is an O(1) operation, then the time complexity will be O(len(N)).

Is the above true?

From: http://svn.python.org/projects/python/trunk/Objects/stringobject.c
we see that the computation of the hash depends on the length of the string, which is what I assume len refers to below:

static long string_hash(PyStringObject *a)
{
    register Py_ssize_t len;
    register unsigned char *p;
    register long x;

    if (a->ob_shash != -1)
        return a->ob_shash;
    len = Py_SIZE(a);
    p = (unsigned char *) a->ob_sval;
    x = *p << 7;
    while (--len >= 0)
        x = (1000003*x) ^ *p++;
    x ^= Py_SIZE(a);
    if (x == -1)
        x = -2;
    a->ob_shash = x;
    return x;
}

Here (efficiency of long (str) keys in python dictionary), someone showed that varying the length of the string does not affect the time it takes to compute a hash. But that contradicts the code above.
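A quick way to check this is to time the first hash() call on freshly created strings of different lengths (just a rough sketch; the exact lengths and timings are arbitrary and will vary by machine):

import time

for length in (10**4, 10**6, 10**8):
    s = 'x' * length              # fresh str object, so no cached hash yet
    t0 = time.perf_counter()
    hash(s)                       # first hash() walks the whole string
    t1 = time.perf_counter()
    print(f"len={length:>9}: first hash() took {t1 - t0:.6f} s")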

From the following link (Get dictionary keys hashes without recalculation) we know that the hash value, once computed, is stored in the object.
This means that a lookup will be constant time, O(1).
However, the insertion/adding, which is when the hash is actually computed, should be linear in the length of the key.
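That caching is easy to observe: the first hash() of a large string costs O(n), while the second call just returns the stored value (again, a rough sketch):

import time

s = 'x' * 10**8                   # large string; hash not computed yet
t0 = time.perf_counter()
hash(s)                           # computes and caches the hash: O(len(s))
t1 = time.perf_counter()
hash(s)                           # returns the cached value: O(1)
t2 = time.perf_counter()
print(f"first:  {t1 - t0:.6f} s")
print(f"second: {t2 - t1:.6f} s")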

Asked By: block-ch


Answers:

There are a gazillion things that the performance of insert depends on. The calculation of the hash function is indeed O(k) for a string of length k, but that is just uninteresting in the general case.

If you consider string keys only 8 bytes in length, there are 18446744073709551616 different combinations, and 8 is a constant: the calculation of the hash for an 8-byte key is O(8), which is O(1).

But at 18446744073709551616 items, insertion into the hash table could still take about 1 µs. Compare that with a list, where insertion at the beginning is O(n): even if inserting/copying a single item at the end of the list took only one nanosecond, inserting at the beginning of a list with that many items could take 585 years.
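As a back-of-the-envelope check on that figure (assuming, as above, one nanosecond per element shifted):

items = 2**64                           # 18446744073709551616
seconds = items * 1e-9                  # one nanosecond per element shifted
years = seconds / (365.25 * 24 * 3600)
print(f"~{years:.0f} years")            # roughly 585 years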

OTOH, while it is conceivable that you might have a collection of 4294967296 or even 18446744073709551616 items, if you've got a key of 4294967296 or 18446744073709551616 bytes in your hash table, you seriously need to rethink your architecture.

I added a new answer at the link that you gave, and I link it again here:
https://stackoverflow.com/a/70252226/800337

To justify why it’s O(N) on the length of the key, and why a cached hash value doesn’t really matter, imagine this scenario…

You have a filesystem full of files, with an average size of 30 Kbytes each. You want to find duplicate files, and your strategy is to read files one at a time, and do something like this:

with open(filespec,'rb') as f:
    contents = f.read()
# If file contents are new, insert the contents as a key
#  and the filespec as the value
# If file contents are a duplicate, don't change the
#  dictionary, and return the first filespec where we
#  saw these contents.
spec = filedict.setdefault(contents, filespec)
if spec != filespec:
    print(f'{filespec} is a duplicate of {spec}')
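The snippet above leaves out the surrounding loop; a self-contained version might look roughly like this, where the os.walk() traversal and the empty filedict are just one possible way to drive it:

import os

filedict = {}                                   # file contents -> first filespec seen
for dirpath, _dirs, filenames in os.walk('.'):  # '.' is just an example root
    for name in filenames:
        filespec = os.path.join(dirpath, name)
        with open(filespec, 'rb') as f:
            contents = f.read()
        # New contents: stored with this filespec as the value.
        # Duplicate contents: dict unchanged, first filespec returned.
        spec = filedict.setdefault(contents, filespec)
        if spec != filespec:
            print(f'{filespec} is a duplicate of {spec}')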

I’m not saying this is the best way to find duplicate files; there are
clearly better ways. But let's analyze it anyway.

For each file that we read in, the call to the setdefault() method does
need to hash the file contents. This is because we didn't get the
content bytes object from someplace that had already hashed it; we got
it from a file read. (If we were looking up the same file content bytes
in multiple dictionaries, then yes, the second and subsequent lookups
could hash in O(1), but in this use case, that doesn't apply.)

So lookup of a newly generated bytes object is O(N) just because we haven’t cached a hash value yet.

Now assume that the lookup actually hits in the dictionary — we’ve seen a file before with the same contents. The lookup isn’t over just because we found an item in the right hash bucket — the implementation is now going to do a bytes comparison of the new bytes object against the bytes object in the dictionary. Because they match, the entire bytes object will be compared — all 30 Kbytes or whatever.

So for this use case, finding a duplicate file results in two operations which are O(N) with respect to the length of the files — after reading the file, we need to hash it, and on lookup, we need to do a complete byte comparison of the new key and the existing key.

A key takeaway from this is that if you’re looking up a lengthy key which is already in the dictionary, you’re going to do that O(N) comparison of the key being looked up against the key that’s in the dictionary — unless the two keys are the same exact object, in which case Python is generally smart enough to trivially decide that if (a is b) then (a == b).
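Both effects are easy to see with two keys that are equal but not the same object; here the duplicate's hash is pre-cached so the timing isolates the byte comparison (a rough sketch):

import time

s = 'x' * 10**8
d = {s: 1}
dup = 'x' * 10**8                  # equal to s, but a distinct object

hash(dup)                          # pre-cache dup's hash so only the compare is timed

t0 = time.perf_counter()
_ = d[s]                           # same object: identity check, no byte comparison
t1 = time.perf_counter()
_ = d[dup]                         # distinct object: full O(N) byte comparison
t2 = time.perf_counter()
print(f"same-object lookup:     {t1 - t0:.6f} s")
print(f"equal-duplicate lookup: {t2 - t1:.6f} s")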

However, as implied by the timing statistics I gave in the other linked answer, you can probably assume that for small keys up to about 1K in length, the O(N) cost of hashing and key comparison is small enough to be ignored. It doesn't really start to show up in a noticeable way until key lengths climb significantly above 1K.

Answered By: John Brennen