Complexity of len() with regard to sets and lists

Question:

The complexity of len() with regards to sets and lists is equally O(1). How come it takes more time to process sets?

~$ python -m timeit "a=[1,2,3,4,5,6,7,8,9,10];len(a)"
10000000 loops, best of 3: 0.168 usec per loop
~$ python -m timeit "a={1,2,3,4,5,6,7,8,9,10};len(a)"
1000000 loops, best of 3: 0.375 usec per loop

Is it related to the particular benchmark, as in, it takes more time to build sets than lists and the benchmark takes that into account as well?

If the creation of a set object takes more time compared to creating a list, what would be the underlying reason?

Asked By: Omid

||

Answers:

Yes, you are right: the difference comes mostly from the different time Python needs to create the set and list objects. For a fairer benchmark, use the timeit module and pass the objects through the setup argument:

from timeit import timeit

print('1st: ', timeit(stmt="len(a)", number=1000000, setup="a=set([1,2,3]*1000)"))
print('2nd : ', timeit(stmt="len(a)", number=1000000, setup="a=[1,2,3]*1000"))

result :

1st:  0.04927110672
2nd :  0.0530669689178

If you want to know why this is so, let's look inside Python. A set object uses a hash table, and a hash table uses a hash function to compute hash values for the items and map them to slots in the table. Calling the hash function, computing the hash values, and some other extra tasks all take time. To create a list, by contrast, Python just builds a sequence of objects that you can access by index.

You can find more details in the set_lookkey function in the CPython source code.

Also note that if two algorithms have the same complexity, it does not mean that they have exactly the same run time or execution speed, because big O notation describes the limiting behavior of a function and does not show the exact complexity equation.
For example, the complexity of both f(x)=100000x+1 and f(x)=4x+20 is O(x), which means that both are linear, but as you can see the first function has a much larger slope, and for the same input they give different results.
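
To make the point about constant factors concrete, here is a small sketch using the two functions from the example above. The names f_steep and f_shallow are made up for illustration.

```python
def f_steep(x):
    # Linear in x, with a large slope.
    return 100000 * x + 1


def f_shallow(x):
    # Also linear in x, with a small slope.
    return 4 * x + 20


for x in (1, 10, 100):
    # Same complexity class, very different values for the same input.
    print(x, f_steep(x), f_shallow(x))
```

Both functions grow in proportion to x, yet their values differ by several orders of magnitude, which is the sense in which equal complexity does not mean equal cost.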

Answered By: Mazdak

Remove the len(a) statement and the result is pretty much the same. The elements of a set need to be hashed so that only distinct items are retained, so creating it is slower.
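
A rough way to check this is to time only the construction of each literal, with no len() call at all. This is just a sketch: the absolute numbers depend on the machine and the Python version, and the variable names are mine.

```python
from timeit import timeit

LOOPS = 100000

# Time only the construction of each container; len() is never called.
list_time = timeit("[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]", number=LOOPS)
set_time = timeit("{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}", number=LOOPS)

print("list literal:", list_time)
print("set literal: ", set_time)
```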

Answered By: Code Different

The relevant lines are http://svn.python.org/view/python/trunk/Objects/setobject.c?view=markup#l640

640     static Py_ssize_t
641     set_len(PyObject *so)
642     {
643         return ((PySetObject *)so)->used;
644     }

and http://svn.python.org/view/python/trunk/Objects/listobject.c?view=markup#l431

431     static Py_ssize_t
432     list_length(PyListObject *a)
433     {
434         return Py_SIZE(a);
435     }

Both are only a static lookup: each function simply returns a size field that is already stored in the object.

So what is the difference, you may ask? You are measuring the creation of the objects, too, and it is a little more time-consuming to create a set than a list.

Answered By: Kijewski

Firstly, you have not measured the speed of len(), you have measured the speed of creating a list/set together with the speed of len().

Use the --setup argument of timeit:

$ python -m timeit --setup "a=[1,2,3,4,5,6,7,8,9,10]" "len(a)"
10000000 loops, best of 3: 0.0369 usec per loop
$ python -m timeit --setup "a={1,2,3,4,5,6,7,8,9,10}" "len(a)"
10000000 loops, best of 3: 0.0372 usec per loop

The statements you pass to --setup are run before measuring the speed of len().

Secondly, you should note that len(a) is a pretty quick statement. The process of measuring its speed may be subject to “noise”. Consider that the code executed (and measured) by timeit is equivalent to the following:

for i in itertools.repeat(None, number):
    len(a)

Because both len(a) and itertools.repeat(...).__next__() are fast operations and their speeds may be similar, the speed of itertools.repeat(...).__next__() may influence the timings.

For this reason, you’d better measure len(a); len(a); ...; len(a) (repeated 1000 times or so) so that the body of the for loop takes considerably more time than the iterator:

$ python -m timeit --setup "a=[1,2,3,4,5,6,7,8,9,10]" "$(for i in {0..1000}; do echo "len(a)"; done)"
10000 loops, best of 3: 29.2 usec per loop
$ python -m timeit --setup "a={1,2,3,4,5,6,7,8,9,10}" "$(for i in {0..1000}; do echo "len(a)"; done)"
10000 loops, best of 3: 29.3 usec per loop

(The results still say that len() performs the same on lists and sets, but now you can be more confident that the result is correct.)
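
The same idea can be sketched from within Python instead of the shell. This assumes the timeit.timeit function from the standard library; REPEATS, loops and the other names are arbitrary choices for illustration.

```python
from timeit import timeit

# Build a statement that calls len(a) many times per loop iteration,
# so the loop overhead is small compared with the measured body.
REPEATS = 1000
stmt = "\n".join(["len(a)"] * REPEATS)

loops = 1000
list_total = timeit(stmt, setup="a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]", number=loops)
set_total = timeit(stmt, setup="a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}", number=loops)

# Convert the totals to an approximate time per single len() call.
print("list:", list_total / (loops * REPEATS))
print("set: ", set_total / (loops * REPEATS))
```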

Thirdly, it’s true that “complexity” and “speed” are related, but I believe you are conflating the two. The fact that len() has O(1) complexity for lists and sets does not imply that it must run at the same speed on lists and sets.

It means that, on average, no matter how long the list a is, len(a) performs asymptotically the same number of steps. And no matter how long the set b is, len(b) performs asymptotically the same number of steps. But the algorithms for computing the size of a list and of a set may differ, resulting in different performance (timeit shows that this is not the case here, but it is a possibility).

Lastly,

If the creation of a set object takes more time compared to creating a list, what would be the underlying reason?

A set, as you know, does not allow repeated elements. Sets in CPython are implemented as hash tables (to ensure average O(1) insertion and lookup): constructing and maintaining a hash table is much more complex than adding elements to a list.

Specifically, when constructing a set, you have to compute hashes, build the hash table, look it up to avoid inserting duplicate elements, and so on. By contrast, lists in CPython are implemented as a simple array of pointers that is malloc()ed and realloc()ed as required.
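
One way to see the extra hashing work is to count calls to __hash__ while building each container. CountingKey below is a hypothetical helper written for this sketch, not something from CPython or the standard library.

```python
class CountingKey:
    """Hypothetical helper: a hashable wrapper that counts __hash__ calls."""

    hash_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        CountingKey.hash_calls += 1
        return hash(self.value)

    def __eq__(self, other):
        return isinstance(other, CountingKey) and self.value == other.value


items = [CountingKey(i) for i in range(1000)]

CountingKey.hash_calls = 0
as_list = list(items)  # copies references; no element is hashed
list_hash_calls = CountingKey.hash_calls

CountingKey.hash_calls = 0
as_set = set(items)  # every element must be hashed to be placed in the table
set_hash_calls = CountingKey.hash_calls

print("hash calls for list:", list_hash_calls)
print("hash calls for set: ", set_hash_calls)
```

Building the list never touches __hash__, while building the set has to hash each element at least once before it can be stored.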

Answered By: Andrea Corbellini

Use the -s flag so that timeit does not take the first string (the setup) into account:

~$ python -mtimeit -s "a=range(1000);" "len(a)"
10000000 loops, best of 3: 0.0424 usec per loop
                           ↑ 

~$ python -mtimeit -s "a={i for i in range(1000)};" "len(a)"
10000000 loops, best of 3: 0.0423 usec per loop
                           ↑ 

Now it’s considering only the len function, and the results are pretty much the same since we didn’t take into account the creation time of the set/list.

Answered By: Maroun

Let me add to the excellent answers here: O(1) only tells you about the order of growth with respect to the size of the input.

O(1) in particular only means constant time with respect to the size of input.
A method may always take 0.1s, for any input, and another may take 1000 years for any input, and they’d both be O(1)

In this case, while the documentation has some degree of ambiguity, it means that the method takes roughly the same time to process a list of size 1 as it takes to process a list of size 1000; similarly, it takes the same time to process a dictionary of size 1 as it takes to process a dictionary of size 1000.

No guarantee is given with respect to different data types.

This is unsurprising since the implementation of len() at some point down the call stack can differ depending on the data type.
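
As a sketch of how this can happen, consider two made-up container classes. Both implement __len__ in constant time with respect to their size, but one does a fixed amount of extra work on every call. The class names are hypothetical and exist only for this illustration.

```python
class StoredCount:
    """Hypothetical container that keeps its size in a field."""

    def __init__(self, items):
        self._items = list(items)
        self._count = len(self._items)

    def __len__(self):
        # A single attribute read.
        return self._count


class SlowConstant:
    """Hypothetical container whose __len__ does fixed extra work."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        # The extra work is the same however many items are stored,
        # so this is still O(1), just with a larger constant.
        for _ in range(100):
            pass
        return len(self._items)


print(len(StoredCount(range(5))), len(SlowConstant(range(5))))
```

Calling len() on either object goes through the same built-in, but what runs underneath depends on the type.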

Incidentally, this ambiguity is eliminated in statically typed languages where ClassA.size() and ClassB.size() are for all intents and purposes two different methods.

Answered By: Tobia Tesan

Many have noted that O(1) is not about performance on different data types, but about performance as a function of different input sizes.

If you’re trying to test O(1)-ness, you’d be looking for something more like

~$ python -m timeit --setup "a=list(range(1000000))" "len(a)"
10000000 loops, best of 3: 0.198 usec per loop

~$ python -m timeit --setup "a=list(range(1))" "len(a)"
10000000 loops, best of 3: 0.156 usec per loop

Big data or little data, the time taken is quite similar. Per other posts, this separates setup time from testing time, but doesn’t go as far as to reduce noise of len-time vs loop-time.
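
The same check can be sketched in one script covering both lists and sets at several sizes. This assumes the standard-library timeit.timeit function; LOOPS, results and the chosen sizes are arbitrary, and like the commands above it does not try to reduce loop noise.

```python
from timeit import timeit

LOOPS = 100000
results = {}

templates = (("list", "a = list(range(%d))"), ("set", "a = set(range(%d))"))

for kind, template in templates:
    for size in (1, 1000, 1000000):
        # The container is built once in setup; only len(a) is timed.
        total = timeit("len(a)", setup=template % size, number=LOOPS)
        results[(kind, size)] = total / LOOPS
        print(kind, size, results[(kind, size)])
```

If len() is O(1), the per-call time should stay roughly flat as size grows, for both container types.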

Answered By: Bryant