How are lists converted into sets in Python 3? And what's the performance/complexity of this conversion?

Question:

Maybe I missed something, but I couldn’t find an answer to this on SO or the Python docs.

Got curious about what actually happens under the hood when we write the following code:

maybe_smaller_set = set(some_huge_list)

Since sets don’t have duplicates, I imagine there’s something like a list comprehension or some clever trick (maybe using map() or filter()?) that occurs to avoid iterating over everything in some_huge_list.

So my questions are these:

  1. How does Python actually do this conversion from a list instance into a set instance?
  2. And how costly is this conversion in terms of algorithmic complexity (ie, big-O)?
Asked By: Kevin Flowers Jr

||

Answers:

Python sets are based on hash tables. As such, insertion operations, per-element, are average/expected case amortized O(1), though in pathological cases (or when the insertion requires a rehash, which becomes less and less frequent as it grows) they can approach O(n).

There is no trick that can be used to avoid iterating the input list though; you can’t iterate all the elements without iterating all the elements in a general purpose data structure, no matter the language. The expected case for constructing a set from a list will be O(n), and the theoretical worst case scenario (if literally every element’s hash code collides with every other element, but all elements are unique) would be O(n²).

As for how Python does it, it’s nothing special; internally, it’s roughly equivalent to:

newself = set()
newself.update(iterable)

which is in turn roughly equivalent to:

newself = set()
for elem in iterable:
    newself.add(elem)

it just pushes all the work to the C layer (on the CPython reference interpreter), bypassing the Python layer work such a loop would entail.

To correct an additional misapprehension: There are no clever tricks for map or filter either. In the best case scenario (when called with a built-in function implemented in C), both of them just avoid a little Python level overhead that the equivalent generator expression would entail. But they’re fully general purpose; they use no special tricks for list or set inputs or outputs, they just steadily grind through the input iterator, mapping/filtering each element as they’re requested (when the map/filter object is itself iterated). They’re definitionally O(n).

Answered By: ShadowRanger