sorted() using generator expressions rather than lists

Question:

After seeing the discussion here: Python – generate the time difference I got curious. I also initially thought that a generator is faster than a list, but when it comes to sorted() I don’t know. Is there any benefit to sending a generator expression to sorted() rather than a list? Does the generator expression end up being made into a list inside sorted() before sorting anyway?

EDIT: It grieves me to only be able to accept one answer, as I feel a lot of responses have helped to clarify the issue. Thanks again to everyone.

Asked By: Brent Newey

Answers:

There’s no way to sort a sequence without knowing all the elements of the sequence, so any generator passed to sorted() is exhausted.

The first thing sorted() does is to convert the data to a list. Basically the first line (after argument validation) of the implementation is

newlist = PySequence_List(seq);

See also the full source code for version 2.7 and version 3.1.2.

Edit: As pointed out in the answer by aaronasterling, the variable newlist is, well, a new list. If the parameter is already a list, it is copied. So a generator expression really has the advantage of using less memory, since no intermediate list is built only to be copied.
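
A small illustration of the two points above, using made-up sample data: sorted() leaves a list argument untouched and returns a new list, and it fully consumes a generator passed to it.

```python
# Illustration only: the sample values are arbitrary.
data = [3, 1, 2]

result = sorted(data)
print(result)          # [1, 2, 3]
print(data)            # [3, 1, 2]  (the original list is untouched)
print(result is data)  # False      (sorted() built a new list)

gen = (x * 2 for x in data)
doubled = sorted(gen)
print(doubled)         # [2, 4, 6]
print(list(gen))       # []         (the generator was exhausted by sorted())
```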

Answered By: Sven Marnach

I also initially thought that a list comprehension is faster than a list

What do you mean by faster than a list? Do you mean faster than an explicit for loop? In that case, it depends: a list comprehension is mostly syntactic sugar, but it’s very handy for simple loops.

but when it comes to sorted() I don’t know. Is there any benefit to sending a generator expression to sorted() rather than a list?

The main difference between list comprehensions and generator expressions is that generator expressions avoid the overhead of building the entire list at once. Instead, they return a generator object that yields items one at a time, so generator expressions are mostly used to save memory.
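
As a rough illustration of that difference (the variable names and the value of n are arbitrary), the list object grows with the number of items while the generator object stays small, because it computes items only when asked:

```python
import sys

n = 100_000
squares_list = [x * x for x in range(n)]   # materializes all n results now
squares_gen = (x * x for x in range(n))    # produces results lazily, on demand

# getsizeof reports the container itself, not the integers it refers to.
print(sys.getsizeof(squares_list))  # grows with n
print(sys.getsizeof(squares_gen))   # small, independent of n

print(next(squares_gen))  # 0
print(next(squares_gen))  # 1
```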

But you have to understand one thing in Python: it’s very hard to tell whether one approach is faster (better optimized) than another just by looking at it. To find out, you should use timeit for benchmarking (and benchmarking is more complex than just running timeit once on a single machine).
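
A minimal sketch of a slightly more careful measurement, assuming Python 3. The helper best_time is a made-up name, not part of timeit; it repeats each measurement several times and keeps the fastest run, which tends to be less noisy than a single call.

```python
import timeit

def best_time(stmt, setup="pass", number=200, repeat=5):
    """Return the fastest per-call time in seconds over several repeated runs."""
    runs = timeit.repeat(stmt, setup=setup, number=number, repeat=repeat)
    return min(runs) / number

# Reverse-ordered sample data; other inputs may give different results.
setup = "data = list(range(1000, 0, -1))"
t_gen = best_time("sorted(x for x in data)", setup)
t_list = best_time("sorted([x for x in data])", setup)

print(f"generator expression: {t_gen * 1e6:.1f} us per call")
print(f"list comprehension:   {t_list * 1e6:.1f} us per call")
```

The absolute numbers depend on the machine, the interpreter version, and the input, so they are only meaningful as a comparison made in one environment.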

Read this for more info about some optimization techniques.

Answered By: mouad

Python uses Timsort. Timsort needs to know the total number of elements up front, to compute the minrun parameter. Thus, as Sven reports, the first thing that sorted does when given a generator is to turn it into a list.

That said, it would be possible to write an incremental version of Timsort, which consumed values from the generator more slowly – you’d just have to fix minrun before starting, and accept the pain of having some unbalanced merges at the end. Timsort works in two phases. The first phase involves a pass through the whole array, identifying runs and doing insertion sort to make runs where the data is unordered. Both run-finding and insertion sort are inherently incremental. The second phase involves a merge of the sorted runs; that would happen exactly as now.
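
To make the two-phase idea concrete, here is a heavily simplified sketch. It is not Timsort and not how CPython works: it does not detect natural runs, and it merges all runs in one pass with heapq.merge. The function name incremental_sorted and the default minrun of 32 are assumptions chosen for illustration. It only shows that runs can be built while reading from an iterator a fixed number of items at a time.

```python
import heapq
import random
from bisect import insort
from itertools import islice

def incremental_sorted(iterable, minrun=32):
    """Sort by consuming `iterable` one fixed-size run at a time."""
    it = iter(iterable)
    runs = []
    # Phase 1: pull up to `minrun` items at a time and build a sorted run
    # from each batch using binary insertion.
    while True:
        chunk = list(islice(it, minrun))
        if not chunk:
            break
        run = []
        for item in chunk:
            insort(run, item)
        runs.append(run)
    # Phase 2: merge the sorted runs into one sorted list.
    return list(heapq.merge(*runs))

values = random.sample(range(1000), 500)
print(incremental_sorted(iter(values)) == sorted(values))  # True
```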

I don’t think there would be a lot of point in this, though. Perhaps it would make memory management easier, because rather than having to read from the generator into a constantly-growing array (as I baselessly assume the current implementation does), you could read each run into a small buffer, then only allocate a final-sized buffer once, at the end. However, this would involve having 2N slots of array in memory at once, whereas a growing array can be done with 1.5N if it doubles when it grows. So, probably not a good idea.

Answered By: Tom Anderson

There’s a huge benefit. Because sorted doesn’t affect the passed-in sequence, it has to make a copy of it. If it’s making a list from the generator expression, then only one list gets made. If a list comprehension is passed in, that list gets built first, and then sorted makes a copy of it to sort.

This is reflected in the line

newlist = PySequence_List(seq);

quoted in Sven Marnach’s answer. Essentially, this will unconditionally make a copy of whatever sequence is passed to it.
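
One way to check the memory claim on your own machine is to compare peak allocations with tracemalloc, as in the sketch below. The helper peak_bytes and the size n are assumptions for illustration, and the exact figures will vary with the interpreter version.

```python
import tracemalloc

def peak_bytes(func):
    """Return the peak number of bytes traced while func() runs."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak

n = 200_000
# List comprehension: one list is built, then sorted() copies it.
peak_from_list = peak_bytes(lambda: sorted([x for x in range(n)]))
# Generator expression: sorted() builds the only list.
peak_from_gen = peak_bytes(lambda: sorted(x for x in range(n)))

print(f"list comprehension peak:   {peak_from_list:,} bytes")
print(f"generator expression peak: {peak_from_gen:,} bytes")
```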

Answered By: aaronasterling

The easiest way to see which is faster is to use timeit, and it tells me that it’s faster to pass a list rather than a generator (Python 2 session):

>>> import random
>>> randomlist = list(range(1000))
>>> random.shuffle(randomlist)
>>> import timeit
>>> timeit.timeit("sorted(x for x in randomlist)",setup = "from __main__ import randomlist",number = 10000)
4.944492386602178
>>> timeit.timeit("sorted([x for x in randomlist])",setup = "from __main__ import randomlist",number = 10000)
4.635165083830486

And:

>>> timeit.timeit("sorted(x for x in xrange(1000,1,-1))",number = 10000)
1.411807087213674
>>> timeit.timeit("sorted([x for x in xrange(1000,1,-1)])",number = 10000)
1.0734657617099401

I think this is because when sorted() converts the incoming value to a list it can do this more quickly for something that is already a list than for a generator. The source code seems to confirm this (but this is from reading the comments rather than fully understanding everything that is going on).

Answered By: Dave Webb

If performance is important, why not process the data as it is yielded by the generator, and apply the ordering to the results afterwards? Of course, this works only if there is no dependency between iterations (i.e. the data of sorted iteration #[i] is not needed for any calculation in sorted iteration #[i + 1]).
What I am trying to say is that sorting a set of potentially large structures yielded by the generator might add a lot of unnecessary complexity, when the ordering could instead take place after all elements have been processed.
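
One possible reading of this suggestion, sketched with a hypothetical records() generator and made-up data: each item is reduced to a small summary as soon as it is yielded, and only the summaries are sorted at the end.

```python
def records():
    """Hypothetical generator standing in for an expensive data source."""
    for name, size in [("c.txt", 30), ("a.txt", 10), ("b.txt", 20)]:
        yield {"name": name, "size": size, "payload": "x" * size}

# Process each record as it is yielded, keeping only a small summary,
# so the large payloads never need to be held in memory together.
summaries = [(rec["name"], len(rec["payload"])) for rec in records()]

# Order the lightweight results once all elements have been processed.
summaries.sort()
print(summaries)  # [('a.txt', 10), ('b.txt', 20), ('c.txt', 30)]
```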

I should just add to Dave Webb’s timing answer (I put in what may be an anonymous edit) that when you pass an optimized iterable such as xrange directly, it may be much faster; much of the overhead may come from the code creating a list or generator of its own:

>>> timeit.timeit("sorted(xrange(1000, 1, -1))", number=10000)
0.34192609786987305
>>> timeit.timeit("sorted(range(1000, 1, -1))", number=10000)
0.4096639156341553
>>> timeit.timeit("sorted([el for el in xrange(1000, 1, -1)])", number=10000)
0.6886589527130127
>>> timeit.timeit("sorted(el for el in xrange(1000, 1, -1))", number=10000)
0.9492318630218506
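
The session above is Python 2. A rough Python 3 equivalent of the same four variants might look like the sketch below, where range plays the role of xrange and list(range(...)) the role of the old list-returning range. The iteration count is reduced to keep the run short, and no timings are shown because they depend on the machine and interpreter.

```python
import timeit

number = 1000  # fewer iterations than the 10000 used in the original session
variants = [
    "sorted(range(1000, 1, -1))",                 # was xrange in Python 2
    "sorted(list(range(1000, 1, -1)))",           # was range in Python 2
    "sorted([el for el in range(1000, 1, -1)])",  # list comprehension
    "sorted(el for el in range(1000, 1, -1))",    # generator expression
]

timings = {stmt: timeit.timeit(stmt, number=number) for stmt in variants}
for stmt, seconds in timings.items():
    print(f"{seconds:.4f}s  {stmt}")
```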

Answered By: Mark