"lambda" vs. "operator.attrgetter('xxx')" as a sort key function

Question:

I am looking at some code that has a lot of sort calls using comparison functions, and it seems like it should be using key functions.

If you were to change seq.sort(lambda x, y: cmp(x.xxx, y.xxx)), which is preferable:

seq.sort(key=operator.attrgetter('xxx'))

or:

seq.sort(key=lambda a: a.xxx)

I would also be interested in comments on the merits of making changes to existing code that works.

Asked By: PaulMcG


Answers:

“Making changes to existing code that works” is how programs evolve;-). Write a good battery of tests that give known results with the existing code, save those results (that’s normally known as “golden files” in a testing context); then make the changes, rerun the tests, and verify (ideally in an automated way) that the only changes to the tests’ results are those that are specifically intended to be there — no undesired or unexpected side effects. One can use more sophisticated quality assurance strategies, of course, but this is the gist of many “integration testing” approaches.
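A minimal sketch of that idea applied to the sort in the question, assuming Python 3 (where cmp() has to be defined by hand and the old comparison function is wrapped with functools.cmp_to_key). The Item type and the sample data are hypothetical stand-ins for whatever the real code sorts:

```python
import operator
from collections import namedtuple
from functools import cmp_to_key

# Hypothetical record type and data, standing in for the real objects.
Item = namedtuple("Item", ["name", "xxx"])
data = [Item("a", 3), Item("b", 1), Item("c", 2), Item("d", 1)]


def cmp(a, b):
    # Python 3 has no built-in cmp(); this is the usual replacement.
    return (a > b) - (a < b)


# "Golden" result: the ordering produced by the existing comparison-based sort.
golden = sorted(data, key=cmp_to_key(lambda x, y: cmp(x.xxx, y.xxx)))

# Each candidate rewrite must reproduce the golden ordering exactly,
# including the relative order of items with equal keys.
assert sorted(data, key=operator.attrgetter("xxx")) == golden
assert sorted(data, key=lambda a: a.xxx) == golden
```

In a real project the golden ordering would be saved to a file from the old code rather than recomputed next to the new code.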

As for the two ways to write a simple key= function, the design intent was to make operator.attrgetter faster by being more specialized, but at least in current versions of Python there’s no measurable difference in speed. That being the case, for this special situation I would recommend the lambda, simply because it’s more concise and general (and I’m not usually a lambda-lover, mind you!-).
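To illustrate what "more general" can mean here, a small sketch with made-up data (the people list and its attribute names are assumptions for the example only). For a plain attribute lookup the two spellings are interchangeable, while a lambda can also hold an arbitrary expression:

```python
from operator import attrgetter
from types import SimpleNamespace

# Hypothetical sample objects; only the attribute names matter here.
people = [
    SimpleNamespace(first="Ada", last="lovelace"),
    SimpleNamespace(first="Alan", last="Turing"),
    SimpleNamespace(first="Grace", last="Hopper"),
]

# Plain attribute lookup: both spellings give the same ordering.
by_attr = sorted(people, key=attrgetter("last"))
by_lambda = sorted(people, key=lambda p: p.last)
assert by_attr == by_lambda

# A lambda can wrap any expression, for example a case-insensitive key.
by_lower = sorted(people, key=lambda p: p.last.lower())

print([p.first for p in by_attr])   # uppercase letters sort before lowercase
print([p.first for p in by_lower])  # case ignored
```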

Answered By: Alex Martelli

When choosing purely between attrgetter('attributename') and lambda o: o.attributename as a sort key, attrgetter() is the faster option of the two.

Remember that the key function is only applied once to each element in the list, before sorting, so to compare the two we can use them directly in a time trial:

>>> from timeit import Timer
>>> from random import randint
>>> from dataclasses import dataclass, field
>>> @dataclass
... class Foo:
...     bar: int = field(default_factory=lambda: randint(1, 10**6))
...
>>> testdata = [Foo() for _ in range(1000)]
>>> def test_function(objects, key):
...     [key(o) for o in objects]
...
>>> stmt = 't(testdata, key)'
>>> setup = 'from __main__ import test_function as t, testdata; '
>>> tests = {
...     'lambda': setup + 'key=lambda o: o.bar',
...     'attrgetter': setup + 'from operator import attrgetter; key=attrgetter("bar")'
... }
>>> for name, tsetup in tests.items():
...     count, total = Timer(stmt, tsetup).autorange()
...     print(f"{name:>10}: {total / count * 10 ** 6:7.3f} microseconds ({count} repetitions)")
...
    lambda: 130.495 microseconds (2000 repetitions)
attrgetter:  92.850 microseconds (5000 repetitions)

So applying attrgetter('bar') 1000 times is roughly 40 μs faster than applying the equivalent lambda. That’s because calling a Python function has a certain amount of overhead, more than calling into a native function such as the one produced by attrgetter().

This speed advantage translates into faster sorting too:

>>> def test_function(objects, key):
...     sorted(objects, key=key)
...
>>> for name, tsetup in tests.items():
...     count, total = Timer(stmt, tsetup).autorange()
...     print(f"{name:>10}: {total / count * 10 ** 6:7.3f} microseconds ({count} repetitions)")
...
    lambda: 218.715 microseconds (1000 repetitions)
attrgetter: 169.064 microseconds (2000 repetitions)

Answered By: Martijn Pieters

As stated by previous commenters, attrgetter is slightly faster, but for a lot of situations the difference is marginal (~microseconds).

Regarding readability, I personally prefer lambda as it’s a construct that people will have seen before in different contexts, so it will probably be easier for others to read and understand.

One other caveat is that your IDE should be able to flag a typo in the attribute name when using lambda, which it cannot do for the string passed to attrgetter.
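A short sketch of that difference, using a hypothetical Foo class and a deliberate typo ("bra" instead of "bar"). At runtime both versions fail the same way, and only when the key is actually called. The distinction is that the lambda's typo is an ordinary attribute access that a static checker can inspect, while attrgetter's typo is hidden inside a string. Whether a given IDE actually reports it depends on the type information it has:

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass
class Foo:
    bar: int


items = [Foo(2), Foo(1)]

bad_getter = attrgetter("bra")  # typo inside a string; no error raised here
bad_lambda = lambda o: o.bra    # typo written as a real attribute access

errors = []
for key in (bad_getter, bad_lambda):
    try:
        sorted(items, key=key)
    except AttributeError as exc:
        # Both keys only fail once sorted() applies them to an element.
        errors.append(type(exc).__name__)

print(errors)
```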

In general I tend to choose the construct that does not require an extra import if the alternative is easy enough to write and read.

Answered By: YBadiss