Set "in" operator: uses equality or identity?

Question:

class A(object):
    def __cmp__(self):
        print '__cmp__'
        return object.__cmp__(self)

    def __eq__(self, rhs):
        print '__eq__'
        return True
a1 = A()
a2 = A()
print a1 in set([a1])
print a1 in set([a2])

Why does the first line print True, but the second print False? And why does neither call __eq__?

I am using Python 2.6

Asked By: Gennadiy Rozental

Answers:

You need to define __hash__ too. For example:

class A(object):
    def __hash__(self):
        print '__hash__'
        return 42

    def __cmp__(self, other):
        print '__cmp__'
        return object.__cmp__(self, other)

    def __eq__(self, rhs):
        print '__eq__'
        return True

a1 = A()
a2 = A()
print a1 in set([a1])
print a1 in set([a2])

This will work as expected.

As a general rule, any time you implement __cmp__ or __eq__ you should also implement __hash__ such that, for all x and y, x == y implies hash(x) == hash(y).
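
A minimal sketch of that rule, written so that it also runs on Python 3. The Point class and its fields are hypothetical and not taken from the question; the idea is only that __eq__ and __hash__ are derived from the same tuple, so equal objects always hash alike.

```python
class Point(object):
    """Hypothetical value class: equality and hash use the same fields."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def _key(self):
        return (self.x, self.y)

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return self._key() == other._key()

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define it explicitly.
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self._key())


p1 = Point(1, 2)
p2 = Point(1, 2)
found = p2 in set([p1])   # equal objects hash alike, so the lookup succeeds
```

Here p1 and p2 are distinct objects, yet the membership test succeeds because both the hash and the equality check agree.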

Answered By: David Wolever

Sets seem to use hash codes, then identity, before comparing for equality. The following code:

class A(object):
    def __eq__(self, rhs):
        print '__eq__'
        return True
    def __hash__(self):
        print '__hash__'
        return 1

a1 = A()
a2 = A()

print 'set1'
set1 = set([a1])

print 'set2'
set2 = set([a2])

print 'a1 in set1'
print a1 in set1

print 'a1 in set2'
print a1 in set2

outputs:

set1
__hash__
set2
__hash__
a1 in set1
__hash__
True
a1 in set2
__hash__
__eq__
True

What happens seems to be:

  1. The hash code is computed when an element is inserted into a set. (To compare with the existing elements.)
  2. The hash code for the object you’re checking with the in operator is computed.
  3. Elements of the set with the same hash code are inspected by first checking whether they’re the same object as the one you’re looking for, and only then whether they’re logically equal to it.
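
The steps above can be checked without reading printed output by recording the calls in a list. This is a sketch of the same experiment; the Traced class and the calls list are made up for illustration, and the order shown is what CPython does in practice rather than something I would treat as a language guarantee.

```python
calls = []  # records which special methods get invoked


class Traced(object):
    def __eq__(self, other):
        calls.append('__eq__')
        return True

    def __hash__(self):
        calls.append('__hash__')
        return 1  # constant hash: every instance collides with every other


a1 = Traced()
a2 = Traced()
set1 = set([a1])
set2 = set([a2])

del calls[:]                 # ignore the hashing done during insertion
same_object = a1 in set1     # identity matches, so __eq__ is skipped
calls_same = list(calls)

del calls[:]
other_object = a1 in set2    # same hash, different object: __eq__ decides
calls_other = list(calls)
```

After this runs, calls_same holds only the hash call, while calls_other holds the hash call followed by the equality call.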

Answered By: millimoose

Sets and dictionaries gain their speed by using hashing as a fast approximation of full equality checking. If you want to redefine equality, you usually need to redefine the hash algorithm so that it is consistent.

The default hash function uses the identity of the object, which is pretty useless as a fast approximation of full equality, but at least allows you to use an arbitrary class instance as a dictionary key and retrieve the value stored with it if you pass exactly the same object as a key. But it means if you redefine equality and don’t redefine the hash function, your objects will go into a dictionary/set without complaining about not being hashable, but still won’t actually work the way you expect them to.
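
To make the pitfall concrete, here is a small sketch. The Loose class is hypothetical. On Python 2 a class that defines only __eq__ keeps the identity-based hash, as described above; Python 3 instead makes such a class unhashable, so the sketch restores the identity hash explicitly to reproduce the behavior on either version.

```python
class Loose(object):
    """Hypothetical class: equality redefined, hash left identity-based."""

    def __eq__(self, other):
        return isinstance(other, Loose)   # every Loose equals every other

    # Python 2 inherits this implicitly; Python 3 needs it spelled out,
    # because defining __eq__ alone sets __hash__ to None there.
    __hash__ = object.__hash__


x = Loose()
y = Loose()

equal = (x == y)                   # True: __eq__ says so
same_hash = (hash(x) == hash(y))   # False: the hashes come from identity
x_in_own_set = x in set([x])       # True: same object, same hash
x_in_other_set = x in set([y])     # False: hashes differ, __eq__ never runs
```

The last line is the situation from the question: the objects compare equal, but the set never gets far enough to ask.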

See the official Python docs on __hash__ for more details.

Answered By: Ben

A tangential answer, but your question and my testing made me curious. If you ignore the set, which is the source of your __hash__ problem, your question turns out to be interesting anyway.

Thanks to the help I got on this SO question, I was able to chase the in operator through the source code to its root. Near the bottom I found the PyObject_RichCompareBool function, which indeed tests for identity (see the comment about “Quick result”) before testing for equality.

So unless I misunderstand the way things work, the technical answer to your question is: identity first, then equality, with the identity check built into the equality test itself. To reiterate, that is not the cause of the behavior you were seeing; it is just the technical answer to your question.

If I misunderstood the source, somebody please set me straight.

int
PyObject_RichCompareBool(PyObject *v, PyObject *w, int op)
{
    PyObject *res;
    int ok;

    /* Quick result when objects are the same.
       Guarantees that identity implies equality. */
    if (v == w) {
        if (op == Py_EQ)
            return 1;
        else if (op == Py_NE)
            return 0;
    }

    res = PyObject_RichCompare(v, w, op);
    if (res == NULL)
        return -1;
    if (PyBool_Check(res))
        ok = (res == Py_True);
    else
        ok = PyObject_IsTrue(res);
    Py_DECREF(res);
    return ok;
}
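
One visible consequence of that identity shortcut, shown in Python rather than C: a float NaN is not equal to itself, yet a container membership test still finds it. A small sketch (the variable names are arbitrary):

```python
nan = float('nan')

equal_to_itself = (nan == nan)   # False: NaN never compares equal
in_list = nan in [nan]           # True: the identity check short-circuits
in_set = nan in set([nan])       # True: same hash, same object

other_nan = float('nan')             # a distinct NaN object
other_in_list = other_nan in [nan]   # False: not identical, not equal
```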

Answered By: KobeJohn

Set __contains__ makes checks in the following order:

 'Match' if hash(a) == hash(b) and (a is b or a==b) else 'No Match'

The relevant C source code is in Objects/setobject.c::set_lookkey() and in Objects/object.c::PyObject_RichCompareBool().
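
That rule can be written out as a small pure-Python model. This is only a sketch of the matching condition, not the real probing logic in set_lookkey(); the names matches, model_contains, and Always are invented for the example. It also accounts for the original question: when the hashes differ, __eq__ is never consulted.

```python
def matches(a, b):
    """Model of the per-entry test: hash first, then identity, then equality."""
    return hash(a) == hash(b) and (a is b or a == b)


def model_contains(items, x):
    """Linear stand-in for set lookup; a real set probes a hash table."""
    return any(matches(x, item) for item in items)


class Always(object):
    """Hypothetical class: equal to everything, hash chosen by the caller."""

    def __init__(self, h):
        self.h = h

    def __eq__(self, other):
        return True

    def __hash__(self):
        return self.h


a = Always(1)
b = Always(1)   # same hash as a
c = Always(2)   # different hash

same_hash_result = model_contains([b], a)   # True: hashes match, a == b
diff_hash_result = model_contains([c], a)   # False: hash check fails first
agrees_with_set = (same_hash_result == (a in set([b])) and
                   diff_hash_result == (a in set([c])))
```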

Answered By: Raymond Hettinger