Is calling Python methods that return booleans, like `issuperset`, memory friendly?

Question:

I wrote some code assuming that a superset check would be memory friendly and cause less fragmentation, since it returns a boolean. (a_list never has more than 2 elements, all very short strings on the same order as foo and bar.) For example:

OK_SET = set(['foo', 'bar'])  # set() takes a single iterable argument

def are_args_ok(a_list):
    if not OK_SET.issuperset(a_list): # expected to run a lot
        raise ValueError('bad value in a_list') # virtually never

I considered the above preferable to the version below, partly for readability, but also because I assumed it is better not to create lots of unnecessary lists, and I hoped it would not create any other objects since it only returns a boolean.

def are_args_ok(a_list):
    if [i for i in a_list if i not in ['foo', 'bar']]: # expected to run a lot
        raise ValueError('bad value in a_list') # virtually never

However, I’m not clear on all of the inner workings of Python. Therefore, I’ve been reading the CPython source (excerpt below) and it appears the check for a superset creates a set object of the other if it’s not already a set:

static PyObject *
set_issuperset(PySetObject *so, PyObject *other)
{
    PyObject *tmp, *result;

    if (!PyAnySet_Check(other)) {
        tmp = make_new_set(&PySet_Type, other);
        if (tmp == NULL)
            return NULL;
        result = set_issuperset(so, tmp);
        Py_DECREF(tmp);
        return result;
    }
    return set_issubset((PySetObject *)other, (PyObject *)so);
}

So it looks like a new set is created whenever other is a list, which means my assumption was wrong, even if the first version is more readable. I think the second code may actually be faster, at least it is when I test with Python 2.6. So my question is: is the first code preferable to the second in terms of memory performance and fragmentation?
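To make the control flow of the C excerpt easier to follow, here is a minimal pure-Python sketch of the same steps. The function name issuperset_sketch is hypothetical and exists only for illustration; it is not how CPython exposes the method, and it is written for Python 3 although the logic is the same on Python 2.

```python
def issuperset_sketch(so, other):
    """Illustrative mirror of the C control flow shown above."""
    if not isinstance(other, (set, frozenset)):
        # Same step as make_new_set(): a temporary set is built from the iterable.
        tmp = set(other)
        # tmp goes out of scope when this call returns, like Py_DECREF(tmp) in C.
        return issuperset_sketch(so, tmp)
    # other is already a set: delegate to the subset test with the roles swapped.
    return other.issubset(so)


OK_SET = set(['foo', 'bar'])

print(issuperset_sketch(OK_SET, ['foo']))          # True, via a temporary set
print(issuperset_sketch(OK_SET, ['foo', 'baz']))   # False, 'baz' is not allowed
```

The temporary set lives only for the duration of the call, so for a two-element list the extra allocation is small and short-lived.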

Is there a strictly dominating approach that I have not yet considered?

The following timing script answers the relevant questions about speed:

import timeit

must= '''MUST=set(['a','b'])

def validate(vals):
    if not MUST.issuperset(vals):
        raise Exception'''

mustdiff= '''MUST=set(['a','b'])

def validate(vals):
    if set(vals) - MUST:
        raise Exception'''

must2= '''def validate(vals):
    if not set(['a','b']).issuperset(vals):
        raise Exception'''

old_list = '''def validate(vals):
    if [i for i in vals if i not in ['a','b']]:
        raise Exception
'''

old_tup = '''def validate(vals):
    if [i for i in vals if i not in ('a','b')]:
        raise Exception
'''
test = "validate(['a']); validate(['a', 'b'])"

def main():
    # each call prints the timings of three repeats for one variant
    print timeit.repeat(test, setup=must)
    print timeit.repeat(test, setup=mustdiff)
    print timeit.repeat(test, setup=must2)
    print timeit.repeat(test, setup=old_list)
    print timeit.repeat(test, setup=old_tup)

if __name__ == '__main__':
    main()

outputs:

[0.90473995592992651, 0.90407950738062937, 0.90170756738780256]
[1.0068785656071668, 1.0049370642036592, 1.0076947689335611]
[1.4705243140447237, 1.4697376920521492, 1.4727534788248704]
[0.74187539617878429, 0.74010685502116758, 0.74236680853618964]
[0.74886594826284636, 0.74639892541290465, 0.74475293549448907]
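The script above uses Python 2 print statements. For readers who want to repeat the measurement on Python 3, here is a self-contained sketch covering two of the variants. The names SETUPS, TEST and run are hypothetical, the reduced iteration count is an assumption to keep the run short, and no timings are shown because they depend on the machine and interpreter.

```python
import timeit

# Setup snippets copied from the question; each one defines validate().
SETUPS = {
    'must': '''
MUST = set(['a', 'b'])

def validate(vals):
    if not MUST.issuperset(vals):
        raise Exception
''',
    'old_list': '''
def validate(vals):
    if [i for i in vals if i not in ['a', 'b']]:
        raise Exception
''',
}

TEST = "validate(['a']); validate(['a', 'b'])"


def run(number=100000, repeat=3):
    """Return the best (minimum) time in seconds for each variant."""
    return {
        name: min(timeit.repeat(TEST, setup=setup, number=number, repeat=repeat))
        for name, setup in SETUPS.items()
    }


if __name__ == '__main__':
    for name, best in sorted(run().items()):
        print('%10s: %.4f s' % (name, best))
```

Taking the minimum of the repeats is the usual way to reduce noise from other processes.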

Answers:

"I think the second code may actually be faster, at least it is when I test with Python 2.6."

I would be shocked if that were the case — what are the sizes of the lists in question? My guess is that, maybe, the conversion of the list into a set has some constant overhead that negates any performance benefit from using the set operations.

The set operations give you asymptotically optimal performance for this kind of operation. I would expect the issuperset method to give you the best performance, followed perhaps by:

if not set(a_list) - OK_SET:...

with O(len(a_list)) performance. Note that looking up the global variable OK_SET on every call also adds a small cost compared with a local name.

That being said: unless you are testing sets which contain thousands of elements, the difference is probably negligible. Premature optimization is the root of all evil. If your production code is actually only testing two elements, I doubt you will find much difference.
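As a sanity check that the approaches discussed so far accept and reject the same inputs, here is a small sketch. The helper names (bad_by_issuperset and so on) are hypothetical; each returns True when a_list contains a value outside OK_SET.

```python
OK_SET = set(['foo', 'bar'])


def bad_by_issuperset(a_list):
    # Builds a temporary set internally when a_list is not a set.
    return not OK_SET.issuperset(a_list)


def bad_by_difference(a_list):
    # Builds a set from a_list, then a second set for the difference.
    return bool(set(a_list) - OK_SET)


def bad_by_listcomp(a_list):
    # Builds a list holding every offending element.
    return bool([i for i in a_list if i not in ['foo', 'bar']])


samples = [[], ['foo'], ['foo', 'bar'], ['baz'], ['foo', 'baz']]
for sample in samples:
    results = (bad_by_issuperset(sample),
               bad_by_difference(sample),
               bad_by_listcomp(sample))
    # All three checks must agree on every sample.
    assert len(set(results)) == 1
    print(sample, results[0])
```

Since the results are identical, the choice between them comes down to readability and the constant-factor costs measured in the timings.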

Answered By: Patrick Collins

For such a small number of items, this seems to be slightly faster than several other things I tried:

    .
    .
    .
kiss = '''MUST=['a','b']

def validate(vals):
    for i in vals:
        if i not in MUST:
            raise Exception
'''

test = "validate(['a']); validate(['a', 'b'])"

def main():
    print '    must:', min(timeit.repeat(test, setup=must))
    print 'mustdiff:', min(timeit.repeat(test, setup=mustdiff))
    print '   must2:', min(timeit.repeat(test, setup=must2))
    print 'old_list:', min(timeit.repeat(test, setup=old_list))
    print ' old_tup:', min(timeit.repeat(test, setup=old_tup))
    print '    kiss:', min(timeit.repeat(test, setup=kiss))

if __name__ == '__main__':
    main()

Answered By: martineau
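The timings in this thread measure speed only, while the question also asks about memory. One way to look at allocations, sketched here under the assumption that Python 3.4 or later is available (tracemalloc does not exist in Python 2.6), is to record the peak traced memory around a single call. The names check_loop, check_issuperset and peak_bytes are hypothetical, and the byte counts depend on the interpreter version and build, so none are shown.

```python
import tracemalloc

OK = ['foo', 'bar']
OK_SET = set(OK)


def check_loop(vals):
    # Plain loop: no intermediate list or set is built.
    for i in vals:
        if i not in OK:
            raise ValueError('bad value')


def check_issuperset(vals):
    # issuperset() converts a non-set argument into a temporary set.
    if not OK_SET.issuperset(vals):
        raise ValueError('bad value')


def peak_bytes(func, vals):
    """Return the peak number of bytes traced while func(vals) runs."""
    tracemalloc.start()
    try:
        func(vals)
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


args = ['foo', 'bar']
print('loop      :', peak_bytes(check_loop, args))
print('issuperset:', peak_bytes(check_issuperset, args))
```

For two-element inputs any difference is expected to be a few hundred bytes at most, and the temporary objects are released as soon as each call returns.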