Idiomatic way to do list/dict in Cython?

Question

My problem: I’ve found that processing large data sets with raw C++ using the STL map and vector can often be considerably faster (and with lower memory footprint) than using Cython.

I figure that part of this speed penalty is due to using Python lists and dicts, and that there might be some tricks to use less encumbered data structures in Cython. For example, this page (http://wiki.cython.org/tutorials/numpy) shows how to make numpy arrays very fast in Cython by predefining the size and types of the ND array.

Question: Is there any way to do something similar with lists/dicts, e.g. by stating roughly how many elements or (key,value) pairs you expect to have in them? That is, is there an idiomatic way to convert lists/dicts to (fast) data structures in Cython?

If not I guess I’ll just have to write it in C++ and wrap in a Cython import.

Asked By: ramanujan

||

Source

Answer 1

You can take a look at the standard array module for Python if this is appropriate for your Cython setting. I’m not sure since I have never used Cython.

Answered By: Andrey Vlasovskikh

Answer 2

There is no way to get native Python lists/dicts up to the speed of a C++ map/vector or even anywhere close. It has nothing to do with allocation or type declaration but rather paying the interpreter overhead. The example you mention (numpy) is a C extension and is written in C for precisely this reason.

Answered By: Karl Guertin

Answer 3

Doing similar operations in Python as in C++ can often be slower. list and dict are actually implemented very well, but you gain a lot of overhead using Python objects, which are more abstract than C++ objects and require a lot more lookup at runtime.

Incidentally, std::vector is implemented in a pretty similar way to list. std::map, though, is actually implemented in a way that many operations are slower than dict as its size gets large. For suitably large examples of each, dict overcomes the constant factor by which it’s slower than std::map and will actually do operations like lookup, insertion, etc. faster.

If you want to use std::map and std::vector, nothing is stopping you. You’ll have to wrap them yourself if you want to expose them to Python. Do not be shocked if this wrapping consumes all or much of the time you were hoping to save. I am not aware of any tools that make this automatic for you.

There are C API calls for controlling the creation of objects with some detail. You can say “Make a list with at least this many elements”, but this doesn’t improve the overall complexity of your list creation-and-filling operation. It certainly doesn’t change much later as you try to change your list.

My general advice is

If you want a fixed-size array (you talk about specifying the size of a list), you may actually want something like a numpy array.
I doubt you are going to get any speedup you want out of using std::vector over list for a general replacement in your code. If you want to use it behind the scenes, it may give you a satisfying size and space improvement (I of course don’t know without measuring, nor do you. 😉 ).
dict actually does its job really well. I definitely wouldn’t try introducing a new general-purpose type for use in Python based on std::map, which has worse algorithmic complexity in time for many important operations and—in at least some implementations—leaves some optimisations to the user that dict already has.

If I did want something that worked a little more like std::map, I’d probably use a database. This is generally what I do if stuff I want to store in a dict (or for that matter, stuff I store in a list) gets too big for me to feel comfortable storing in memory. Python has sqlite3 in the stdlib and drivers for all other major databases available.

Answered By: Mike Graham

Answer 4

C++ is fast not just because of the static declarations of the vector and the elements that go into it, but crucially because using templates/generics one specifies that the vector will only contain elements of a certain type, e.g. vector with tuples of three elements. Cython can’t do this last thing and it sounds nontrivial — it would have to be enforced at compile time, somehow (typechecking at runtime is what Python already does). So right now when you pop something off a list in Cython there is no way of knowing in advance what type it is , and putting it in a typed variable only adds a typecheck, not speed. This means that there is no way of bypassing the Python interpreter in this regard, and it seems to me it’s the most crucial shortcoming of Cython for non-numerical tasks.

The manual way of solving this is to subclass the python list/dict (or perhaps std::vector) with a cdef class for a specific type of element or key-value combination. This would amount to the same thing as the code that templates are generating. As long as you use the resulting class in Cython code it should provide an improvement.

Using databases or arrays just solves a different problem, because this is about putting arbitrary objects (but with a specific type, and preferably a cdef class) in containers.

And std::map shouldn’t be compared to dict; std::map maintains keys in sorted order because it is a balanced tree, dict solves a different problem. A better comparison would be dict and Google’s hashtable.

Answered By: Andreas

Answer 5

Cython now has template support, and comes with declarations for some of the STL containers.

See http://docs.cython.org/src/userguide/wrapping_CPlusPlus.html#standard-library

Here’s the example they give:

from libcpp.vector cimport vector

cdef vector[int] vect
cdef int i
for i in range(10):
    vect.push_back(i)
for i in range(10):
    print vect[i]

Answered By: Sam Hartsfield

Answer 6

Just because it was not mentioned here: You can easily wrap for example a C++ vector in a custom extension type.

from libcpp.vector cimport vector

cdef class pyvector:
    """Extension type wrapping a vector"""
    cdef vector[long] _data

    cpdef void push_back(self, long x):
        self._data.push_back(x)

    @property
    def data(self):
        return self._data

In this way, you can store your data in a vector allowing fast Cython operations while still being able to access the data (with some overhead) from the Python side.

Answered By: Jan Joswig

Idiomatic way to do list/dict in Cython?

Question:

Answers: