Python 3 – Which one is faster for accessing data: dataclasses or dictionaries?

Question:

Python 3.7 introduced dataclasses to store data. I’m considering moving to this new approach, which is more organized and better structured than a dict.

But I have a doubt. Python hashes the keys of a dict, and that makes looking up keys and values much faster. Do dataclasses implement something like it?

Which one is faster and why?

Asked By: sergiomafra

Answers:

By default, all classes in Python use a dictionary under the hood to store their instance attributes, as you can read here in the documentation. For a more in-depth reference on how Python classes (and many more things) work, you can also check out the article on Python’s data model, in particular the section on custom classes.
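As a minimal sketch of the point above (the class name `Point` is made up for illustration), a regular instance keeps its attributes in a per-instance `__dict__`, so attribute access ends up as a dictionary lookup:

```python
from dataclasses import dataclass


@dataclass
class Point:  # hypothetical example class, not from the question
    x: int
    y: int


p = Point(1, 2)

# Instance attributes live in an ordinary dict attached to the instance.
attrs = vars(p)  # vars(p) returns p.__dict__ itself, not a copy
print(attrs)

# Writing through that dict is visible via normal attribute access.
attrs["x"] = 10
print(p.x)
```

This only holds for classes without `__slots__`; slotted classes store their attributes differently.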

So in general, there shouldn’t be a loss in performance by moving from dictionaries to dataclasses. But it’s better to make sure with the timeit module:


Baseline

# dictionary creation
$ python -m timeit "{'var': 1}"
5000000 loops, best of 5: 52.9 nsec per loop

# dictionary key access
$ python -m timeit -s "d = {'var': 1}" "d['var']"
10000000 loops, best of 5: 20.3 nsec per loop

Basic dataclass

# dataclass creation
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: var: int" "A(1)" 
1000000 loops, best of 5: 288 nsec per loop

# dataclass attribute access
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: var: int" -s "a = A(1)" "a.var" 
10000000 loops, best of 5: 25.3 nsec per loop

Here we can see that using classes does have some overhead. For instance creation it’s quite a bit (~5 times slower), but you don’t necessarily need to care that much about it as long as you don’t create and throw away dataclass instances in a hot loop.

The attribute access is probably the more important metric, and while dataclasses are again slower (~1.25 times), this time it’s not by that much.

If you think that’s still a tad too slow, you can tune your dataclass (or any class, really) by using slots instead of a dictionary to store its attributes:


Slotted dataclass

# dataclass creation
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: __slots__ = ('var',); var: int" "A(1)" 
1000000 loops, best of 5: 242 nsec per loop

# dataclass attribute access
$ python -m timeit -s "from dataclasses import dataclass" -s "@dataclass" -s "class A: __slots__ = ('var',); var: int" -s "a = A(1)" "a.var"
10000000 loops, best of 5: 21.7 nsec per loop

By using this pattern we could shave off a few more nanoseconds. At this point, at least regarding attribute access, there shouldn’t be a noticeable difference from dictionaries any more, and you can use the upsides of dataclasses without compromising speed.
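If you would rather rerun the three access measurements from a single script instead of the shell, something like the following should work. It is only a sketch: the class names `Plain` and `Slotted` and the loop count are my own choices, and the absolute numbers will differ from the ones above depending on machine and Python version.

```python
import timeit
from dataclasses import dataclass


@dataclass
class Plain:  # hypothetical name: attributes stored in __dict__
    var: int


@dataclass
class Slotted:  # hypothetical name: attributes stored in slots
    __slots__ = ("var",)
    var: int


# Names made available to the timed statements.
namespace = {"d": {"var": 1}, "plain": Plain(1), "slotted": Slotted(1)}

N = 200_000  # arbitrary loop count; raise it for steadier numbers

statements = {
    "dict key access": "d['var']",
    "dataclass attribute access": "plain.var",
    "slotted dataclass attribute access": "slotted.var",
}

# timeit.timeit returns the total time in seconds for N executions.
results = {
    label: timeit.timeit(stmt, globals=namespace, number=N)
    for label, stmt in statements.items()
}

for label, seconds in results.items():
    print(f"{label}: {seconds / N * 1e9:.1f} nsec per loop")
```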

Answered By: Arne

@Arne has an excellent answer and showed that dicts are indeed the faster of the two. Let me just add a couple of things:

As I mentioned in my comment here, as of Python 3.10 there is the @dataclass(slots=True) option, which creates a dataclass with slot members, exactly as in the faster of Arne’s examples. There is not much reason to ever skip slots=True unless you know you need what it removes, such as the per-instance __dict__ that lets you add attributes dynamically.

Now on to the other, lesser-known alternative. One of the main reasons you might pick a dataclass over a dict is for IDE hints (e.g. IntelliSense) and a sanity check that the expected key exists. Since Python 3.8, there has been the PEP 589 TypedDict, which allows that for the standard format of a dictionary. Consider the following:

from typing import TypedDict

class Movie(TypedDict):
    name: str
    year: int

movie: Movie = {'name': 'Blade Runner',
                'year': 1982}

In this case, your IDE will be able to hint to you which keys are valid, and show a correct init function:

[Screenshots: IDE hint for key access; IDE hint for the init signature]

Additionally, mypy will be able to tell you if there’s an error in key access; more or less, TypedDicts get you a few of the big dataclass benefits without using dataclasses. Overall, it’s a good solution in cases where you’re working with dictionaries already, or still need dictionary things like easy nestability and slightly better performance.* See the above PEP link for lots of good examples.

* The performance numbers are trivial. If dataclasses make your life easier, use them. Don’t prematurely optimize toward something that isn’t a shoo-in. Too many programmers make things harder for themselves trying to shave off nanoseconds rather than taking a look at the bigger picture of what their code is doing.
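To make the TypedDict trade-off concrete, here is a small sketch reusing the Movie example from above (the key 'director' is made up). At runtime a TypedDict value is just a plain dict, which is why it costs nothing extra; the checking happens only in static tools such as mypy or your IDE.

```python
from typing import TypedDict


class Movie(TypedDict):
    name: str
    year: int


movie: Movie = {"name": "Blade Runner", "year": 1982}

# At runtime there is no special class involved: this is an ordinary dict.
is_plain_dict = type(movie) is dict
print(is_plain_dict)

# A static checker would flag the unknown key below before the code runs;
# at runtime it is simply a missing key in a dict.
try:
    movie["director"]
    key_error_raised = False
except KeyError:
    key_error_raised = True
print(key_error_raised)
```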

Answered By: Trevor Gross

While I’m a big fan of dataclasses and they often lead to more elegant code, the performance difference can actually be massive. We recently refactored a data processing application that used dicts to use dataclasses instead, and saw throughput drop by over 100x. Payloads that would previously take milliseconds to process were taking several seconds.

The code doesn’t do anything particularly convoluted, but does map various entries between data structures. Profiling the runs indicated that pretty much all the execution time is taken up by various built-in dataclass methods (especially _asdict_inner(), which took up about 30% of total time), as these were executed whenever any data manipulation took place – e.g. merging one structure into another. Using slotted dataclasses only led to a ~10% speedup. I’m sure other improvements would have been possible, but the gap was so huge that it didn’t seem worth it.
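The code in question isn’t shown here, so the following is only an illustration of why asdict() can get expensive, with made-up class names (`Inner`, `Outer`). It recurses into nested dataclasses and rebuilds containers on every call, whereas reading attributes directly copies nothing.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class Inner:  # hypothetical nested structure
    values: list = field(default_factory=list)


@dataclass
class Outer:  # hypothetical top-level structure
    name: str
    inner: Inner


outer = Outer("a", Inner([1, 2, 3]))

# asdict() walks the whole object graph: the nested dataclass becomes a
# dict and the list is rebuilt as a new list, on every single call.
as_dict = asdict(outer)
print(as_dict)

# Reading attributes directly is shallow: no recursion, no copying.
direct = {"name": outer.name, "inner": outer.inner}
print(direct["inner"] is outer.inner)
```

If a merge or mapping step calls asdict() on large nested structures inside a loop, that recursive copying is paid each time, which would be consistent with the profile described above.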

We switched back to using TypedDicts, and performance returned to the original levels. TypedDicts don’t have all the benefits of dataclasses (such as generated methods and default values; note that neither one enforces types at runtime), but the trade-off seems like a no-brainer for applications that are in any way performance-sensitive.

Answered By: Svet