How is it that json serialization is so much faster than yaml serialization in Python?

Question:

I have code that relies heavily on yaml for cross-language serialization and while working on speeding some stuff up I noticed that yaml was insanely slow compared to other serialization methods (e.g., pickle, json).

So what really blows my mind is that json is so much faster than yaml when the output is nearly identical.

>>> import yaml, cjson; d={'foo': {'bar': 1}}
>>> yaml.dump(d, Dumper=yaml.SafeDumper)
'foo: {bar: 1}\n'
>>> cjson.encode(d)
'{"foo": {"bar": 1}}'
>>> import yaml, cjson;
>>> timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
44.506911039352417
>>> timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml; d={'foo': {'bar': 1}}", number=10000)
16.852826118469238
>>> timeit("cjson.encode(d)", setup="import cjson; d={'foo': {'bar': 1}}", number=10000)
0.073784112930297852

PyYaml’s CSafeDumper and cjson are both written in C so it’s not like this is a C vs Python speed issue. I’ve even added some random data to it to see if cjson is doing any caching, but it’s still way faster than PyYaml. I realize that yaml is a superset of json, but how could the yaml serializer be 2 orders of magnitude slower with such simple input?

Asked By: guidoism


Answers:

In general, it’s not the complexity of the output that determines the speed of parsing, but the complexity of the accepted input. The JSON grammar is very concise. The YAML parsers are comparatively complex, leading to increased overheads.

JSON’s foremost design goal is
simplicity and universality. Thus,
JSON is trivial to generate and parse,
at the cost of reduced human
readability. It also uses a lowest
common denominator information model,
ensuring any JSON data can be easily
processed by every modern programming
environment.

In contrast, YAML’s foremost design
goals are human readability and
support for serializing arbitrary
native data structures. Thus, YAML
allows for extremely readable files,
but is more complex to generate and
parse. In addition, YAML ventures
beyond the lowest common denominator
data types, requiring more complex
processing when crossing between
different programming environments.

I’m not a YAML parser implementor, so I can’t speak specifically to the orders of magnitude without some profiling data and a big corpus of examples. In any case, be sure to test over a large body of inputs before feeling confident in benchmark numbers.

Update: Whoops, misread the question. 🙁 Serialization can still be blazingly fast despite the large input grammar; however, browsing the source, it looks like PyYAML’s Python-level serialization constructs a representation graph whereas simplejson encodes builtin Python datatypes directly into text chunks.
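
The "text chunks" side of this can be seen with the standard library's json module, which is derived from simplejson. This is only a minimal sketch, assuming Python 3 and the stdlib json rather than the cjson used in the question: iterencode() yields pieces of the final text as it walks the built-in types.

```python
import json

d = {'foo': {'bar': 1}}

# iterencode() walks the dict and yields pieces of the output text as it goes
chunks = list(json.JSONEncoder().iterencode(d))

print(chunks)           # the individual text pieces
print(''.join(chunks))  # the same text that json.dumps(d) returns
```

Joining the chunks gives the complete JSON document, so nothing beyond string pieces is needed to produce the output.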

Answered By: cdleary

A cursory look at python-yaml suggests its design is much more complex than cjson’s:

>>> dir(cjson)
['DecodeError', 'EncodeError', 'Error', '__doc__', '__file__', '__name__', '__package__', 
'__version__', 'decode', 'encode']

>>> dir(yaml)
['AliasEvent', 'AliasToken', 'AnchorToken', 'BaseDumper', 'BaseLoader', 'BlockEndToken',
 'BlockEntryToken', 'BlockMappingStartToken', 'BlockSequenceStartToken', 'CBaseDumper',
'CBaseLoader', 'CDumper', 'CLoader', 'CSafeDumper', 'CSafeLoader', 'CollectionEndEvent', 
'CollectionNode', 'CollectionStartEvent', 'DirectiveToken', 'DocumentEndEvent', 'DocumentEndToken', 
'DocumentStartEvent', 'DocumentStartToken', 'Dumper', 'Event', 'FlowEntryToken', 
'FlowMappingEndToken', 'FlowMappingStartToken', 'FlowSequenceEndToken', 'FlowSequenceStartToken', 
'KeyToken', 'Loader', 'MappingEndEvent', 'MappingNode', 'MappingStartEvent', 'Mark', 
'MarkedYAMLError', 'Node', 'NodeEvent', 'SafeDumper', 'SafeLoader', 'ScalarEvent', 
'ScalarNode', 'ScalarToken', 'SequenceEndEvent', 'SequenceNode', 'SequenceStartEvent', 
'StreamEndEvent', 'StreamEndToken', 'StreamStartEvent', 'StreamStartToken', 'TagToken', 
'Token', 'ValueToken', 'YAMLError', 'YAMLObject', 'YAMLObjectMetaclass', '__builtins__', 
'__doc__', '__file__', '__name__', '__package__', '__path__', '__version__', '__with_libyaml__', 
'add_constructor', 'add_implicit_resolver', 'add_multi_constructor', 'add_multi_representer', 
'add_path_resolver', 'add_representer', 'compose', 'compose_all', 'composer', 'constructor', 
'cyaml', 'dump', 'dump_all', 'dumper', 'emit', 'emitter', 'error', 'events', 'load', 
'load_all', 'loader', 'nodes', 'parse', 'parser', 'reader', 'representer', 'resolver', 
'safe_dump', 'safe_dump_all', 'safe_load', 'safe_load_all', 'scan', 'scanner', 'serialize', 
'serialize_all', 'serializer', 'tokens']

More complex designs almost invariably mean slower designs, and this is far more complex than most people will ever need.

Answered By: Glenn Maynard

Speaking about efficiency, I used YAML for a time and was attracted by the simplicity that some name/value assignments take on in this language. However, in the process I tripped time and again over one of YAML’s finesses, subtle variations in the grammar that allow you to write special cases in a more concise style and such. In the end, although YAML’s grammar is almost certainly formally consistent, it has left me with a certain feeling of ‘vagueness’. I then restricted myself to not touching existing, working YAML code and writing everything new in a more roundabout, fail-safe syntax, which eventually made me abandon YAML altogether. The upshot is that YAML tries to look like a W3C standard, and produces a small library of hard-to-read literature concerning its concepts and rules.

This, I feel, is far more intellectual overhead than needed. Look at SGML/XML: developed by IBM in the roaring 60s, standardized by the ISO, known (in a dumbed-down and modified form) as HTML to uncounted millions of people, documented and documented and documented again the world over. Along comes little JSON and slays that dragon. How could JSON become so widely used in so short a time, with just one meager website (and a JavaScript luminary to back it)? It is in its simplicity, the sheer absence of doubt in its grammar, the ease of learning and using it.

XML and YAML are hard for humans, and they are hard for computers. JSON is quite friendly and easy for both humans and computers.

Answered By: flow

In applications I’ve worked on, type inference between strings and numbers (float/int) is where the largest overhead in parsing yaml lies, because strings can be written without quotes. Because all strings in json are quoted, there is no backtracking when parsing strings. A great example of where this slows things down is the value 0000000000000000000s. You cannot tell that this value is a string until you’ve read to the end of it.
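
A rough illustration of that difference, as a toy sketch and not PyYAML's actual resolver: the hypothetical classify_plain_scalar below has to match the whole token before it can decide on a type, while the hypothetical json_kind can decide from the first character alone. The integer pattern is a deliberate simplification.

```python
import re

# Simplified integer pattern (an assumption); real YAML resolvers use
# several stricter patterns for ints, floats, booleans, timestamps, ...
INT_RE = re.compile(r'[-+]?[0-9]+')


def classify_plain_scalar(token):
    """Toy YAML-style typing: the type is known only after the whole token is matched."""
    if INT_RE.fullmatch(token):
        return 'int'
    return 'str'


def json_kind(token):
    """Toy JSON-style typing: the first character settles the type."""
    first = token[0]
    if first == '"':
        return 'string'
    if first in '-0123456789':
        return 'number'
    if first == '{':
        return 'object'
    if first == '[':
        return 'array'
    return 'literal'  # true, false or null


print(classify_plain_scalar('0000000000000000000'))   # all digits
print(classify_plain_scalar('0000000000000000000s'))  # the final 's' changes the answer
print(json_kind('"0000000000000000000s"'))            # decided at the opening quote
```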

The other answers are correct but this is a specific detail that I’ve discovered in practice.

Answered By: twosnac

Although you have an accepted answer, unfortunately that only does
some handwaving in the direction of the PyYAML documentation and
quotes a statement in that documentation that is not correct: PyYAML
does not make a representation graph during dumping, it creates a
linear stream (and just like json keeps a bucket of IDs to see if there are
recursions).


First of all you have to realize that while the cjson dumper is
handcrafted C-code only, YAML’s CSafeDumper shares two of the four dump stages
(Representer and Resolver) with the normal pure Python SafeDumper
and that the other two stages (the Serializer and Emitter) are not
written completely handcrafted in C, but consist of a Cython module
which calls the C library libyaml for emitting.


Apart from that significant part, the simple answer to your question of
why it takes longer is that dumping YAML does more. This is not so
much because YAML is harder, as @flow claims, but because the extras
that YAML can do make it so much more powerful than JSON, and also more
user-friendly if you need to process the result with an editor. That
means more time is spent in the YAML library when applying these extra features,
and in many cases also when just checking whether something applies.

Here is an example: even if you have never gone through the PyYAML
code, you’ll have noticed that the dumper doesn’t quote foo and
bar. That is not because these strings are keys, as YAML doesn’t
have the restriction that JSON has, that a key for a mapping needs to
be a string. E.g. a Python string that is a value in a mapping can
also be unquoted (i.e. plain).

The emphasis is on can, because it is not always so. Take for
instance a string that consists of numeral characters only:
12345678. This needs to be written out with quotes as otherwise this
would look exactly like a number (and read back in as such when parsing).

How does PyYAML know when to quote a string and when not? On dumping
it actually first dumps the string, then parses the result to make
sure, that when it reads that result back, it gets the original value.
And if that proves not to be the case, it applies quotes.

Let me repeat the important part of the previous sentence again, so
you don’t have to re-read it:

it dumps the string, then parses the result

This means it applies all of the regex matching it does when
loading to see if the resulting scalar would load as an integer,
float, boolean, datetime, etc., to determine whether quotes need to be
applied or not.¹
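
The kind of checking described above can be sketched as follows. This is a toy model only: the patterns and the names implicit_type and dump_string are hypothetical simplifications, not PyYAML's code, and the sketch ignores the other reasons a string may need quotes (for instance special characters such as ": " or " #").

```python
import re

# Hypothetical, simplified patterns; PyYAML's real resolver has more and stricter ones
IMPLICIT_PATTERNS = {
    'int': re.compile(r'[-+]?[0-9]+'),
    'float': re.compile(r'[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?'),
    'bool': re.compile(r'true|false|yes|no|on|off', re.IGNORECASE),
    'null': re.compile(r'null|~', re.IGNORECASE),
    'timestamp': re.compile(
        r'[0-9]{4}-[0-9]{2}-[0-9]{2}(?:[ T][0-9]{2}:[0-9]{2}:[0-9]{2})?'),
}


def implicit_type(text):
    """Return the type an unquoted scalar would load as in this toy model."""
    for name, pattern in IMPLICIT_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return 'str'


def dump_string(text):
    """Write a Python string plain if that is safe, single-quoted otherwise."""
    if implicit_type(text) != 'str':
        # A plain scalar would be read back as another type, so quote it
        return "'" + text.replace("'", "''") + "'"
    return text


for value in ['foo', '12345678', 'yes', '1991-09-12 08:45:00']:
    print(dump_string(value))
```

Every string that is dumped pays for all of these pattern checks, even the ones (like foo) that end up unquoted.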


In any real application with complex data, a JSON based
dumper/loader is too simple to use directly and a lot more
intelligence has to be in your program compared to dumping the same
complex data directly to YAML. A simplified example is when you want to work
with date-time stamps, in that case you have to convert a string back
and forth to datetime.datetime yourself if you are using JSON. During loading
you have to do that either based on the fact that this is a value
associated with some (hopefully recognisable) key:

{ "datetime": "2018-09-03 12:34:56" }

or with a position in a list:

["FirstName", "Lastname", "1991-09-12 08:45:00"]

or based on the format of the string (e.g. using regex).

In all of these cases much more work needs to be done in your program. The same
holds for dumping and that does not only mean extra development time.
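
As a minimal sketch of that extra work, assuming Python 3 and the standard library json module instead of cjson: the hypothetical helpers encode_default and decode_hook convert datetime values on the way out and, based on the key name "datetime", on the way back in.

```python
import datetime
import json

FMT = '%Y-%m-%d %H:%M:%S'


def encode_default(obj):
    """Called by json.dumps for objects it cannot serialize itself."""
    if isinstance(obj, datetime.datetime):
        return obj.strftime(FMT)
    raise TypeError('cannot serialize %r' % (obj,))


def decode_hook(mapping):
    """Called by json.loads for every decoded object; converts by key name."""
    if 'datetime' in mapping:
        mapping['datetime'] = datetime.datetime.strptime(mapping['datetime'], FMT)
    return mapping


original = {'datetime': datetime.datetime(2018, 9, 3, 12, 34, 56)}
text = json.dumps(original, default=encode_default)
restored = json.loads(text, object_hook=decode_hook)

print(text)
print(restored == original)
```

The conversion rule lives in your program, not in the data: a timestamp stored under any other key would come back as a plain string.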

Let’s regenerate your timings with what I get on my machine
so we can compare them with other measurements. I rewrote your code
somewhat, because it was incomplete (timeit?) and imported other
things twice. It was also impossible to just cut and paste because of the >>> prompts.

from __future__ import print_function

import sys
import yaml
import cjson
from timeit import timeit

NR=10000
ds = "; d={'foo': {'bar': 1}}"
d = {'foo': {'bar': 1}}

print('yaml.SafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.SafeDumper)
print('cjson.encode:   ', cjson.encode(d))
print()


res = timeit("yaml.dump(d, Dumper=yaml.SafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.SafeDumper ', res)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup="import yaml"+ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("cjson.encode(d)", setup="import cjson"+ds, number=NR)
print('cjson.encode    ', res)

and this outputs:

yaml.SafeDumper: foo: {bar: 1}
cjson.encode:    {"foo": {"bar": 1}}

yaml.SafeDumper  3.06794905663
yaml.CSafeDumper 0.781533956528
cjson.encode     0.0133550167084

Now let’s dump a simple data structure that includes a datetime:

import datetime
from collections import Mapping, Sequence  # python 2.7 has no .abc

d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}

def stringify(x, key=None):
    # key=True marks x as a mapping key: a sequence used as a key is
    # repr()'d into a string, since a JSON key cannot be a list
    if isinstance(x, str):
        return x
    if isinstance(x, Mapping):
        res = {}
        for k, v in x.items():
            res[stringify(k, key=True)] = stringify(v)
        return res
    if isinstance(x, Sequence):
        res = [stringify(k) for k in x]
        if key:
            res = repr(res)
        return res
    if isinstance(x, datetime.datetime):
        return x.isoformat(sep=' ')
    return repr(x)

print('yaml.CSafeDumper:', end=' ')
yaml.dump(d, sys.stdout, Dumper=yaml.CSafeDumper)
print('cjson.encode:    ', cjson.encode(stringify(d)))
print()

This gives:

yaml.CSafeDumper: foo: {bar: '1991-09-12 08:45:00'}
cjson.encode:     {"foo": {"bar": "1991-09-12 08:45:00"}}

For the timing of the above I created a module myjson that wraps
cjson.encode and has the above stringify defined. If you use that:

d = {'foo': {'bar': datetime.datetime(1991, 9, 12, 8, 45, 0)}}
ds = 'import datetime, myjson, yaml; d=' + repr(d)
res = timeit("yaml.dump(d, Dumper=yaml.CSafeDumper)", setup=ds, number=NR)
print('yaml.CSafeDumper', res)
res = timeit("myjson.encode(d)", setup=ds, number=NR)
print('cjson.encode    ', res)

giving:

yaml.CSafeDumper 0.813436031342
cjson.encode     0.151570081711

Even with this still rather simple data, the gap already shrinks from two orders
of magnitude difference in speed to less than one.


YAML’s plain scalars and block style formatting make for more readable data.
That you can have a trailing comma in a sequence (or mapping) makes for
fewer failures when manually editing YAML data than with the same data in JSON.

YAML tags allow for in-data indication of your (complex) types. When
using JSON you have to take care, in your code, of anything more
complex than mappings, sequences, integers, floats, booleans and
strings. Such code requires development time, and is unlikely to be
as fast as python-cjson (you are of course free to write your code
in C as well).

Dumping some data, like recursive data structures (e.g. topological
data) or complex keys, is pre-defined in the PyYAML library. There the
JSON library just errors out, and implementing a workaround for that is
non-trivial and most likely slows things down so much that speed differences become less relevant.
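
The JSON side of this is easy to reproduce. The sketch below assumes Python 3 and the standard library json module (not cjson); try_dump is a hypothetical helper that only reports which exception, if any, was raised.

```python
import json

# A recursive structure: the dict contains itself
node = {'name': 'a'}
node['self'] = node

# A complex key: a tuple used as a mapping key
by_point = {(1, 2): 'point'}


def try_dump(obj):
    """Return the JSON text, or the name of the exception json.dumps raised."""
    try:
        return json.dumps(obj)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__


print(try_dump(node))        # circular reference
print(try_dump(by_point))    # key that is not a str, int, float, bool or None
print(try_dump({'ok': [1]})) # plain data still works
```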

Such power and flexibility come at the price of lower speed. When
dumping many simple things JSON is the better choice, as you are unlikely
to edit the result by hand anyway. For anything that involves
editing or complex objects or both, you should still consider using
YAML.


¹ It is possible to force dumping of all Python strings as YAML
scalars with (double) quotes, but setting the style is not enough to
prevent all readback.

Answered By: Anthon