how to hash a "json" nested dictionary identically in Python and JavaScript?

Question:

What’s the best way to consistently hash an object/dictionary that’s limited to what JSON can represent, in both JavaScript and Python? What about in many different languages?

Of course there are hash functions implemented consistently in many different languages that take a string, but to hash an object you have to convert it to a string representation first.

I want a hash function that will always return the same value for the same dictionary in any language, but the JSON spec doesn’t guarantee anything about the order of keys in the serialized representation.

Do json.dumps() and JSON.stringify() behave identically? How would you verify this?

If not, is there a serialization format with libraries in many languages (I’m practically interested in Python and JavaScript but also curious about all languages) that doesn’t require any additional processing by the caller to produce consistent results?

Asked By: mwhite

||

Answers:

JSON is a well-defined language for representing the state of objects. The functions do not behave identically, but they do behave equivalently.

For instance:

json.dumps({'hello':'goodbye', 123: 456})

May produce either:

{"hello":"goodbye", "123": 456}

or

{"123": 456, "hello":"goodbye"}

And if you pass in the indent parameter then you get even more possibilities for different results.

Most languages if they do not already have a built-in way to handle JSON (e.g. Python & JS) then they’ll have a 3rd party utility that is perfectly sufficient (see Newtonsoft JSON library for .NET)

Each language that I’m aware of will produce valid JSON, which means that it can be parsed by each other language that provides a JSON parser.

Answered By: Wayne Werner

I would split this into two problems.

  1. How do you get the same serialized string in both JavaScript and Python?
  2. Which byte array hash function should you use? It must be an established algorithm with identical implementations in both JavaScript and Python.

Use (1) to get two strings, then UTF8 encode, then use (2) to get hashes.

Since (2) is straightforward, I’ll only address (1).

There are multiple facets to the problem of making sure the two JSON strings you generate are identical.

  • You’ll want to used unformatted JSON (no extraneous spaces, tabs, or newlines).
  • null values must be treated identically. Some serializers will by default throw away a dictionary key-value pair if the value is null.
  • Ordering of key-value pairs within a dictionary must be consistent.
  • JSON number serialization should be consistent. For example, you can’t have integer one serialize as 1 on one side and 1.0 on the other. (This probably won’t be as big of an issue however.)
  • The string encoding should be the same for both. JSON allows serialization to Unicode text, only mandating that " and be backslash-escaped in JSON strings. Most serializers do more than necessary, however, and reduce almost all Unicode characters to the uXXXX equivalent. See json.org for the details on JSON string encoding. One way to remove all ambiguity is to only escape when absolutely necessary.

You’ll want to make sure all of these are matched between JavaScript and Python. Most JSON serialization libraries I’ve used provide configuration hooks for all of the things I mention in the list above. Unfortunately, I’m not very familiar with the JavaScript or Python libraries.

Answered By: Timothy Shields

I thought I might attempt a practical example.

In javascript I did:

import stringify from 'json-stable-stringify'
import sha256 from 'simple-sha256'

hash_str = sha256(stringify({'hello':'goodbye', '123': 456}))
// hash_str = 72804f4e0847a477ee69eae4fbf404b03a6c220bacf8d5df34c964985acd473f

json-stable-stringify guarantees a sorted json. sha256 allows for nodejs / browser compatibility.

In python 3.8 I did:

import hashlib
import json

hash_str = hashlib.sha256(json.dumps({'hello':'goodbye', '123': 456}, sort_keys=True, separators=(',', ':')).encode("utf-8")).hexdigest()
# hash_str = 72804f4e0847a477ee69eae4fbf404b03a6c220bacf8d5df34c964985acd473f

I haven’t yet done extensive testing but with the json objects I’ve tried, it has successfully matched.

Answered By: pchtsp

In python, you can do this with merkle-json which can generate a unique hash for any dict (in python) or json object, it also supports lists and nested objects.

After installing:

pip install merkle-json

use it like this:

from merkle_json import MerkleJson

mj = MerkleJson()
obj = {
    'keyC': [3,4],
    'keyA': 2,
    'keyB': 4,
    'keyD': 1,
}
mj.hash(obj) # '7001bd2b415e6a624a23d7bc7c249b21'

Answered By: gent
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.