Pickle or json?
Question:
I need to save to disk a little dict object whose keys are of type str and values are ints, and then recover it. Something like this:
{'juanjo': 2, 'pedro': 99, 'other': 333}
What is the best option and why? Serialize it with pickle or with simplejson?
I am using Python 2.6.
Answers:
If you do not have any interoperability requirements (e.g. you are just going to use the data with Python) and a binary format is fine, go with cPickle which gives you really fast Python object serialization.
If you want interoperability or you want a text format to store your data, go with JSON (or some other appropriate format depending on your constraints).
I prefer JSON over pickle for my serialization. Unpickling can run arbitrary code, and using pickle to transfer data between programs or store data between sessions is a security hole. JSON does not introduce a security hole and is standardized, so the data can be accessed by programs in different languages if you ever need to.
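For the small str-to-int dict in the question, the JSON round trip is one call each way. A minimal sketch (the file name is an arbitrary choice):

```python
import json

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# Dump the dict to disk as human-readable JSON text ...
with open('data.json', 'w') as f:
    json.dump(data, f)

# ... and load it back; str keys and int values survive unchanged.
with open('data.json') as f:
    recovered = json.load(f)

print(recovered)  # {'juanjo': 2, 'pedro': 99, 'other': 333}
```

Because the file is plain text, you can also open it in any editor to inspect the data.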
JSON or pickle? How about JSON and pickle!
You can use jsonpickle. It is easy to use, and the file on disk is readable because it’s JSON.
You might also find this interesting, with some charts to compare: http://kovshenin.com/archives/pickle-vs-json-which-is-faster/
Personally, I generally prefer JSON because the data is human-readable. That said, if you need to serialize something that JSON won’t take, then use pickle.
But for most data storage, you won’t need to serialize anything weird and JSON is much easier and always allows you to pop it open in a text editor and check out the data yourself.
The speed is nice, but for most datasets the difference is negligible; Python generally isn’t too fast anyways.
If you are primarily concerned with speed and space, use cPickle because cPickle is faster than JSON.
If you are more concerned with interoperability, security, and/or human readability, then use JSON.
The test results referenced in other answers were recorded in 2010; updated tests in 2016 with cPickle protocol 2 show:
- cPickle loads about 3.8x faster
- cPickle dumps about 1.5x faster
- cPickle produces a slightly smaller encoding
Reproduce this yourself with this gist, which is based on Konstantin’s benchmark referenced in other answers, but uses cPickle with protocol 2 instead of pickle, and json instead of simplejson (since json is faster than simplejson), e.g.
wget https://gist.github.com/jdimatteo/af317ef24ccf1b3fa91f4399902bb534/raw/03e8dbab11b5605bc572bc117c8ac34cfa959a70/pickle_vs_json.py
python pickle_vs_json.py
Results with Python 2.7 on a decent 2015 Xeon processor:
Dir   Entries  Method  Time    Length
dump  10       JSON     0.017   1484510
load  10       JSON     0.375   -
dump  10       Pickle   0.011   1428790
load  10       Pickle   0.098   -
dump  20       JSON     0.036   2969020
load  20       JSON     1.498   -
dump  20       Pickle   0.022   2857580
load  20       Pickle   0.394   -
dump  50       JSON     0.079   7422550
load  50       JSON     9.485   -
dump  50       Pickle   0.055   7143950
load  50       Pickle   2.518   -
dump  100      JSON     0.165  14845100
load  100      JSON    37.730   -
dump  100      Pickle   0.107  14287900
load  100      Pickle   9.907   -
I have tried several methods and found that using cPickle with the protocol argument of the dumps method set to the highest protocol, i.e. cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL), is the fastest dump method. (This is Python 2 code; in Python 3 the plain pickle module plays the same role.)
import msgpack
import json
import pickle
import timeit
import cPickle
import numpy as np
num_tests = 10
obj = np.random.normal(0.5, 1, [240, 320, 3])
command = 'pickle.dumps(obj)'
setup = 'from __main__ import pickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("pickle: %f seconds" % result)
command = 'cPickle.dumps(obj)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle: %f seconds" % result)
command = 'cPickle.dumps(obj, protocol=cPickle.HIGHEST_PROTOCOL)'
setup = 'from __main__ import cPickle, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("cPickle highest: %f seconds" % result)
command = 'json.dumps(obj.tolist())'
setup = 'from __main__ import json, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("json: %f seconds" % result)
command = 'msgpack.packb(obj.tolist())'
setup = 'from __main__ import msgpack, obj'
result = timeit.timeit(command, setup=setup, number=num_tests)
print("msgpack: %f seconds" % result)
Output:
pickle : 0.847938 seconds
cPickle : 0.810384 seconds
cPickle highest: 0.004283 seconds
json : 1.769215 seconds
msgpack : 0.270886 seconds
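The large gap between plain cPickle and "cPickle highest" comes down to the default protocol: in Python 2, dumps defaults to protocol 0, an ASCII format, while protocol 2 is binary. A small sketch of the size difference (Python 3 syntax, plain ints instead of a numpy array):

```python
import pickle

data = list(range(10000))

p0 = pickle.dumps(data, protocol=0)  # ASCII protocol, the old Python 2 default
p2 = pickle.dumps(data, protocol=2)  # binary protocol

# The binary encoding is considerably more compact, and both round-trip.
print(len(p0), len(p2))
```

The encoding work per element is also much cheaper in the binary protocol, which is where most of the speedup comes from.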
Most answers are quite old and miss some info.
For the statement "Unpickling can run arbitrary code":
- Check the example in https://docs.python.org/3/library/pickle.html#restricting-globals
import pickle
pickle.loads(b"cos\nsystem\n(S'echo hello world'\ntR.")
pickle.loads(b"cos\nsystem\n(S'pwd'\ntR.")
pwd can be replaced, e.g. by rm, to delete files.
- Check https://checkoway.net/musings/pickle/ for a more sophisticated "run arbitrary code" template. The code is written in Python 2.7, but I guess with some modification it could also work in Python 3. If you make it work in Python 3, please add the Python 3 version to my answer. 🙂
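The restricting-globals section of the docs linked above shows a mitigation: subclass pickle.Unpickler and whitelist what find_class may return. A sketch along those lines (Python 3; the set of allowed names here is an illustrative choice, not a recommendation):

```python
import builtins
import io
import pickle

SAFE_BUILTINS = {'list', 'dict', 'set', 'str', 'int', 'float'}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only allow a handful of harmless builtins; anything else
        # (os.system, subprocess.Popen, ...) is rejected.
        if module == 'builtins' and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))

def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()

# Plain data round-trips fine; the os.system payloads above are rejected
# with an UnpicklingError instead of executing a shell command.
print(restricted_loads(pickle.dumps({'juanjo': 2, 'pedro': 99})))
```

Note this only narrows the attack surface; for untrusted input, JSON remains the safer default.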
For the "pickle speed vs json" part:
First, there is no separate cPickle module in Python 3; the pickle module now uses the C implementation automatically.
And with this test code borrowed from another answer, pickle beats json across the board:
import pickle
import json, random
from time import time
from hashlib import md5

test_runs = 100000

if __name__ == "__main__":
    payload = {
        "float": [(random.randrange(0, 99) + random.random()) for i in range(1000)],
        "int": [random.randrange(0, 9999) for i in range(1000)],
        "str": [md5(str(random.random()).encode('utf8')).hexdigest() for i in range(1000)]
    }
    modules = [json, pickle]
    for payload_type in payload:
        data = payload[payload_type]
        for module in modules:
            start = time()
            for i in range(test_runs):
                serialized = module.dumps(data)
            w = time() - start
            start = time()
            for i in range(test_runs):
                unserialized = module.loads(serialized)
            r = time() - start
            print("%s %s W %.3f R %.3f" % (module.__name__, payload_type, w, r))
Result:
tian@tian-B250M-Wind:~/playground/pickle_vs_json$ p3 pickle_test.py
json float W 41.775 R 26.738
pickle float W 1.272 R 2.286
json int W 5.142 R 4.974
pickle int W 0.589 R 1.352
json str W 10.379 R 4.626
pickle str W 3.062 R 3.294
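If the same script still has to run on both Python 2 and Python 3, the usual compatibility shim picks cPickle where it exists and falls back to the plain module name otherwise:

```python
try:
    import cPickle as pickle  # Python 2: the explicit C implementation
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically

data = {'juanjo': 2, 'pedro': 99, 'other': 333}

# HIGHEST_PROTOCOL picks the fastest, most compact protocol available.
blob = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)
assert pickle.loads(blob) == data
```

After the shim, the rest of the code can use the pickle name uniformly on either interpreter.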