Pickle alternatives
Question:
I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle’s lack of safety isn’t a concern).
Each row of the list is a tuple of values derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.
For serialization, I’ve considered pickle (cPickle), json, and plain text – but only pickle saves the type information: json can’t serialize datetime.datetime, and plain text has its obvious disadvantages.
However, cPickle is pretty slow for data this large, and I’m looking for a faster alternative.
Answers:
I think you should give PyTables a look. It should be ridiculously fast, at least faster than using an RDBMS, since it’s very lax and doesn’t impose any read/write restrictions, plus you get a better interface for managing your data, at least compared to pickling it.
I usually serialize to plain text (*.csv) because I found it to be fastest. The csv module works quite well. See http://docs.python.org/library/csv.html
If you have to deal with Unicode strings, check out the UnicodeReader and UnicodeWriter examples at the end of that page.
If you serialize for your own future use, it should suffice to know that you have the same data type per CSV column (e.g., strings are always in column 2).
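A minimal sketch of that approach, assuming a hypothetical three-column layout; since CSV stores no type information, the per-column types and the datetime encoding (isoformat, empty field for None) are conventions you must fix yourself:

```python
import csv
import io
from datetime import datetime

# Hypothetical rows in the question's shape: (id, name, created_at).
rows = [
    (1, "alice", datetime(2020, 1, 1, 12, 0)),
    (2, "bob", None),
]

buf = io.StringIO()
writer = csv.writer(buf)
for rid, name, created in rows:
    # Encode None as an empty field and datetime via isoformat().
    writer.writerow([rid, name, created.isoformat() if created else ""])

buf.seek(0)
restored = []
for rid, name, created in csv.reader(buf):
    restored.append((
        int(rid),                                             # column 0: int
        name,                                                 # column 1: str
        datetime.fromisoformat(created) if created else None, # column 2: datetime or None
    ))

print(restored == rows)  # True: the round trip preserves the values
```

The same idea scales to a file by replacing the StringIO with open(path, newline='').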
Pickle is actually quite fast so long as you aren’t using the (default) ASCII protocol. Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
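A quick sketch of the difference (the row data is made up to match the question’s shape):

```python
import pickle
from datetime import datetime

# Made-up rows: datetimes, strings, ints, and None, as in the question.
rows = [(datetime(2020, 1, 1), "row-%d" % i, i, None) for i in range(1000)]

# Protocol 0 is the old ASCII format; HIGHEST_PROTOCOL is a compact binary one.
ascii_blob = pickle.dumps(rows, protocol=0)
binary_blob = pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)

print(len(binary_blob) < len(ascii_blob))  # True: binary is much smaller
assert pickle.loads(binary_blob) == rows   # and the round trip is lossless
```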
Protocol buffers are a flexible, efficient, automated mechanism for
serializing structured data – think XML, but smaller, faster, and
simpler.
Advantages over XML – protocol buffers:
- are simpler
- are 3 to 10 times smaller
- are 20 to 100 times faster
- are less ambiguous
- generate data access classes that are easier to use programmatically
https://developers.google.com/protocol-buffers/docs/pythontutorial
Depending on what exactly you want to store, there are other alternatives:
- Protocol Buffers: e.g. used in Caffe; maintains type information, but you have to put in quite a bit of effort compared to pickle
- MessagePack: see the python package – supports streaming (source)
- BSON: see the python package docs
The way to compare those is:
- Ease of use / Programming language support / Tooling support
- Being readable by a human
- Storage size
- Read-time
- Write-time
- Features: (1) Append data (2) Read a single row (3) Have a schema
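As a rough illustration of the read-/write-time and size criteria, here is a stdlib-only micro-benchmark sketch; the rows, counts, and repetition numbers are made up, and the rows are kept JSON-compatible (no datetimes) so that both formats can handle them:

```python
import json
import pickle
import timeit

# Made-up JSON-compatible rows.
rows = [(i, "name-%d" % i, None, i * 0.5) for i in range(10_000)]

# Write-time: serialize the same data repeatedly and compare totals.
write_pickle = timeit.timeit(
    lambda: pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL), number=20)
write_json = timeit.timeit(lambda: json.dumps(rows), number=20)
print("pickle write: %.4fs  json write: %.4fs" % (write_pickle, write_json))

# Storage size is a separate axis: compare the serialized byte counts.
print("pickle size:", len(pickle.dumps(rows, protocol=pickle.HIGHEST_PROTOCOL)))
print("json size:  ", len(json.dumps(rows).encode()))
```

Note that json turns tuples into lists and drops the distinction on read, which is exactly the kind of type-information loss the question is worried about.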
For hundreds of thousands of simple (up to JSON-compatible complexity) Python objects, I’ve found the best combination of simplicity, speed, and size by combining gzip and ubjson. It beats the pickle and cPickle options by orders of magnitude.

import gzip
import ubjson  # the py-ubjson package

def save(items, filename):
    with gzip.open(filename, 'wb') as f:
        ubjson.dump(items, f)

def load(filename):
    with gzip.open(filename, 'rb') as f:
        return ubjson.load(f)
Avro seems to be a promising and properly designed, but not yet popular, solution.
Just for the sake of completeness – there is also the dill library, which extends pickle.