Pickle alternatives

Question:

I am trying to serialize a large (~10**6 rows, each with ~20 values) list, to be used later by myself (so pickle’s lack of safety isn’t a concern).

Each row of the list is a tuple of values, derived from some SQL database. So far, I have seen datetime.datetime, strings, integers, and NoneType, but I might eventually have to support additional data types.

For serialization, I’ve considered pickle (cPickle), json, and plain text – but only pickle saves the type information: json can’t serialize datetime.datetime, and plain text has its obvious disadvantages.

However, cPickle is pretty slow for data this large, and I’m looking for a faster alternative.

Asked By: Guy Adini


Answers:

I think you should give PyTables a look. It should be ridiculously fast, at least faster than using an RDBMS, since it’s very lax and doesn’t impose any read/write restrictions, plus you get a better interface for managing your data, at least compared to pickling it.

Answered By: Filip Dupanović

I usually serialize to plain text (*.csv) because I found it to be fastest. The csv module works quite well. See http://docs.python.org/library/csv.html

If you have to deal with unicode for your strings, check out the UnicodeReader and UnicodeWriter examples at the end.

If you serialize for your own future use, I guess it would suffice to know that each csv column always holds the same data type (e.g., strings are always in column 2).
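
A minimal sketch of that fixed-type-per-column idea, using only the standard library. The column layout, the `CONVERTERS` tuple, and the helper names are assumptions made up for illustration; they are not from the answer. `None` is written as an empty cell, so a genuinely empty string would come back as `None`, and microseconds are dropped by the chosen timestamp format.

```python
import csv
import datetime
import io

# Assumed timestamp format; microseconds are not preserved.
DT_FORMAT = "%Y-%m-%dT%H:%M:%S"

def parse_datetime(text):
    return datetime.datetime.strptime(text, DT_FORMAT)

# Hypothetical layout: one converter per column, in column order.
CONVERTERS = (int, str, parse_datetime, int)

def encode_cell(value):
    # None becomes an empty cell; datetimes become fixed-format text.
    if value is None:
        return ""
    if isinstance(value, datetime.datetime):
        return value.strftime(DT_FORMAT)
    return value

def write_rows(rows, f):
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([encode_cell(v) for v in row])

def read_rows(f):
    # Rebuild each tuple by applying the column's converter to its cell.
    return [
        tuple(None if cell == "" else conv(cell)
              for conv, cell in zip(CONVERTERS, record))
        for record in csv.reader(f)
    ]

# Hypothetical sample rows standing in for the SQL-derived tuples.
rows = [
    (1, "alice", datetime.datetime(2012, 3, 1, 12, 30, 0), None),
    (2, "bob, jr.", datetime.datetime(2012, 3, 2, 8, 15, 0), 42),
]

buffer = io.StringIO(newline="")
write_rows(rows, buffer)
buffer.seek(0)
restored = read_rows(buffer)
```

With a real file, open it with `newline=""` as the csv module documentation recommends.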

Answered By: Bogdan Vasilescu

Pickle is actually quite fast so long as you aren’t using the (default) ASCII protocol. Just make sure to dump using protocol=pickle.HIGHEST_PROTOCOL.
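
As a rough sketch of what that looks like in practice: the sample rows and the `dump_rows`/`load_rows` helper names below are made up for illustration. Note that on Python 3 the default protocol is already a binary one, so the explicit argument matters most on Python 2.

```python
import datetime
import pickle

# Hypothetical sample rows standing in for the SQL-derived tuples.
rows = [
    (1, "alice", datetime.datetime(2012, 3, 1, 12, 30), None),
    (2, "bob", datetime.datetime(2012, 3, 2, 8, 15), 42),
]

def dump_rows(rows, path):
    # Binary protocols need the file opened in binary mode.
    with open(path, "wb") as f:
        pickle.dump(rows, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_rows(path):
    # The protocol is detected automatically when loading.
    with open(path, "rb") as f:
        return pickle.load(f)
```

Calling `dump_rows(rows, "rows.pkl")` and then `load_rows("rows.pkl")` should give back an equal list, with the datetime and None values intact.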

Answered By: Jake Biesinger

Protocol buffers are a flexible, efficient, automated mechanism for
serializing structured data – think XML, but smaller, faster, and
simpler.

Compared with XML, protocol buffers:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically

https://developers.google.com/protocol-buffers/docs/pythontutorial

Answered By: gustavodiazjaimes

Depending on what exactly you want to store, there are other alternatives:

These can be compared on the following criteria:

  • Ease of use / Programming language support / Tooling support
  • Being readable by a human
  • Storage size
  • Read-time
  • Write-time
  • Features: (1) appending data, (2) reading a single row, (3) having a schema

Answered By: Martin Thoma

For hundreds of thousands of Python objects of simple complexity (up to JSON-compatible), I’ve found the best combination of simplicity, speed, and size by combining gzip with UBJSON (the ubjson module used below):

It beats pickle and cPickle options by orders of magnitude.

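import gzip
import ubjson  # third-party package (py-ubjson)

# save_items/load_items are illustrative wrapper names.
def save_items(items, filename):
    with gzip.open(filename, 'wb') as f:
        ubjson.dump(items, f)

def load_items(filename):
    with gzip.open(filename, 'rb') as f:
        return ubjson.load(f)

Answered By: Apalala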

Avro seems to be a promising, well-designed solution, though it is not yet popular.

Answered By: SergeyR

Just for the sake of completeness, there is also the dill library, which extends pickle.

How to dill (pickle) to file?

Answered By: sophros