What are the different use cases of joblib versus pickle?
Question:
Background: I’m just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle.
it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string
I read this Q&A on Pickle,
Common use-cases for pickle in Python and wonder if the community here can share the differences between joblib and pickle? When should one use one over another?
Answers:
- joblib is usually significantly faster on large numpy arrays because it has a special handling for the array buffers of the numpy datastructure. To find about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling using zlib or lz4.
- joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it which makes it possible to share memory between processes.
- if you don’t pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small python objects (e.g. a large dict of str objects) because the pickle module of the standard library is implemented in C while joblib is pure python.
- since PEP 574 (Pickle protocol 5) has been merged in Python 3.8, it is now much more efficient (memory-wise and cpu-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4GB or more.
- But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory mapped mode with
mmap_mode="r"
.
I came across same question, so i tried this one (with Python 2.7) as i need to load a large pickle file
#comapare pickle loaders
from time import time
import pickle
import os
try:
import cPickle
except:
print "Cannot import cPickle"
import joblib
t1 = time()
lis = []
d = pickle.load(open("classi.pickle","r"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
cPickle.load(open("classi.pickle","r"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"),"KB =>", time()-t1
Output for this is
time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392
According to this joblib works better than cPickle and Pickle module from these 3 modules. Thanks
Thanks to Gunjan for giving us this script! I modified it for Python3 results
#comapare pickle loaders
from time import time
import pickle
import os
import _pickle as cPickle
from sklearn.externals import joblib
file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
t1 = time()
lis = []
d = pickle.load(open(file,"rb"))
print("time for loading file size with pickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
cPickle.load(open(file,"rb"))
print("time for loading file size with cpickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file),"KB =>", time()-t1)
time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891
Just a humble note …
Pickle is better for fitted scikit-learn estimators/ trained models. In ML applications trained models are saved and loaded back up for prediction mainly.
(I’m not able to comment, but nobody has mentioned this).
There’s still a difference between pickle files created by pickle
and joblib
when saving an instance of a custom class:
In the case of the standard pickle file, you must import/load into memory all the classes that were used to create the saved object (it could be multiple by inheritance/composition) to be able to load the object, if not, an error Can't get attribute 'my_custom_class' on <module '__main__' from 'main.py'>
will be thrown. In the case of joblib
, there is no need to import/load the classes’ definition/s.
PS: This is still true for python 3.10
and joblib > 1.0
Background: I’m just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle.
it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big data, but can only pickle to the disk and not to a string
I read this Q&A on Pickle,
Common use-cases for pickle in Python and wonder if the community here can share the differences between joblib and pickle? When should one use one over another?
- joblib is usually significantly faster on large numpy arrays because it has a special handling for the array buffers of the numpy datastructure. To find about the implementation details you can have a look at the source code. It can also compress that data on the fly while pickling using zlib or lz4.
- joblib also makes it possible to memory map the data buffer of an uncompressed joblib-pickled numpy array when loading it which makes it possible to share memory between processes.
- if you don’t pickle large numpy arrays, then regular pickle can be significantly faster, especially on large collections of small python objects (e.g. a large dict of str objects) because the pickle module of the standard library is implemented in C while joblib is pure python.
- since PEP 574 (Pickle protocol 5) has been merged in Python 3.8, it is now much more efficient (memory-wise and cpu-wise) to pickle large numpy arrays using the standard library. Large arrays in this context means 4GB or more.
- But joblib can still be useful with Python 3.8 to load objects that have nested numpy arrays in memory mapped mode with
mmap_mode="r"
.
I came across same question, so i tried this one (with Python 2.7) as i need to load a large pickle file
#comapare pickle loaders
from time import time
import pickle
import os
try:
import cPickle
except:
print "Cannot import cPickle"
import joblib
t1 = time()
lis = []
d = pickle.load(open("classi.pickle","r"))
print "time for loading file size with pickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
cPickle.load(open("classi.pickle","r"))
print "time for loading file size with cpickle", os.path.getsize("classi.pickle"),"KB =>", time()-t1
t1 = time()
joblib.load("classi.pickle")
print "time for loading file size joblib", os.path.getsize("classi.pickle"),"KB =>", time()-t1
Output for this is
time for loading file size with pickle 1154320653 KB => 6.75876188278
time for loading file size with cpickle 1154320653 KB => 52.6876490116
time for loading file size joblib 1154320653 KB => 6.27503800392
According to this joblib works better than cPickle and Pickle module from these 3 modules. Thanks
Thanks to Gunjan for giving us this script! I modified it for Python3 results
#comapare pickle loaders
from time import time
import pickle
import os
import _pickle as cPickle
from sklearn.externals import joblib
file = os.path.join(os.path.dirname(os.path.realpath(__file__)), 'database.clf')
t1 = time()
lis = []
d = pickle.load(open(file,"rb"))
print("time for loading file size with pickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
cPickle.load(open(file,"rb"))
print("time for loading file size with cpickle", os.path.getsize(file),"KB =>", time()-t1)
t1 = time()
joblib.load(file)
print("time for loading file size joblib", os.path.getsize(file),"KB =>", time()-t1)
time for loading file size with pickle 79708 KB => 0.16768312454223633
time for loading file size with cpickle 79708 KB => 0.0002372264862060547
time for loading file size joblib 79708 KB => 0.0006849765777587891
Just a humble note …
Pickle is better for fitted scikit-learn estimators/ trained models. In ML applications trained models are saved and loaded back up for prediction mainly.
(I’m not able to comment, but nobody has mentioned this).
There’s still a difference between pickle files created by pickle
and joblib
when saving an instance of a custom class:
In the case of the standard pickle file, you must import/load into memory all the classes that were used to create the saved object (it could be multiple by inheritance/composition) to be able to load the object, if not, an error Can't get attribute 'my_custom_class' on <module '__main__' from 'main.py'>
will be thrown. In the case of joblib
, there is no need to import/load the classes’ definition/s.
PS: This is still true for python 3.10
and joblib > 1.0