sklearn doesn't have attribute 'datasets'

Question:

I have started using scikit-learn for my work. I was going through the tutorial, which gives the standard procedure for loading some datasets:

$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()

However, for my convenience, I tried loading the data in the following way:

In [1]: import sklearn

In [2]: iris = sklearn.datasets.load_iris()

However, this throws the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-db77d2036db5> in <module>()
----> 1 iris = sklearn.datasets.load_iris()

AttributeError: 'module' object has no attribute 'datasets'

However, if I use the apparently similar method:

In [3]: from sklearn import datasets

In [4]: iris = datasets.load_iris()

It works without any problem. In fact, the following now also works:

In [5]: iris = sklearn.datasets.load_iris()

I am completely confused about this. Am I missing something very trivial? What is the difference between the two approaches?

Asked By: Peaceful


Answers:

sklearn is a package. As another answer put it very succinctly:

when you import a package, only variables/functions/classes in the __init__.py file of that package are directly visible, not sub-packages or modules.

datasets is a sub-package of sklearn. This is why this happens:

In [1]: import sklearn

In [2]: sklearn.datasets
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-325a2bfc35d0> in <module>()
----> 1 sklearn.datasets

AttributeError: module 'sklearn' has no attribute 'datasets'

However, the reason why this works:

In [3]: from sklearn import datasets

In [4]: sklearn.datasets
Out[4]: <module 'sklearn.datasets' from '/home/ethan/.virtualenvs/test3/lib/python3.5/site-packages/sklearn/datasets/__init__.py'>

is that when you load the sub-package datasets with from sklearn import datasets, the import system automatically binds it as an attribute in the namespace of the package sklearn. This is one of the lesser-known “traps” of the Python import system.
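You can reproduce this behaviour without sklearn at all. The following is a minimal sketch that builds a throwaway package in a temporary directory; the names demo_pkg, sub and load are made up for illustration and have nothing to do with sklearn's actual layout.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk: demo_pkg/ containing a sub-package demo_pkg/sub/.
# All names here are hypothetical and exist only for this illustration.
root = tempfile.mkdtemp()
sub_dir = os.path.join(root, "demo_pkg", "sub")
os.makedirs(sub_dir)
with open(os.path.join(root, "demo_pkg", "__init__.py"), "w") as f:
    f.write("")  # empty __init__.py: it does not import the sub-package
with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
    f.write("def load():\n    return 'loaded'\n")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_pkg

# The sub-package has not been imported yet, so it is not an attribute of demo_pkg.
before = hasattr(demo_pkg, "sub")

from demo_pkg import sub

# Importing the sub-package also bound it as an attribute of the parent package.
after = hasattr(demo_pkg, "sub")
result = demo_pkg.sub.load()

print(before, after, result)  # False True loaded
```

Writing import demo_pkg.sub instead of from demo_pkg import sub has the same side effect, so in the question's case import sklearn.datasets would also make sklearn.datasets.load_iris() available.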

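Also, note that if you look at the __init__.py for sklearn you will see 'datasets' as a member of __all__, but __all__ only affects star-imports. It does not make sklearn.datasets available after a plain import sklearn; it only allows you to do: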

In [1]: from sklearn import *
In [2]: datasets
Out[2]: <module 'sklearn.datasets' from '/home/ethan/.virtualenvs/test3/lib/python3.5/site-packages/sklearn/datasets/__init__.py'>
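To see the effect of __all__ in isolation, here is a small sketch using another temporary package. The name demo_all_pkg and its contents are assumptions made up for the example, not a copy of sklearn's __init__.py.

```python
import importlib
import os
import sys
import tempfile

# Hypothetical package demo_all_pkg whose __init__.py only lists 'sub' in __all__.
root = tempfile.mkdtemp()
sub_dir = os.path.join(root, "demo_all_pkg", "sub")
os.makedirs(sub_dir)
with open(os.path.join(root, "demo_all_pkg", "__init__.py"), "w") as f:
    f.write("__all__ = ['sub']\n")
with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
    f.write("")

sys.path.insert(0, root)
importlib.invalidate_caches()

import demo_all_pkg

# Being listed in __all__ does not import the sub-package by itself.
listed_but_not_loaded = not hasattr(demo_all_pkg, "sub")

# Star-imports are only allowed at module level, so run one in a fresh namespace.
namespace = {}
exec("from demo_all_pkg import *", namespace)

# The star-import imported the sub-package and bound it in both places.
star_import_loaded_it = "sub" in namespace and hasattr(demo_all_pkg, "sub")

print(listed_but_not_loaded, star_import_loaded_it)  # True True
```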

One last point to note is that if you inspect either sklearn or datasets you will see that, although they are packages, their type is module. This is because all packages are considered modules – however, not all modules are packages.
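A quick way to check this yourself, sketched here with the standard library's json package so it runs without sklearn installed: both objects have type module, but only the package has a __path__ attribute.

```python
import json
import json.decoder
import types

# json is a package (a directory with an __init__.py); json.decoder is a plain module.
both_are_modules = (
    isinstance(json, types.ModuleType)
    and isinstance(json.decoder, types.ModuleType)
)

# Only packages carry a __path__ attribute, which the import system uses
# to locate their sub-modules and sub-packages.
json_is_package = hasattr(json, "__path__")
decoder_is_package = hasattr(json.decoder, "__path__")

print(both_are_modules, json_is_package, decoder_is_package)  # True True False
```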

Answered By: elethan