Change nltk.download() path directory from default ~/nltk_data
Question:
I was trying to download/update Python nltk packages on a computing server and it returned this error: [Errno 122] Disk quota exceeded.
Specifically:
[nltk_data] Downloading package stop words to /home/sh2264/nltk_data...
[nltk_data] Error downloading u'stopwords' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/corpora/stopwords.zip>: [Errno 122]
[nltk_data] Disk quota exceeded:
[nltk_data] u'/home/sh2264/nltk_data/corpora/stopwords.zip
False
How could I change the entire path for nltk packages, and what other changes should I make to ensure errorless loading of nltk?
Answers:
According to the documentation:
By default, packages are installed in either a system-wide directory (if Python has sufficient access to write to it); or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.
To specify the download directory, use for example:
nltk.download('treebank', download_dir='/mnt/data/treebank')
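Downloading to a custom directory is only half of it: nltk also has to look there when loading. A minimal sketch of doing both from Python, assuming nltk is installed; register_data_dir and download_to are hypothetical helper names, not part of nltk, and the path in the usage note is just the one from the example above:

```python
import os


def register_data_dir(search_path, target_dir):
    """Put target_dir at the front of a search-path list, without duplicates."""
    target_dir = os.path.abspath(os.path.expanduser(target_dir))
    if target_dir in search_path:
        search_path.remove(target_dir)
    search_path.insert(0, target_dir)
    return target_dir


def download_to(package, target_dir):
    """Download one NLTK package into target_dir and make it loadable."""
    import nltk  # lazy import: only needed when a download is requested

    os.makedirs(target_dir, exist_ok=True)
    # nltk.data.path is the list of directories nltk searches for data.
    target_dir = register_data_dir(nltk.data.path, target_dir)
    return nltk.download(package, download_dir=target_dir)
```

Calling download_to('treebank', '/mnt/data/treebank') would then fetch the package and let the current session load it from that directory. The change to nltk.data.path only lasts for the running process.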
This can be configured either from the command line (nltk.download(..., download_dir=)) or from the GUI. Bizarrely, nltk seems to totally ignore its own environment variable NLTK_DATA and default its download directories to a standard set of five paths, regardless of whether NLTK_DATA is defined and where it points, and regardless of whether nltk’s five default dirs even exist on the machine or architecture(!). Some of that is documented in Installing NLTK Data, although it’s incomplete and kinda buried; reproduced below with much clearer formatting:
Command line installation
The downloader will search for an existing nltk_data directory to install NLTK data. If one does not exist it will attempt to create one in a central location (when using an administrator account) or otherwise in the user’s filespace. If necessary, run the download command from an administrator account, or using sudo. The recommended system location is: C:\nltk_data (Windows); /usr/local/share/nltk_data (Mac) and /usr/share/nltk_data (Unix).
You can use the -d flag to specify a different location (but if you do this, be sure to set the NLTK_DATA environment variable accordingly).
- Run the command python -m nltk.downloader all
- To ensure central installation, run the command: sudo python -m nltk.downloader -d /usr/local/share/nltk_data all
- But really they should say: sudo python -m nltk.downloader -d $NLTK_DATA all
Now as to what recommended path NLTK_DATA should use, nltk doesn’t really give any proper guidance, but it should be a generic standalone path not under any install tree (so not under <python-install-directory>/lib/site-packages) or any user dir. Hence, /usr/local/share, /opt/share or similar. On MacOS 10.7+, /usr and thus /usr/local/ are hidden by default these days, so /opt/share may well be a better choice. Or do chflags nohidden /usr/local/share.
The NLTK GUI can be started from the PyCharm Community Edition Python console too.
Just issue 2 commands:
1) import nltk
2) nltk.download_gui()
However, the nltk GUI will not work if you are behind a proxy server. In that case you must first set the proxy at the console:
SET HTTP_PROXY=proxy.mycompany.com:8080
and then it will work.
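The same proxy setting can be applied from inside Python before starting a download, which helps when the console environment is not under your control. A rough sketch using only the standard library; configure_proxy is a hypothetical helper name, and the http:// scheme in the usage note is an assumption (the SET line above gives only host and port):

```python
import os
import urllib.request


def configure_proxy(proxy_url):
    """Export proxy variables for this process and report what urllib sees."""
    # Set both spellings: some platforms and tools read only the lowercase names.
    for name in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy"):
        os.environ[name] = proxy_url
    # getproxies() returns the scheme -> proxy mapping urllib will use.
    return urllib.request.getproxies()
```

Calling configure_proxy("http://proxy.mycompany.com:8080") before nltk.download() or nltk.download_gui() should route the download through the proxy, assuming the downloader goes through urllib's default opener. nltk also provides nltk.set_proxy(), which may be more convenient when the proxy needs a user name and password.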
This would be the solution purely from shell commands (no python code).
First use this command line to download to a custom directory:
python -m nltk.downloader -d /my/path/nltk_data all
And then at runtime for nltk to find the custom directory, set the environment variable:
export NLTK_DATA=/my/path/nltk_data
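If you cannot edit the shell environment (a batch job or a notebook, say), the variable can also be set from Python itself. A minimal sketch; use_data_dir is a hypothetical helper name, and it relies on the assumption that nltk reads NLTK_DATA when it is first imported, so the call has to come before import nltk:

```python
import os


def use_data_dir(data_dir):
    """Prepend data_dir to NLTK_DATA for this process and return the new value."""
    data_dir = os.path.abspath(os.path.expanduser(data_dir))
    # NLTK_DATA may already hold several directories joined by os.pathsep.
    existing = [d for d in os.environ.get("NLTK_DATA", "").split(os.pathsep) if d]
    if data_dir not in existing:
        existing.insert(0, data_dir)
    os.environ["NLTK_DATA"] = os.pathsep.join(existing)
    return os.environ["NLTK_DATA"]
```

Typical use is use_data_dir("/my/path/nltk_data") as the first line of the script, followed by import nltk. Unlike the export line, this affects only the current process and its children.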
tl;dr
If the environment variable NLTK_DATA is set and the directory exists, it is used as the default download directory.
Explanation
In the data module, the environment variable NLTK_DATA is used as the first entry when filling the data search path.
If download_dir is not specified as a parameter when calling nltk.download(), the method default_download_dir() determines the download directory.
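The selection rule described here can be pictured roughly as follows. This is a simplified paraphrase and not nltk's actual source: pick_download_dir is a hypothetical name, and both the writability check and the ~/nltk_data fallback are assumptions about how the real method behaves:

```python
import os


def pick_download_dir(search_path, home_fallback="~/nltk_data"):
    """Return the first existing, writable directory in search_path."""
    for candidate in search_path:
        if os.path.isdir(candidate) and os.access(candidate, os.W_OK):
            return candidate
    # Nothing usable on the search path: fall back to the user's home directory.
    return os.path.expanduser(home_fallback)
```

With NLTK_DATA at the front of the search path, an existing and writable NLTK_DATA directory wins. To see what your installation really picks, nltk.downloader.Downloader().default_download_dir() reports the directory nltk itself would use.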
Example: Create new default data directory
One would like to use /usr/local/share/nltk_data as the default data directory.
Create the data directory and add NLTK_DATA to your shell profile.
$ mkdir /usr/local/share/nltk_data
$ echo "export NLTK_DATA=/usr/local/share/nltk_data" >> ~/.bashrc
$ source ~/.bashrc
$ echo $NLTK_DATA
/usr/local/share/nltk_data
Now nltk uses /usr/local/share/nltk_data as defined in NLTK_DATA.
$ python
Python 3.10.6 (main, Sep 5 2022, 11:08:58) [Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data] /usr/local/share/nltk_data...
[nltk_data] Unzipping corpora/gutenberg.zip.
True
Example: Switch default data directory
The current data directory is ~/nltk_data and one would like to use the directory /usr/local/share/nltk_data instead.
Move the data directory and add NLTK_DATA to your shell profile.
$ mv ~/nltk_data /usr/local/share/nltk_data
$ echo "export NLTK_DATA=/usr/local/share/nltk_data" >> ~/.bashrc
$ source ~/.bashrc
$ echo $NLTK_DATA
/usr/local/share/nltk_data
Now nltk uses /usr/local/share/nltk_data as defined in NLTK_DATA.
$ python
Python 3.10.6 (main, Sep 5 2022, 11:08:58) [Clang 11.0.0 (clang-1100.0.33.17)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('gutenberg')
[nltk_data] Downloading package gutenberg to
[nltk_data] /usr/local/share/nltk_data...
[nltk_data] Package gutenberg is already up-to-date!
True