How do I test whether an nltk resource is already installed on the machine running my code?

Question:

I just started my first NLTK project and am confused about the proper setup. I need several resources like the Punkt Tokenizer and the maxent pos tagger. I myself downloaded them using the GUI nltk.download(). For my collaborators I of course want that this things get downloaded automatically. I haven’t found any idiomatic code for that in the docu.

Am I supposed to just put nltk.data.load('tokenizers/punkt/english.pickle') and their like into the code? Is this going to download the resources every time the script is run? Am I to provide feedback to the user (i.e. my co-developers) of what is being downloaded and why this is taking so long? There MUST be gear out there that does the job, right? 🙂

//Edit To explify my question:
How do I test whether an nltk resource (like the Punkt Tokenizer) is already installed on the machine running my code, and install it if it is not?

Asked By: Zakum

||

Answers:

You can use the nltk.data.find() function, see https://github.com/nltk/nltk/blob/develop/nltk/data.py:

>>> import nltk
>>> nltk.data.find('tokenizers/punkt.zip')
ZipFilePathPointer(u'/home/alvas/nltk_data/tokenizers/punkt.zip', u'')

When the resource is not available you’ll find the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk-3.0a3-py2.7.egg/nltk/data.py", line 615, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'punkt.zip' not found.  Please use the NLTK Downloader
  to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************

Most probably, you would like to do something like this to ensure that your collaborators have the package:

>>> try:
...     nltk.data.find('tokenizers/punkt')
... except LookupError:
...     nltk.download('punkt')
... 
[nltk_data] Downloading package punkt to /home/alvas/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
True
Answered By: alvas

After Somnath comment, I am posting an example of the try-except workaround. Here we search for the comtrans module that is not in the nltk data by default.

from nltk.corpus import comtrans
from nltk import download

try:
    words = comtrans.words('alignment-en-fr.txt')
except LookupError:
    print('resource not found. Downloading now...')
    download('comtrans')
    words = comtrans.words('alignment-en-fr.txt')
Answered By: Fab
Categories: questions Tags: , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.