Spacy: OSError: [E050] Can't find model on Google Colab | Python

Question:

I’m trying to lemmatize Spanish text using the Spanish core model es_core_news_sm, but I’m getting an OSError.

The following code is an example of lemmatization using spaCy on Google Colab:

import spacy
spacy.prefer_gpu()

nlp = spacy.load('es_core_news_sm')
text = 'yo canto, tú cantas, ella canta, nosotros cantamos, cantáis, cantan…'
doc = nlp(text)
lemmas = [tok.lemma_.lower() for tok in doc]

I also tried importing the model package directly, but that didn’t work either and produced a similar traceback.

import es_core_news_sm
nlp = es_core_news_sm.load()

Traceback:

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-93-fd65d69a4f87> in <module>()
      2 spacy.prefer_gpu()
      3 
----> 4 nlp = spacy.load('es_core_web_sm')
      5 text = 'yo canto, tú cantas, ella canta, nosotros cantamos, cantáis, cantan…'
      6 doc = nlp(text)

1 frames
/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model(name, **overrides)
    137     elif hasattr(name, "exists"):  # Path or Path-like to model data
    138         return load_model_from_path(name, **overrides)
--> 139     raise IOError(Errors.E050.format(name=name))
    140 
    141 

OSError: [E050] Can't find model 'es_core_web_sm'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
Asked By: Y4RD13


Answers:

You first need to download the data:

!spacy download es_core_news_sm

Then restart the runtime, after which your code will run correctly:

import spacy
spacy.prefer_gpu()

nlp = spacy.load('es_core_news_sm')
text = 'yo canto, tú cantas, ella canta, nosotros cantamos, cantáis, cantan…'
doc = nlp(text)
lemmas = [tok.lemma_.lower() for tok in doc]
print(len(lemmas))
# output: 16
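
As a side note, the same download can be triggered from Python instead of the shell; the sketch below uses spacy.cli.download, which is the function the ! command invokes (a runtime restart may still be needed before the newly installed package is importable):

import spacy
from spacy.cli import download

# Equivalent to `!spacy download es_core_news_sm`, but callable from Python.
download('es_core_news_sm')

nlp = spacy.load('es_core_news_sm')
doc = nlp('yo canto, tú cantas, ella canta')
print([tok.lemma_.lower() for tok in doc])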
Answered By: jakevdp

I ran into similar issues and did the following. You will need torchtext for this example, along with the German and English spaCy models.

import spacy  # the models below must be downloaded first, e.g. !spacy download de_core_news_sm

spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

I call the tokenizers through functions. For example:

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

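For example, calling them on a sample sentence (output shown as comments; the exact tokens depend on the model version):

print(tokenize_en('The cat sat on the mat.'))
# ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

print(tokenize_de('Der Hund läuft schnell.'))
# reversed: ['.', 'schnell', 'läuft', 'Hund', 'Der']
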
---------

# Requires the legacy torchtext Field API
# (torchtext <= 0.8, or torchtext.legacy.data in 0.9-0.11).
from torchtext.data import Field

MAX_LEN = 100  # maximum sequence length; pick a value that suits your data

SRC = Field(tokenize=tokenize_de,
            init_token='<sos>',
            eos_token='<eos>',
            fix_length=MAX_LEN,
            lower=True)

TRG = Field(tokenize=tokenize_en,
            init_token='<sos>',
            eos_token='<eos>',
            fix_length=MAX_LEN,
            lower=True)
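
To sanity-check the fields without loading a dataset, Field.preprocess applies the tokenizer and lowercasing to a single raw string (a minimal sketch; the example sentences are arbitrary):

print(SRC.preprocess('Der Hund läuft.'))
# ['.', 'läuft', 'hund', 'der']  (German, lowercased and reversed)

print(TRG.preprocess('The dog is running.'))
# ['the', 'dog', 'is', 'running', '.']

Note that init_token and eos_token are not added at this stage; the field inserts them later, when batches are padded and numericalized.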