Add/remove custom stop words with spacy

Question:

What is the best way to add/remove stop words with spacy? I am using token.is_stop function and would like to make some custom changes to the set. I was looking at the documentation but could not find anything regarding of stop words. Thanks!

Asked By: E.K.

||

Answers:

You can edit them before processing your text like this (see this post):

>>> import spacy
>>> nlp = spacy.load("en")
>>> nlp.vocab["the"].is_stop = False
>>> nlp.vocab["definitelynotastopword"].is_stop = True
>>> sentence = nlp("the word is definitelynotastopword")
>>> sentence[0].is_stop
False
>>> sentence[3].is_stop
True

Note: This seems to work <=v1.8. For newer versions, see other answers.

Answered By: dantiston

Short answer for version 2.0 and above (just tested with 3.4+):

from spacy.lang.en.stop_words import STOP_WORDS

print(STOP_WORDS) # <- set of Spacy's default stop words

STOP_WORDS.add("your_additional_stop_word_here")
  • This loads all stop words as a set.
  • You can add your stop words to STOP_WORDS or use your own list in the first place.

To check if the attribute is_stop for the stop words is set to True use this:

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    print(lexeme.text, lexeme.is_stop)

In the unlikely case that stop words for some reason aren’t set to is_stop = True do this:

for word in STOP_WORDS:
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True 

Detailed explanation step by step with links to documentation.

First we import spacy:

import spacy

To instantiate class Language as nlp from scratch we need to import Vocab and Language. Documentation and example here.

from spacy.vocab import Vocab
from spacy.language import Language

# create new Language object from scratch
nlp = Language(Vocab())

stop_words is a default attribute of class Language and can be set to customize the default language data. Documentation here. You can find spacy’s GitHub repo folder with defaults for various languages here.

For our instance of nlp we get 0 stop words which is reasonable since we haven’t set any language with defaults

print(f"Language instance 'nlp' has {len(nlp.Defaults.stop_words)} default stopwords.")
>>> Language instance 'nlp' has 0 default stopwords.

Let’s import English language defaults.

from spacy.lang.en import English

Now we have 326 default stop words.

print(f"The language default English has {len(spacy.lang.en.STOP_WORDS)} stopwords.")
print(sorted(list(spacy.lang.en.STOP_WORDS))[:10])
>>> The language default English has 326 stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

Let’s create a new instance of Language, now with defaults for English. We get the same result.

nlp = English()
print(f"Language instance 'nlp' now has {len(nlp.Defaults.stop_words)} default stopwords.")
print(sorted(list(nlp.Defaults.stop_words))[:10])
>>> Language instance 'nlp' now has 326 default stopwords.
>>> ["'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across']

To check if all words are set to is_stop = True we iterate over the stop words, retrieve the lexeme from vocab and print out the is_stop attribute.

[nlp.vocab[word].is_stop for word in nlp.Defaults.stop_words][:10]
>>> [True, True, True, True, True, True, True, True, True, True]

We can add stopwords to the English language defaults.

spacy.lang.en.STOP_WORDS.add("aaaahhh-new-stopword")
print(len(spacy.lang.en.STOP_WORDS))
# these propagate to our instance 'nlp' too! 
print(len(nlp.Defaults.stop_words))
>>> 327
>>> 327

Or we can add new stopwords to instance nlp. However, these propagate to our language defaults too!

nlp.Defaults.stop_words.add("_another-new-stop-word")
print(len(spacy.lang.en.STOP_WORDS))
print(len(nlp.Defaults.stop_words))
>>> 328
>>> 328

The new stop words are set to is_stop = True.

print(nlp.vocab["aaaahhh-new-stopword"].is_stop)
print(nlp.vocab["_another-new-stop-word"].is_stop)
>>> True
>>> True
Answered By: petezurich

For 2.0 use the following:

for word in nlp.Defaults.stop_words:
    lex = nlp.vocab[word]
    lex.is_stop = True
Answered By: harryhorn

Using Spacy 2.0.11, you can update its stopwords set using one of the following:

To add a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.add("my_new_stopword")

To add several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words |= {"my_new_stopword1","my_new_stopword2",}

To remove a single stopword:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words.remove("whatever")

To remove several stopwords at once:

import spacy    
nlp = spacy.load("en")
nlp.Defaults.stop_words -= {"whatever", "whenever"}

Note: To see the current set of stopwords, use:

print(nlp.Defaults.stop_words)

Update : It was noted in the comments that this fix only affects the current execution. To update the model, you can use the methods nlp.to_disk("/path") and nlp.from_disk("/path") (further described at https://spacy.io/usage/saving-loading).

Answered By: Romain

This collects the stop words too 🙂

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

Answered By: SolitaryReaper

In latest version following would remove the word out of the list:

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords.remove('not')
Answered By: Sezin

For version 2.3.0
If you want to replace the entire list instead of adding or removing a few stop words, you can do this:

custom_stop_words = set(['the','and','a'])

# First override the stop words set for the language
cls = spacy.util.get_lang_class('en')
cls.Defaults.stop_words = custom_stop_words

# Now load your model
nlp = spacy.load('en_core_web_md')

The trick is to assign the stop word set for the language before loading the model. It also ensures that any upper/lower case variation of the stop words are considered stop words.

Answered By: Joe

See below piece of code

# Perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

# Print the set of spaCy's default stop words (remember that sets are unordered):
print(nlp.Defaults.stop_words)

len(nlp.Defaults.stop_words)

# Make list of word you want to add to stop words
list = ['apple', 'ball', 'cat']

# Iterate this in loop

for item in list:
    # Add the word to the set of stop words. Use lowercase!
    nlp.Defaults.stop_words.add(item)
    
    # Set the stop_word tag on the lexeme
    nlp.vocab[item].is_stop = True

Hope this helps. You can print length before and after to confirm.

Answered By: servolt
Categories: questions Tags: , , ,
Answers are sorted by their score. The answer accepted by the question owner as the best is marked with
at the top-right corner.