I want to check in a Python program if a word is in the English dictionary.
I believe nltk wordnet interface might be the way to go but I have no clue how to use it for such a simple task.
    def is_english_word(word):
        pass  # how do I implement is_english_word?

    is_english_word(token.lower())
In the future, I might want to check if the singular form of a word is in the dictionary (e.g., properties -> property -> english word). How would I achieve that?
Use a set to store the word list, because lookups will be faster:

    with open("english_words.txt") as word_file:
        english_words = set(word.strip().lower() for word in word_file)

    def is_english_word(word):
        return word.lower() in english_words

    print(is_english_word("ham"))  # should be True if you have a good english_words.txt
To answer the second part of the question, the plurals would already be in a good word list, but if you wanted to specifically exclude those from the list for some reason, you could indeed write a function to handle it. But English pluralization rules are tricky enough that I’d just include the plurals in the word list to begin with.
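As a rough illustration of the fallback the question asks about, here is a minimal sketch that checks the word itself and then a few naive singular candidates. The suffix rules, the function names, and the tiny `sample_words` set are assumptions made up for this example. Real English plurals have many exceptions (mice, criteria), which is why including plurals in the word list remains the more reliable route.

```python
def singular_candidates(word):
    """Yield naive singular guesses for a (possibly plural) lowercase word."""
    if word.endswith("ies") and len(word) > 3:
        yield word[:-3] + "y"      # properties -> property
    if word.endswith("es") and len(word) > 2:
        yield word[:-2]            # boxes -> box
    if word.endswith("s") and len(word) > 1:
        yield word[:-1]            # hams -> ham


def is_english_word_or_plural(word, english_words):
    """Return True if the word, or a naive singular form of it, is in the set."""
    word = word.lower()
    if word in english_words:
        return True
    return any(c in english_words for c in singular_candidates(word))


# Hypothetical stand-in for a real word list loaded from a file.
sample_words = {"property", "box", "ham"}

print(is_english_word_or_plural("properties", sample_words))  # True
print(is_english_word_or_plural("boxes", sample_words))       # True
print(is_english_word_or_plural("qwertys", sample_words))     # False
```

Because the candidates are only guesses, this can accept non-words whose stripped form happens to be a word, so treat it as a heuristic.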
As to where to find English word lists, I found several just by Googling “English word list”. Here is one: http://www.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt You could Google for British or American English if you want specifically one of those dialects.
    >>> import enchant
    >>> d = enchant.Dict("en_US")
    >>> d.check("Hello")
    True
    >>> d.check("Helo")
    False
    >>> d.suggest("Helo")
    ['He lo', 'He-lo', 'Hello', 'Helot', 'Help', 'Halo', 'Hell', 'Held', 'Helm', 'Hero', "He'll"]
    >>>
PyEnchant comes with a few dictionaries (en_GB, en_US, de_DE, fr_FR), but can use any of the OpenOffice ones if you want more languages.
There appears to be a pluralisation library called inflect, but I’ve no idea whether it’s any good.
For a semantic web approach, you could run a SPARQL query against WordNet in RDF format. Basically, use the urllib module to issue a GET request that returns results in JSON format, then parse them with Python’s json module. If it’s not an English word, you’ll get no results.
As another idea, you could query Wiktionary’s API.
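A minimal sketch of the Wiktionary idea, assuming the standard MediaWiki `action=query` endpoint and its default JSON layout, where a nonexistent page carries a `missing` key. The function names and the User-Agent string are made up for this example. Note that Wiktionary has entries for words in many languages, so an existing page does not by itself prove the word is English.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wiktionary.org/w/api.php"


def build_query_url(word):
    """Build a MediaWiki 'query' URL asking about the page titled `word`."""
    params = {"action": "query", "titles": word, "format": "json"}
    return API_URL + "?" + urllib.parse.urlencode(params)


def page_exists(response):
    """Return True if the parsed API response contains a real (non-missing) page."""
    pages = response.get("query", {}).get("pages", {})
    return any(
        "missing" not in page and "invalid" not in page
        for page in pages.values()
    )


def has_wiktionary_entry(word):
    """Fetch the API response over the network and check whether a page exists."""
    request = urllib.request.Request(
        build_query_url(word),
        headers={"User-Agent": "word-check-example/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as reply:
        return page_exists(json.load(reply))
```

Calling `has_wiktionary_entry("hello")` makes one HTTP request per word, so this is far slower than a local set lookup and is better suited to occasional checks.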
    from nltk.corpus import wordnet

    if not wordnet.synsets(word_to_test):
        print("Not an English word")
    else:
        print("English word")
You should refer to this article if you have trouble installing wordnet or want to try other approaches.
It won’t work well with WordNet, because WordNet does not contain all English words.
Another possibility based on NLTK without enchant is NLTK’s words corpus:

    >>> from nltk.corpus import words
    >>> "would" in words.words()
    True
    >>> "could" in words.words()
    True
    >>> "should" in words.words()
    True
    >>> "I" in words.words()
    True
    >>> "you" in words.words()
    True
For a faster NLTK-based solution you could hash the set of words to avoid a linear search.
    from nltk.corpus import words as nltk_words

    # Build the lookup table once, outside the function.
    dictionary = set(nltk_words.words())

    def is_english_word(word):
        return word in dictionary
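To make the speed difference concrete, here is a small timing sketch using only the standard library. The synthetic `word_list` is an assumption standing in for `words.words()`, so the script runs without downloading the corpus. Absolute timings will vary by machine.

```python
import timeit

# Synthetic stand-in for nltk's words.words(); the real corpus needs a download.
word_list = ["word{}".format(i) for i in range(200_000)]
word_set = set(word_list)

probe = "word199999"  # worst case for the list: the last element

# A list membership test scans elements one by one; a set uses hashing.
list_time = timeit.timeit(lambda: probe in word_list, number=20)
set_time = timeit.timeit(lambda: probe in word_set, number=20)

print("list lookup: {:.6f} s".format(list_time))
print("set lookup:  {:.6f} s".format(set_time))
```

The set costs some memory and a one-time build, which is why it should be created once rather than on every call.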
With pyEnchant.checker SpellChecker:
    from enchant.checker import SpellChecker

    def is_in_english(quote):
        d = SpellChecker("en_US")
        d.set_text(quote)
        errors = [err.word for err in d]
        # Reject text with more than 4 misspellings or fewer than 3 words.
        return not (len(errors) > 4 or len(quote.split()) < 3)

    print(is_in_english('“办理美国加州州立大学圣贝纳迪诺分校高仿成绩单Q/V2166384296加州州立大学圣贝纳迪诺分校学历学位认证'))
    print(is_in_english("“Two things are infinite: the universe and human stupidity; and I'm not sure about the universe.”"))
    # > False
    # > True
I find that there are three package-based solutions to the problem: pyenchant, wordnet, and a corpus (self-defined or from nltk). Pyenchant couldn’t be installed easily on win64 with py3. Wordnet doesn’t work very well because its corpus isn’t complete. So I chose the solution answered by @Sadik and use ‘set(words.words())’ to speed it up.
    pip3 install nltk
    python3
    >>> import nltk
    >>> nltk.download('words')

    from nltk.corpus import words
    setofwords = set(words.words())
    print("hello" in setofwords)
    # >> True
If your OS uses the Linux kernel, there is a simple way to get all the words from the English/American dictionary. In the directory /usr/share/dict you have a words file. There is also a more specific british-english file. These contain all of the words in that specific language. You can read them from any programming language, which is why I thought you might want to know about this.
Now, for Python users, the code below should load every single word into words (a set, so lookups are fast):

    import re

    with open("/usr/share/dict/words", "r") as file:
        # Replace every non-word character with a space, then split on whitespace.
        words = set(re.sub(r"[^\w]", " ", file.read().lower()).split())

    def is_word(word):
        return word.lower() in words

    is_word("tarts")  ## Returns True
    is_word("jwiefjiojrfiorj")  ## Returns False
Hope this helps!
Use nltk.corpus instead of enchant. Enchant gives ambiguous results. For example, enchant returns True for both benchmark and bench-mark. It is supposed to return False for benchmark.
you can see this page :
I recommend the
Download this txt file: https://raw.githubusercontent.com/dwyl/english-words/master/words_alpha.txt
Then create a set out of it using the following Python code snippet, which loads about 370k alphabetic-only English words:
    >>> with open("/PATH/TO/words_alpha.txt") as f:
    ...     words = set(f.read().split('\n'))
    ...
    >>> len(words)
    370106
From here onwards, you can check for existence in constant time using:

    >>> word_to_check = 'baboon'
    >>> word_to_check in words
    True
Note that this set might not be comprehensive, but it still gets the job done. You should run your own quality checks to make sure it works for your use case.
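One possible shape for such a quality check is sketched below. The function name, the chosen counters, and the `sample` set are assumptions for illustration. With a real file, you would pass in the set you loaded. Splitting a file on newlines, for instance, usually leaves one empty string in the set when the file ends with a newline, and a check like this makes that visible.

```python
def summarize_word_list(words):
    """Return simple counts that help spot problems in a loaded word set."""
    return {
        "total": len(words),
        "empty": sum(1 for w in words if not w),
        "non_alphabetic": sum(1 for w in words if w and not w.isalpha()),
        "not_lowercase": sum(1 for w in words if w.isalpha() and w != w.lower()),
    }


# Hypothetical sample standing in for a real word set.
sample = {"baboon", "ham", "", "e-mail", "Paris"}
report = summarize_word_list(sample)
print(report)
# {'total': 5, 'empty': 1, 'non_alphabetic': 1, 'not_lowercase': 1}
```

Nonzero counts are not necessarily errors. They simply tell you whether lookups need normalisation, such as lowercasing the input or stripping punctuation, before testing membership.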
For my Wordle Solver, I am using this corpus of 113809 words as the source: http://www.instructables.com/files/orig/FLU/YE8L/H82UHPR8/FLUYE8LH82UHPR8.txt