NLTK and language detection

Question:

How do I detect what language a text is written in using NLTK?

The examples I’ve seen use nltk.detect, but after installing NLTK on my Mac, I cannot find this package.

Asked By: niklassaers

Answers:

Have you come across the following code snippet?

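import nltk  # the word list needs a one-time nltk.download('words')
# `text` is assumed to be an iterable of word tokens
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
text_vocab = set(w.lower() for w in text if w.lower().isalpha())
unusual = text_vocab.difference(english_vocab)  # tokens missing from the English word list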

from http://groups.google.com/group/nltk-users/browse_thread/thread/a5f52af2cbc4cfeb?pli=1&safe=active
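
The snippet above only lists tokens that are missing from an English word list. A rough way to turn the same vocabulary-overlap idea into a language guess is to score the text against a small word set per language and pick the best match. This is a minimal sketch, not NLTK functionality: `STOPWORDS` and `guess_language` are made-up names, and the tiny word sets are hand-picked for illustration (with NLTK installed, `nltk.corpus.stopwords.words(lang)` would be a more realistic source).

```python
# Tiny illustrative stopword sets (assumption: hand-picked, not taken from NLTK).
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "a", "that", "it"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "mit"},
    "spanish": {"el", "la", "los", "de", "que", "y", "es", "en", "un", "uno"},
}


def guess_language(text, stopwords=STOPWORDS):
    """Return the language whose word set overlaps most with the text, or None."""
    # Lowercase each whitespace-separated token and trim surrounding punctuation.
    tokens = {w.strip(".,;:!?\"'").lower() for w in text.split()}
    scores = {lang: len(tokens & words) for lang, words in stopwords.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means there is no evidence for any language.
    return best if scores[best] > 0 else None
```

Like the original snippet, this only works when the text contains enough common words; short or unusual inputs will produce no match or a wrong one.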

Or the following demo file?

https://web.archive.org/web/20120202055535/http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/misc/langid.py

Answered By: William Niu

Although it is not part of NLTK, I have had great results with another Python-based library:

https://github.com/saffsd/langid.py

It is very simple to import, and its model covers a large number of languages.

Answered By: burgersmoke

This library is not from NLTK either but certainly helps.

$ sudo pip install langdetect

Supported Python versions: 2.6, 2.7, 3.x.

>>> from langdetect import detect

>>> detect("War doesn't show who's right, just who's left.")
'en'
>>> detect("Ein, zwei, drei, vier")
'de'

https://pypi.python.org/pypi/langdetect

P.S.: Don’t expect this to always work correctly, especially on very short inputs:

>>> detect("today is a good day")
'so'
>>> detect("today is a good day.")
'so'
>>> detect("la vita e bella!")
'it'
>>> detect("khoobi? khoshi?")
'so'
>>> detect("wow")
'pl'
>>> detect("what a day")
'en'
>>> detect("yay!")
'so'
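
One way to soften these misfires is to refuse to answer when the input is very short or the top guess is not confident. The sketch below is an assumption-laden example, not part of langdetect: `detect_or_none`, `min_chars` and `min_prob` are made-up names and thresholds. It relies on `langdetect.detect_langs`, which returns candidates with `.lang` and `.prob` attributes, and on `DetectorFactory.seed`, which the langdetect README recommends setting because results are otherwise non-deterministic.

```python
def detect_or_none(text, detect_langs=None, min_chars=20, min_prob=0.9):
    """Return a language code, or None when the guess looks unreliable.

    `detect_langs` is any callable returning objects with `.lang` and `.prob`;
    by default langdetect's own `detect_langs` is used.
    """
    if len(text.strip()) < min_chars:
        return None  # too short to trust, e.g. "wow" or "yay!"
    if detect_langs is None:
        # Lazy import, so the helper can be exercised without langdetect.
        from langdetect import DetectorFactory, detect_langs
        DetectorFactory.seed = 0  # make repeated runs give the same result
    candidates = detect_langs(text)
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.prob)
    return best.lang if best.prob >= min_prob else None
```

The thresholds are arbitrary starting points; with the default of 20 characters, a phrase like "today is a good day" is rejected rather than guessed.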
Answered By: SVK

Super late, but you could use the TextCat classifier in NLTK (nltk.classify.textcat). It implements the n-gram-based text categorization algorithm described by Cavnar and Trenkle.

It returns an ISO 639-3 language code, so I would use pycountry to get the full name.

For example, load the libraries:

import nltk
import pycountry
from nltk.stem import SnowballStemmer

Now let’s look at two phrases, and guess their language:

phrase_one = "good morning"
phrase_two = "goeie more"

# TextCat relies on NLTK's 'crubadan' corpus: nltk.download('crubadan')
tc = nltk.classify.textcat.TextCat()
guess_one = tc.guess_language(phrase_one)
guess_two = tc.guess_language(phrase_two)

guess_one_name = pycountry.languages.get(alpha_3=guess_one).name
guess_two_name = pycountry.languages.get(alpha_3=guess_two).name
print(guess_one_name)
print(guess_two_name)

English
Afrikaans

You could then pass them into other nltk functions, for example:

stemmer = SnowballStemmer(guess_one_name.lower())
s1 = "walking"
print(stemmer.stem(s1))
walk
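
One caveat when passing the guessed name on: SnowballStemmer only supports a fixed set of languages (listed in `SnowballStemmer.languages`) and raises an error for anything else, so a wrong guess such as "Konkani (individual language)" would fail. A small guard can check first. This is a minimal sketch; `snowball_name` is a hypothetical helper, not an NLTK function.

```python
def snowball_name(language_name, supported):
    """Return the lowercase language name if it is in `supported`, else None."""
    name = language_name.strip().lower()
    return name if name in supported else None
```

With NLTK installed, `supported` would be `SnowballStemmer.languages`, and the stemmer would only be constructed when the helper returns a name.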

Disclaimer: obviously this will not always work, especially for sparse data.

Extreme example:

guess_example = tc.guess_language("hello")
print(pycountry.languages.get(alpha_3=guess_example).name)
Konkani (individual language)
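
A misfire like this is easier to understand with a sketch of the n-gram idea behind TextCat. In Cavnar and Trenkle's approach, each language gets a ranked profile of its most frequent character n-grams, and a text is assigned to the language whose ranking is "least out of place" relative to the text's own profile. The code below is a simplified illustration of that idea, not NLTK's implementation: it uses trigrams only, the function names are made up, and `SAMPLES` is a hypothetical, far-too-small training set.

```python
from collections import Counter


def ngram_profile(text, n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top))}


def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum rank differences; n-grams unknown to the language cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def guess(text, samples):
    """Return the key of `samples` whose n-gram profile is closest to the text."""
    doc = ngram_profile(text)
    return min(samples, key=lambda lang: out_of_place(doc, ngram_profile(samples[lang])))


# Hypothetical training samples, for illustration only.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y luego el perro duerme al sol",
}

print(guess("the dog is sleeping", SAMPLES))
print(guess("el perro duerme", SAMPLES))
```

A single word like "hello" yields only a handful of n-grams, so the distance to many language profiles ends up nearly identical and the winner is close to arbitrary.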
Answered By: RK1

polyglot.detect can detect the language:

from polyglot.detect import Detector

foreign = 'Este libro ha sido uno de los mejores libros que he leido.'
print(Detector(foreign).language)

name: Spanish     code: es       confidence:  98.0 read bytes:   865
Answered By: Ryan Xu