How to determine the language of a piece of text?


I want to get this:

Input text: "ру́сский язы́к"
Output text: "Russian" 

Input text: "中文"
Output text: "Chinese" 

Input text: "にほんご"
Output text: "Japanese" 

Input text: "العَرَبِيَّة"
Output text: "Arabic"

How can I do it in Python?

Asked By: Rita



Have you had a look at langdetect?

from langdetect import detect

lang = detect("Ein, zwei, drei, vier")

print(lang)
#output: de
Answered By: dheiberg

You can try determining the Unicode block of the characters in the input string to identify the type of language (Cyrillic for Russian, for example), and then search for language-specific symbols in the text.
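As a rough sketch of this idea using only the standard library (the function name and the first-word-of-the-Unicode-name heuristic are mine, and the heuristic is approximate; it identifies scripts, not languages):

```python
import unicodedata

def guess_script(text):
    # Count the script prefix of each alphabetic character's Unicode name,
    # e.g. 'CYRILLIC SMALL LETTER ER' -> 'CYRILLIC'.
    counts = {}
    for ch in text:
        if ch.isalpha():  # skips digits, punctuation and combining accents
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] = counts.get(script, 0) + 1
    # Return the most common script, or None if there were no letters.
    return max(counts, key=counts.get) if counts else None

print(guess_script("ру́сский язы́к"))  # CYRILLIC
print(guess_script("hello world"))    # LATIN
```

Note that a script does not map one-to-one to a language (Cyrillic covers Russian, Ukrainian, Bulgarian, ...), which is why you would then search for language-specific symbols.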

Answered By: Kerbiter

1. TextBlob. (Deprecated – Use official Google Translate API instead)

Requires NLTK package, uses Google.

from textblob import TextBlob
b = TextBlob("bonjour")
b.detect_language()
# 'fr'

pip install textblob

Note: this solution requires internet access, as TextBlob uses Google Translate’s language detector by calling its API.

2. Polyglot.

Requires numpy and some arcane libraries; unlikely to get it working on Windows. (For Windows, get appropriate versions of PyICU, Morfessor and PyCLD2 from here, then just pip install downloaded_wheel.whl.) Able to detect texts with mixed languages.

from polyglot.detect import Detector

mixed_text = u"""
China (simplified Chinese: 中国; traditional Chinese: 中國),
officially the People's Republic of China (PRC), is a sovereign state
located in East Asia.
"""
for language in Detector(mixed_text).languages:
    print(language)

# name: English     code: en       confidence:  87.0 read bytes:  1154
# name: Chinese     code: zh_Hant  confidence:   5.0 read bytes:  1755
# name: un          code: un       confidence:   0.0 read bytes:     0

pip install polyglot

To install the dependencies, run:
sudo apt-get install python-numpy libicu-dev

Note: Polyglot uses pycld2 under the hood; see its documentation for details.

3. chardet

Chardet also has a feature of detecting languages if there are character bytes in the range (127-255]:

>>> import chardet
>>> chardet.detect("Я люблю вкусные пампушки".encode('cp1251'))
{'encoding': 'windows-1251', 'confidence': 0.9637267119204621, 'language': 'Russian'}

pip install chardet

4. langdetect

Requires large portions of text. It uses a non-deterministic approach under the hood, meaning you can get different results for the same text sample. The docs say you have to use the following code to make it deterministic:

from langdetect import detect, DetectorFactory
DetectorFactory.seed = 0

pip install langdetect

5. guess_language

Can detect very short samples, using a spell checker with dictionaries.

pip install guess_language-spirit

6. langid

langid provides both a module:

import langid
langid.classify("This is a test")
# ('en', -54.41310358047485)

and a command-line tool:

$ langid <

pip install langid

7. FastText

FastText is a text classifier that can be used to recognize 176 languages with proper models for language classification. Download this model, then:

import fasttext
model = fasttext.load_model('lid.176.ftz')
print(model.predict('الشمس تشرق', k=2))  # top 2 matching languages

(('__label__ar', '__label__fa'), array([0.98124713, 0.01265871]))

pip install fasttext

8. pyCLD3

pycld3 is a neural network model for language identification. This package contains the inference code and a trained model.

import cld3

cld3.get_language("影響包含對氣候的變化以及自然資源的枯竭程度")
# LanguagePrediction(language='zh', probability=0.999969482421875, is_reliable=True, proportion=1.0)

pip install pycld3

Answered By: Rabash

There is an issue with langdetect when it is used for parallelization: it fails. spacy_langdetect is a wrapper around it that you can use for that purpose. You can use the following snippet as well:

import spacy
from spacy_langdetect import LanguageDetector

nlp = spacy.load("en")
nlp.add_pipe(LanguageDetector(), name="language_detector", last=True)
text = "This is English text Er lebt mit seinen Eltern und seiner Schwester in Berlin. Yo me divierto todos los días en el parque. Je m'appelle Angélica Summer, j'ai 12 ans et je suis canadienne."
doc = nlp(text)
# Document-level language detection. Think of it as the average language of the document!
print(doc._.language)
# Sentence-level language detection
for i, sent in enumerate(doc.sents):
    print(sent, sent._.language)
Answered By: Habib Karbasian

Pretrained Fast Text Model Worked Best For My Similar Needs

I arrived at your question with a very similar need, and found Rabash’s answer the most helpful for my specific situation.

After experimenting to find what worked best among his recommendations for my task (making sure 60,000+ text files were in English), I found that fasttext was an excellent tool.

With a little work, I had a tool that worked very fast over many files. It could easily be modified for a case like yours, because fasttext works easily over a list of lines.

My code with comments is among the answers on THIS post. I believe that you and others can easily modify this code for other specific needs.

Answered By: Thom Ives

Depending on the case, you might be interested in using one of the following methods:

Method 0: Use an API or library

Usually, there are a few problems with these libraries: some are not accurate for small texts, some languages are missing, some are slow, some require an internet connection, and some are non-free. But generally speaking, they will suit most needs.

Method 1: Language models

A language model gives us the probability of a sequence of words. This is important because it allows us to robustly detect the language of a text, even when the text contains words in other languages (e.g.: “‘Hola’ means ‘hello’ in Spanish”).

You can use N language models (one per language), to score your text. The detected language will be the language of the model that gave you the highest score.

If you want to build a simple language model for this, I’d go for 1-grams. To do this, you only need to count the number of times each word appears in a big text (e.g. a Wikipedia corpus in language “X”).

Then, the probability of a word will be its frequency divided by the total number of words analyzed (sum of all frequencies).

the 23135851162
of  13151942776
and 12997637966
to  12136980858
a   9081174698
in  8469404971
for 5933321709

=> P("'Hola' means 'hello' in spanish") = P("hola") * P("means") * P("hello") * P("in") * P("spanish")

If the text to detect is quite big, I recommend sampling N random words and then use the sum of logarithms instead of multiplications to avoid floating-point precision problems.

P(s) = 0.03 * 0.01 * 0.014 = 0.0000042
P(s) = log10(0.03) + log10(0.01) + log10(0.014) = -5.376
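A minimal sketch of this unigram scoring (the word counts below are made-up illustrations, and the function names are mine; a real model would be built from a large corpus per language):

```python
import math

# Toy unigram counts per language; in practice, count words in a large
# corpus (e.g. a Wikipedia dump) for each language.
counts = {
    "english": {"the": 23135851162, "of": 13151942776, "and": 12997637966,
                "in": 8469404971, "hello": 144000, "means": 3000000},
    "spanish": {"de": 9999518523, "la": 6277560055, "que": 4681839563,
                "en": 4569652343, "hola": 1400000, "significa": 900000},
}

def log_score(text, lang_counts):
    total = sum(lang_counts.values())
    # Sum log-probabilities (with add-one smoothing for unseen words)
    # instead of multiplying, to avoid floating-point underflow.
    return sum(math.log10((lang_counts.get(word, 0) + 1) / total)
               for word in text.lower().split())

def detect_language(text):
    # The detected language is the model that assigns the highest score.
    return max(counts, key=lambda lang: log_score(text, counts[lang]))

print(detect_language("hola means hello in spanish"))  # english
```

Even though “hola” is Spanish, the sentence as a whole scores higher under the English model, which is exactly the robustness described above.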

Method 2: Intersecting sets

An even simpler approach is to prepare N sets (one per language) with the top M most frequent words. Then intersect your text with each set. The set with the highest number of intersections will be your detected language.

spanish_set = {"de", "hola", "la", "casa",...}
english_set = {"of", "hello", "the", "house",...}
czech_set = {"z", "ahoj", "závěrky", "dům",...}

text_set = {"hola", "means", "hello", "in", "spanish"}

spanish_votes = len(text_set.intersection(spanish_set))  # 1
english_votes = len(text_set.intersection(english_set))  # 4
czech_votes = len(text_set.intersection(czech_set))  # 0

Method 3: Zip compression

This is more a curiosity than anything else, but here it goes… You can compress your text (e.g. with LZ77) and then measure the zip distance with regard to a compressed reference text in the target language. Personally, I didn’t like it because it’s slower, less accurate and less descriptive than the other methods. Nevertheless, there might be interesting applications for this method.
To read more: Language Trees and Zipping
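A minimal sketch of the zip-distance idea using the standard library’s zlib (the reference snippets are illustrative stand-ins for real per-language corpora, and the function names are mine):

```python
import zlib

# Illustrative reference texts; real references should be large corpora.
references = {
    "english": "the quick brown fox jumps over the lazy dog. " * 30,
    "spanish": "el veloz zorro marrón salta sobre el perro perezoso. " * 30,
}

def zip_distance(text, reference):
    # Extra compressed bytes needed to append `text` to the reference:
    # the more patterns the text shares with the reference language,
    # the smaller the increase.
    ref_size = len(zlib.compress(reference.encode()))
    combined_size = len(zlib.compress((reference + " " + text).encode()))
    return combined_size - ref_size

def detect_language(text):
    # The detected language is the reference with the smallest distance.
    return min(references, key=lambda lang: zip_distance(text, references[lang]))

print(detect_language("el zorro salta sobre el perro"))  # spanish
```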

Answered By: Salva Carrión

If you are looking for a library that is fast with long texts, polyglot and fasttext are doing the best job here.

I sampled 10000 documents from a collection of dirty and random HTMLs, and here are the results:

| Library    | Time       |
|------------|------------|
| polyglot   | 3.67 s     |
| fasttext   | 6.41 s     |
| cld3       | 14 s       |
| langid     | 1 min 8 s  |
| langdetect | 2 min 53 s |
| chardet    | 4 min 36 s |

I have noticed that a lot of the methods focus on short texts, probably because that is the harder problem: if you have a lot of text, it is really easy to detect languages (e.g. one could just use a dictionary!). However, this makes it difficult to find an easy and suitable method for long texts.

Answered By: toto_tico

You can use Googletrans (unofficial), a free and unlimited Google Translate API for Python.

You can make as many requests as you want; there are no limits.


$ pip install googletrans

Language detection:

>>> from googletrans import Translator
>>> t = Translator().detect("hello world!")
>>> t.lang
>>> t.confidence
Answered By: h3t1

I have tried all the libraries out there, and I concluded that pycld2 is the best one: fast and accurate.

You can install it like this:

python -m pip install -U pycld2

You can use it like this:

import pycld2 as cld2

isReliable, textBytesFound, details = cld2.detect(your_sentence)

print(isReliable, details[0][1])  # reliability (bool), lang abbrev. (en/es/de...)
Answered By: Simone

@Rabash had a good list of tools above, and @toto_tico did a nice job presenting the speed comparison.

Here’s a summary to complete the great answers above (as of 2021):

| Language ID software                    | Used by                       | Open Source / Model | Rule-based | Stats-based | Can train/tune |
|-----------------------------------------|-------------------------------|---------------------|------------|-------------|----------------|
| Google Translate Language Detection     | TextBlob (limited usage)      |                     |            |             |                |
| Guess Language (non-active development) | spirit-guess (updated rewrite)|                     |            |             | Minimally      |
| pyCLD2                                  | Polyglot                      | Somewhat            |            |             | Not sure       |
| CLD3                                    |                               |                     |            |             | Possibly       |
| langid-py                               |                               |                     |            |             | Not sure       |
| langdetect                              | SpaCy-langdetect              |                     |            |             |                |
| FastText                                | What The Lang                 |                     |            |             | Not sure       |
Answered By: alvas

Polyglot and CLD2 are among the best suggestions because they can detect multiple languages in a text. But they are not easy to install on Windows because of "building wheel" failures.

A solution that worked for me (I am using Windows 10) is installing CLD2-CFFI, so first install cld2-cffi:

pip install cld2-cffi

and then use it like this:

text_content = """ A accès aux chiens et aux frontaux qui lui ont été il peut 
consulter et modifier ses collections et exporter Cet article concerne le pays 
européen aujourd’hui appelé République française. 
Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller 
trouver votre aide dans le menu ci-dessus. 
Welcome, to this world of Data Scientist. Today is a lovely day."""

import cld2

isReliable, textBytesFound, details = cld2.detect(text_content)
print('  reliable: %s' % (isReliable != 0))
print('  textBytes: %s' % textBytesFound)
print('  details: %s' % str(details))

The output is like this:

reliable: True
textBytes: 377
details: (Detection(language_name='FRENCH', language_code='fr', percent=74, 
score=1360.0), Detection(language_name='ENGLISH', language_code='en', 
percent=25, score=1141.0), Detection(language_name='Unknown', 
language_code='un', percent=0, score=0.0))
Answered By: parvaneh shayegh

If the language you want to detect is among these…

  • arabic (ar)
  • bulgarian (bg)
  • german (de)
  • modern greek (el)
  • english (en)
  • spanish (es)
  • french (fr)
  • hindi (hi)
  • italian (it)
  • japanese (ja)
  • dutch (nl)
  • polish (pl)
  • portuguese (pt)
  • russian (ru)
  • swahili (sw)
  • thai (th)
  • turkish (tr)
  • urdu (ur)
  • vietnamese (vi)
  • chinese (zh)

…then it is relatively easy with HuggingFace libraries and models (deep learning natural language processing, if you are not familiar with it):

# Import libraries
from transformers import pipeline
# Load pipeline
classifier = pipeline("text-classification", model="papluca/xlm-roberta-base-language-detection")
# Example sentence
sentence1 = 'Ciao, come stai?'
# Get language
classifier(sentence1)

[{'label': 'it', 'score': 0.9948362112045288}]

label is the predicted language, and score is the score assigned to it: you can think of it as a confidence measure.

Some details:

The training set contains 70k samples, while the validation and test
sets 10k each. The average accuracy on the test set is 99.6%

You can find more info at the model’s page, and I suppose that you could find other models that fit your needs.

Answered By: SilentCloud

You can install the pycld2 python library with

pip install pycld2

or

python -m pip install -U pycld2

for the below code to work.

import pycld2 as cld2

isReliable, textBytesFound, details = cld2.detect(
    "а неправильный формат идентификатора дн назад"
)

print(isReliable)
# True
print(details[0])
# ('RUSSIAN', 'ru', 98, 404.0)

fr_en_Latn = """
France is the largest country in Western Europe and the third-largest in Europe as a whole.
A accès aux chiens et aux frontaux qui lui ont été il peut consulter et modifier ses collections
et exporter Cet article concerne le pays européen aujourd’hui appelé République française.
Pour d’autres usages du nom France, Pour une aide rapide et effective, veuiller trouver votre aide
dans le menu ci-dessus.
Motoring events began soon after the construction of the first successful gasoline-fueled automobiles.
The quick brown fox jumped over the lazy dog."""

isReliable, textBytesFound, details, vectors = cld2.detect(
    fr_en_Latn, returnVectors=True
)
print(vectors)
# ((0, 94, 'ENGLISH', 'en'), (94, 329, 'FRENCH', 'fr'), (423, 139, 'ENGLISH', 'en'))

The pycld2 python library is a Python binding for Compact Language Detect 2 (CLD2). You can explore the different functionality of pycld2 in its documentation.

Answered By: Léo

I like the approach offered by TextBlob for language detection. It’s quite simple and easy to implement, and uses few lines of code. Before you begin, you will need to install the textblob python library for the below code to work.

from textblob import TextBlob
text = "это компьютерный портал для гиков."
lang = TextBlob(text)
print(lang.detect_language())

On the other hand, if you have a combination of various languages in use, you might want to try pycld2, which allows accurate language detection for parts of the sentence or paragraph.

Answered By: SaaSy Monster

I would say Lingua all the way. It is much faster and more accurate than fasttext. It definitely deserves to be listed here.


poetry add lingua-language-detector


from typing import List
from lingua.language import Language
from lingua.builder import LanguageDetectorBuilder
languages: List[Language] = [Language.ENGLISH, Language.TURKISH, Language.PERSIAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()

if __name__ == "__main__":
    print(detector.detect_language_of("Ben de iyiyim. Tesekkurler.")) # Language.TURKISH
    print(detector.detect_language_of("I'm fine and you?")) # Language.ENGLISH
    print(detector.detect_language_of("حال من خوبه؟ شما چطورید؟")) # Language.PERSIAN
Answered By: pouya

The best way to determine the language of a text is to implement the following function:

from langdetect import detect

def get_language(text):

    keys =['ab', 'aa', 'af', 'ak', 'sq', 'am', 'ar', 'an', 'hy', 'as', 'av', 'ae', 'ay', 'az', 'bm', 'ba', 'eu', 'be', 'bn', 'bi', 'bs', 'br', 'bg', 'my', 'ca', 'ch', 'ce', 'ny', 'zh', 'cu', 'cv', 'kw', 'co', 'cr', 'hr', 'cs', 'da', 'dv', 'nl', 'dz', 'en', 'eo', 'et', 'ee', 'fo', 'fj', 'fi', 'fr', 'fy', 'ff', 'gd', 'gl', 'lg', 'ka', 'de', 'el', 'kl', 'gn', 'gu', 'ht', 'ha', 'he', 'hz', 'hi', 'ho', 'hu', 'is', 'io', 'ig', 'id', 'ia', 'ie', 'iu', 'ik', 'ga', 'it', 'ja', 'jv', 'kn', 'kr', 'ks', 'kk', 'km', 'ki', 'rw', 'ky', 'kv', 'kg', 'ko', 'kj', 'ku', 'lo', 'la', 'lv', 'li', 'ln', 'lt', 'lu', 'lb', 'mk', 'mg', 'ms', 'ml', 'mt', 'gv', 'mi', 'mr', 'mh', 'mn', 'na', 'nv', 'nd', 'nr', 'ng', 'ne', 'no', 'nb', 'nn', 'ii', 'oc', 'oj', 'or', 'om', 'os', 'pi', 'ps', 'fa', 'pl', 'pt', 'pa', 'qu', 'ro', 'rm', 'rn', 'ru', 'se', 'sm', 'sg', 'sa', 'sc', 'sr', 'sn', 'sd', 'si', 'sk', 'sl', 'so', 'st', 'es', 'su', 'sw', 'ss', 'sv', 'tl', 'ty', 'tg', 'ta', 'tt', 'te', 'th', 'bo', 'ti', 'to', 'ts', 'tn', 'tr', 'tk', 'tw', 'ug', 'uk', 'ur', 'uz', 've', 'vi', 'vo', 'wa', 'cy', 'wo', 'xh', 'yi', 'yo', 'za', 'zu']
    langs = ['Abkhazian', 'Afar', 'Afrikaans', 'Akan', 'Albanian', 'Amharic', 'Arabic', 'Aragonese', 'Armenian', 'Assamese', 'Avaric', 'Avestan', 'Aymara', 'Azerbaijani', 'Bambara', 'Bashkir', 'Basque', 'Belarusian', 'Bengali', 'Bislama', 'Bosnian', 'Breton', 'Bulgarian', 'Burmese', 'Catalan, Valencian', 'Chamorro', 'Chechen', 'Chichewa, Chewa, Nyanja', 'Chinese', 'Church Slavonic, Old Slavonic, Old Church Slavonic', 'Chuvash', 'Cornish', 'Corsican', 'Cree', 'Croatian', 'Czech', 'Danish', 'Divehi, Dhivehi, Maldivian', 'Dutch, Flemish', 'Dzongkha', 'English', 'Esperanto', 'Estonian', 'Ewe', 'Faroese', 'Fijian', 'Finnish', 'French', 'Western Frisian', 'Fulah', 'Gaelic, Scottish Gaelic', 'Galician', 'Ganda', 'Georgian', 'German', 'Greek, Modern (1453–)', 'Kalaallisut, Greenlandic', 'Guarani', 'Gujarati', 'Haitian, Haitian Creole', 'Hausa', 'Hebrew', 'Herero', 'Hindi', 'Hiri Motu', 'Hungarian', 'Icelandic', 'Ido', 'Igbo', 'Indonesian', 'Interlingua (International Auxiliary Language Association)', 'Interlingue, Occidental', 'Inuktitut', 'Inupiaq', 'Irish', 'Italian', 'Japanese', 'Javanese', 'Kannada', 'Kanuri', 'Kashmiri', 'Kazakh', 'Central Khmer', 'Kikuyu, Gikuyu', 'Kinyarwanda', 'Kirghiz, Kyrgyz', 'Komi', 'Kongo', 'Korean', 'Kuanyama, Kwanyama', 'Kurdish', 'Lao', 'Latin', 'Latvian', 'Limburgan, Limburger, Limburgish', 'Lingala', 'Lithuanian', 'Luba-Katanga', 'Luxembourgish, Letzeburgesch', 'Macedonian', 'Malagasy', 'Malay', 'Malayalam', 'Maltese', 'Manx', 'Maori', 'Marathi', 'Marshallese', 'Mongolian', 'Nauru', 'Navajo, Navaho', 'North Ndebele', 'South Ndebele', 'Ndonga', 'Nepali', 'Norwegian', 'Norwegian Bokmål', 'Norwegian Nynorsk', 'Sichuan Yi, Nuosu', 'Occitan', 'Ojibwa', 'Oriya', 'Oromo', 'Ossetian, Ossetic', 'Pali', 'Pashto, Pushto', 'Persian', 'Polish', 'Portuguese', 'Punjabi, Panjabi', 'Quechua', 'Romanian, Moldavian, Moldovan', 'Romansh', 'Rundi', 'Russian', 'Northern Sami', 'Samoan', 'Sango', 'Sanskrit', 'Sardinian', 'Serbian', 'Shona', 'Sindhi', 'Sinhala, Sinhalese', 'Slovak', 'Slovenian', 'Somali', 'Southern Sotho', 'Spanish, Castilian', 'Sundanese', 'Swahili', 'Swati', 'Swedish', 'Tagalog', 'Tahitian', 'Tajik', 'Tamil', 'Tatar', 'Telugu', 'Thai', 'Tibetan', 'Tigrinya', 'Tonga (Tonga Islands)', 'Tsonga', 'Tswana', 'Turkish', 'Turkmen', 'Twi', 'Uighur, Uyghur', 'Ukrainian', 'Urdu', 'Uzbek', 'Venda', 'Vietnamese', 'Volapük', 'Walloon', 'Welsh', 'Wolof', 'Xhosa', 'Yiddish', 'Yoruba', 'Zhuang, Chuang', 'Zulu']
    lang_dict = {key : lan for (key, lan) in zip(keys, langs)}
    return lang_dict[detect(text)]

Let’s try it:

>>> get_language("Ich liebe meine Frau")
'German'
Answered By: Khaled DELLAL