Language detection for short strings in a user-generated content context

Question

I have a question about language detection for short strings. I need to detect the language of messages sent in a chat, and I face two problems:

the length of the messages
the errors and noise they may contain (emoji, etc.)

Cleaning the message handles the noise well, but the message length remains a problem. For example, if a user says "hi", fastText detects the text as German, whereas Google Translate detects it as English, and the message is most likely English. So I am trying to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train the model on dictionaries from many languages to get better results?

I use fastText because it is the most accurate language detector. Here is an example of the problem with fastText:

# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

import fasttext

text = "Hi"

pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)

predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))

Asked By: Jourdelune


Answer 1

In my experience, common approaches based on fastText or other classifiers struggle with short texts.

You could try lingua, a language detection library that is available for Python, Java, Go, and Rust.

Among its strengths:

…yields pretty accurate results on both long and short text, even on
single words and phrases.

It draws on both rule-based and statistical methods but does not use any dictionaries of words.

It does not need a connection to any external API or service either.

It also seems that Lingua lets you restrict the set of languages to be considered.
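
Restricting the candidate languages is not specific to Lingua. As a minimal sketch, assuming the `(labels, probabilities)` output shape shown in the question's fastText example, the scores could be filtered to an allowed set and renormalized. The helper name `restrict_languages` and the allowed set are hypothetical and not part of either library.

```python
def restrict_languages(labels, probs, allowed):
    """Keep only the allowed languages and renormalize their probabilities.

    `labels` are fastText-style strings such as "__label__en";
    `allowed` is a set of bare language codes such as {"en", "fr"}.
    """
    prefix = "__label__"
    kept = {}
    for label, prob in zip(labels, probs):
        # Strip the fastText label prefix if it is present.
        lang = label[len(prefix):] if label.startswith(prefix) else label
        if lang in allowed:
            kept[lang] = float(prob)
    total = sum(kept.values())
    if total == 0:
        # None of the allowed languages appeared in the predictions.
        return {}
    return {lang: p / total for lang, p in kept.items()}


# Values copied from the question's output for "Hi".
labels = ("__label__de", "__label__en")
probs = (0.51606238, 0.31865335)

# Hypothetical assumption: the chat only expects these three languages.
scores = restrict_languages(labels, probs, allowed={"en", "fr", "es"})
print(scores)  # {'en': 1.0}
```

In practice you would request enough predictions from the model (a larger `k`) for the allowed languages to show up in the output before filtering.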

Answered By: Stefano Fiorucci – anakin87

Answer 2

I have found a way to get better results. If you sum the probabilities of each language across different detectors, such as fastText and lingua, and add a dictionary-based detection for short texts, you can get very good results (for my task, I also trained a fastText model on my own data). I made a demo of this, but the moderators did not accept it, so I can't share the link to the repo.
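
The combination described above could be sketched as follows. This is an illustration under assumptions, not the actual code from the demo: each detector's output is assumed to be already converted to a dict mapping a language code to a probability, and the word list `SHORT_WORDS`, the `dict_bonus` weight, and the three-word threshold are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical mini-dictionary of very common short words per language.
SHORT_WORDS = {
    "en": {"hi", "hello", "thanks", "yes"},
    "de": {"hallo", "danke", "ja"},
    "fr": {"salut", "bonjour", "merci", "oui"},
}


def combine(text, detector_scores, max_short_words=3, dict_bonus=1.0):
    """Sum per-language probabilities and add a dictionary bonus for short texts."""
    totals = defaultdict(float)
    for scores in detector_scores:
        for lang, prob in scores.items():
            totals[lang] += prob

    words = text.lower().split()
    # Only consult the dictionary when the message is short.
    if 0 < len(words) <= max_short_words:
        for lang, vocab in SHORT_WORDS.items():
            hits = sum(1 for w in words if w in vocab)
            if hits:
                # Bonus scaled by the fraction of words found in the dictionary.
                totals[lang] += dict_bonus * hits / len(words)

    if not totals:
        return None, {}
    best = max(totals, key=totals.get)
    return best, dict(totals)


# fastText values copied from the question's output for "Hi".
fasttext_scores = {"de": 0.51606238, "en": 0.31865335}
# Made-up values standing in for a second detector such as lingua.
second_scores = {"en": 0.6, "de": 0.4}

best, totals = combine("Hi", [fasttext_scores, second_scores])
print(best, totals)
```

With these inputs the summed score for English ends up above German, so `best` is `"en"`. The weight given to the dictionary is a tuning choice that would need to be validated on real chat data.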

Answered By: Jourdelune
