Language detection for short strings in a user-generated content context

Question:

I have a question about language detection for short strings. I need to detect the language of text sent in a chat, and I am facing two problems:

  • the length of the message
  • the errors and noise it may contain (emoji, etc.)

For the noise, I clean the message and that works fine, but the length of the message is a problem. For example, if a user says "hi", fastText detects the language as German, but Google Translate detects it as English, and it is most likely an English message. So I tried to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train the model with dictionaries from many languages to get better results?

I use fastText because it's the most accurate language detector. Here is an example of the problem with fastText:

# Download the pretrained model first:
# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin

import fasttext

text = "Hi"

# Load the pretrained language identification model
pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)

# Ask for the two most likely languages and their probabilities
predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))
Asked By: Jourdelune

||

Answers:

In my experience, common approaches based on fastText or other classifiers struggle with short texts.

You could try lingua, a language detection library that is available for Python, Java, Go, and Rust.

Among its strengths:

…yields pretty accurate results on both long and short text, even on
single words and phrases.

It draws on both rule-based and statistical methods but does not use any dictionaries of words.

It does not need a connection to any external API or service either.

It also seems that in Lingua you can restrict the set of languages to be considered.

I have found a way to get better results. If you sum the probabilities for each language across different detectors, such as fastText and lingua, and add a dictionary-based detection for short texts, you can get very good results (for my task, I also trained a fastText model on my own data). I made a demo of this, but the moderators did not accept it, so I can't post the link to the repo.
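
This is not the demo mentioned above, only a rough sketch of the idea under some assumptions: each detector's output has already been converted to a plain dict of language code to probability, the tiny word lists in `SHORT_WORDS` stand in for real dictionaries, and the three-token threshold for "short text" is an arbitrary choice. All names here are hypothetical.

```python
from collections import defaultdict

# Hypothetical tiny word lists; a real setup would load full dictionaries per language.
SHORT_WORDS = {
    "en": {"hi", "hello", "thanks"},
    "de": {"hallo", "danke"},
    "fr": {"salut", "merci"},
}


def dictionary_scores(text):
    """Score 1.0 for each language whose word list contains every token of the text."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    return {
        lang: 1.0
        for lang, words in SHORT_WORDS.items()
        if all(token in words for token in tokens)
    }


def detect(text, detector_outputs, short_text_max_tokens=3):
    """Sum per-language probabilities over all detectors and return the best language.

    detector_outputs is a list of dicts mapping a language code to a probability,
    one dict per detector. For short texts, the dictionary lookup is added as one
    more "detector". Returns None when no detector produced any score.
    """
    score_dicts = list(detector_outputs)
    if len(text.split()) <= short_text_max_tokens:
        score_dicts.append(dictionary_scores(text))

    totals = defaultdict(float)
    for scores in score_dicts:
        for lang, probability in scores.items():
            totals[lang] += probability

    if not totals:
        return None
    return max(totals, key=totals.get)


# fastText probabilities for "Hi", taken from the question.
fasttext_scores = {"de": 0.51606238, "en": 0.31865335}
print(detect("Hi", [fasttext_scores]))
```

With the fastText output from the question, the dictionary match on "hi" adds 1.0 to English, so English ends up with the highest total even though fastText alone prefers German.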

Answered By: Jourdelune