Language detection for short strings in a user-generated content context
Question:
I have a question about language detection for short strings. I need to detect the language of text sent in a chat, and I face two problems:
- the length of the message
- the errors it may contain and the noise (emoji, etc.)
For the noise, I clean the message and that works fine, but the length of the message is a problem. For example, if a user says "hi", fastText detects the text as German, but Google Translate detects it as English, and it is most likely an English message. So I am trying to train my own fastText model, but how can I adjust the model to get better results on short strings? Do I need to train the model on dictionaries from many languages to get better results?
I use fastText because it's the most accurate language detector. Here is an example of the problem with fastText:
# wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
import fasttext
text = "Hi"
pretrained_lang_model = "lid.176.bin"
model = fasttext.load_model(pretrained_lang_model)
predictions = model.predict(text, k=2)
print(predictions)
# (('__label__de', '__label__en'), array([0.51606238, 0.31865335]))
Answers:
In my experience, common approaches based on fastText or other classifiers struggle with short texts.
You could try lingua, a language detection library that is available for Python, Java, Go, and Rust.
Among its strengths:
…yields pretty accurate results on both long and short text, even on
single words and phrases.
It draws on both rule-based and statistical methods but does not use any dictionaries of words.
It does not need a connection to any external API or service either.
It also seems that in Lingua you can restrict the set of languages to be considered.
I have found a way to get better results. If you sum the probabilities for each language across different detectors such as fastText and lingua, and add a dictionary-based detection for short texts, you can get very good results (for my task, I also trained a fastText model on my own data). I made a demo for this, but the moderators did not accept it, so I can't share the link to the repo.
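Since the repo can't be linked, here is a minimal sketch of the idea, not the actual demo code. It assumes each detector's output has already been converted to a dict mapping a language code to a probability. The helper names (`fasttext_to_scores`, `dictionary_scores`, `combine`, `detect`), the tiny `SHORT_WORDS` lists, and the two-token threshold for "short" are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical word lists for very short messages (illustration only);
# a real system would load much larger per-language dictionaries.
SHORT_WORDS = {
    "en": {"hi", "hello", "thanks", "yes"},
    "de": {"hallo", "danke", "ja"},
    "fr": {"salut", "merci", "oui"},
}


def fasttext_to_scores(labels, probs):
    """Convert fastText-style output ('__label__xx' labels plus probabilities)
    into a {language_code: probability} dict."""
    return {
        label.replace("__label__", "", 1): float(p)
        for label, p in zip(labels, probs)
    }


def dictionary_scores(text):
    """Give a uniform score to every language whose word list contains
    all tokens of the message; return an empty dict when nothing matches."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    hits = [
        lang
        for lang, words in SHORT_WORDS.items()
        if all(token in words for token in tokens)
    ]
    return {lang: 1.0 / len(hits) for lang in hits}


def combine(score_dicts):
    """Sum the per-language probabilities over all detectors."""
    totals = defaultdict(float)
    for scores in score_dicts:
        for lang, p in scores.items():
            totals[lang] += p
    return dict(totals)


def detect(text, detector_scores, max_short_tokens=2):
    """Pick the language with the highest summed score. For short messages,
    the dictionary lookup is added as one more 'detector'."""
    all_scores = list(detector_scores)
    if len(text.split()) <= max_short_tokens:
        all_scores.append(dictionary_scores(text))
    totals = combine(all_scores)
    if not totals:
        return None
    return max(totals, key=totals.get)


# Scores taken from the fastText output shown in the question.
fasttext_scores = fasttext_to_scores(
    ("__label__de", "__label__en"), [0.51606238, 0.31865335]
)
print(detect("Hi", [fasttext_scores]))
```

With only the fastText scores from the question, the dictionary hit for "hi" is enough to tip the sum toward English. In practice you would pass one dict per detector (fastText, lingua, your own model) and probably weight them rather than sum them equally.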