How to make a text classifier output a None category

Question:

I’m doing text classification for dialects. After training a classifier on 3 dialects, I tested it on my test data. Now suppose I extract a tweet from Twitter and ask the classifier to output the corresponding dialect: what if the tweet wasn’t written in any of those 3 dialects? I assume it will assign a category regardless, but that would be a false positive. Therefore, I want it to output a None category. How can I do that? Should I also provide training data with None labels?

Asked By: John Sall

Answers:

If you want to predict a new category (in this case None) with the same classifier, you have to provide training data corresponding to this category.

Another idea (discussed in more detail here: https://stats.stackexchange.com/questions/174856/semi-supervised-classification-with-unseen-classes) is to train a multi-class classifier that assigns a sentence to one of the dialects, and then train a one-class classifier for each dialect that can confirm or reject the multi-class classifier’s prediction.

An example:
Dialects A, B, C.

Multi-class classifier assigns sentence to dialect A.
One-class classifier for dialect A classifies sentence as dialect A.
Sentence belongs to dialect A.

Multi-class classifier assigns sentence to dialect A.
One-class classifier for dialect A classifies sentence as not dialect A.
Sentence belongs to unknown dialect (None).
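
The confirm-or-reject logic above can be sketched in a model-agnostic way. This is only an illustration: `predict_dialect`, `toy_multi_class`, `make_toy_one_class` and the marker words are hypothetical stand-ins, not real dialect models. In practice the multi-class callable would wrap your trained classifier, and each one-class callable could wrap something like scikit-learn's `OneClassSVM`, whose `predict` returns +1 for inliers and -1 for outliers.

```python
from typing import Callable, Dict, Optional

# Hypothetical marker words standing in for real, trained dialect models.
MARKERS = {"A": {"aye"}, "B": {"bonjour"}, "C": {"ciao"}}


def toy_multi_class(sentence: str) -> str:
    """Stand-in multi-class model: always returns one of the known dialects."""
    words = set(sentence.lower().split())
    # Picks the dialect with the most marker words; ties fall back to "A".
    return max(MARKERS, key=lambda d: len(MARKERS[d] & words))


def make_toy_one_class(dialect: str) -> Callable[[str], bool]:
    """Stand-in one-class model: True if the sentence looks like `dialect`."""
    def accepts(sentence: str) -> bool:
        return bool(MARKERS[dialect] & set(sentence.lower().split()))
    return accepts


def predict_dialect(
    sentence: str,
    multi_class: Callable[[str], str],
    one_class: Dict[str, Callable[[str], bool]],
) -> Optional[str]:
    """Return the dialect only if its one-class model confirms it, else None."""
    candidate = multi_class(sentence)
    if one_class[candidate](sentence):
        return candidate
    return None


one_class_models = {d: make_toy_one_class(d) for d in MARKERS}
known = predict_dialect("aye captain", toy_multi_class, one_class_models)
unknown = predict_dialect("hello world", toy_multi_class, one_class_models)
print(known, unknown)
```

Here the second sentence is still assigned to dialect A by the multi-class stand-in, but the one-class check for A rejects it, so the result is None.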

Answered By: Stefano Fiorucci – anakin87

There are two quick and dirty approaches that may work here, depending on your data.

1. Gather enough representative data of the ‘unknown’ class and train your model to predict unknowns.
2. Train your model on the known classes only and, at inference time, threshold the minimum logit value you would need to make a definitive classification, assigning ‘unknown’ below that threshold.

#1 can in practice be quite difficult if ‘unknown’ covers a very heterogeneous set of data, i.e. if it is really a whole other set of classes rather than a single class.

#2 can work quite well in this case if your known classes are distinctive and it’s unlikely that you will encounter data in the wild that would mimic one of your classes.

Of course, you will still want to throw as much ‘unknown’ data at your thresholded model as possible in advance to test it out.
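
A minimal sketch of approach #2, assuming your model exposes one raw logit per known class. The function name `classify_with_threshold`, the example logits and the cutoff of 2.0 are all made up for illustration; a sensible threshold has to be tuned on held-out data that includes ‘unknown’ examples.

```python
from typing import Optional, Sequence


def classify_with_threshold(
    logits: Sequence[float],
    labels: Sequence[str],
    min_logit: float,
) -> Optional[str]:
    """Return the top-scoring label, or None if its logit is below min_logit."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # Index of the highest-scoring known class.
    best = max(range(len(logits)), key=lambda i: logits[i])
    # Refuse to classify when the winning score is too low.
    if logits[best] < min_logit:
        return None
    return labels[best]


labels = ["A", "B", "C"]
# Hypothetical logits: one clear winner, one case with no strong evidence.
confident = classify_with_threshold([4.2, 0.3, -1.1], labels, min_logit=2.0)
uncertain = classify_with_threshold([0.4, 0.3, 0.1], labels, min_logit=2.0)
print(confident, uncertain)
```

The same pattern applies if you prefer to threshold a softmax probability instead of the raw logit; only the score being compared changes.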

Answered By: John Curry