Outliers in Categorical Data?

Question:

I cannot find a way to detect outliers in categorical data. Each row of my data is a combination of categorical values, and I want to flag the rows whose combination differs from the rest.

Clustering does not work here, because a non-outlier row and an outlier row can have the same frequency.

My data looks something like this:

	c1	c2	c3	c4
row1	A	B	C	D
row2	A	B	C	D
row3	A	D	C	G
row4	NU	D	E	G
row6	NU	D	E	X

Please suggest a valid logic to solve the issue.
I also tried to distribute the data based on frequency but I’m unable to assign a threshold as I’m unable to find a value to consider the data as outliers. Providing a way to find thresholds also can help.

Asked By: JUGAL KISHORE

Answers:

There are no outlier detection methods for categorical data as such. The notion means nothing in this case. Consider this example:

You have a sample of 10 with 9 females and 1 male. You might think the male is the outlier, but that is just the composition of your sample, not an outlier.

For an outlier to exist there must be a measure of distance between the items. Have a look at this for more information.

"Please suggest a valid logic to solve the issue. I also tried to distribute the data based on frequency but I’m unable to assign a threshold as I’m unable to find a value to consider the data as outliers. Providing a way to find thresholds also can help."

One option is to call value_counts on each column, which gives you the frequency of each value.
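
As a rough sketch of that idea with pandas, using the sample data from the question: the cutoff MIN_SHARE below is purely an assumption for illustration, not a recommended value, and the names col_freq, combo_freq, rare_cells and rare_rows are made up for this example.

```python
import pandas as pd

# Sample data from the question (row labels kept as given).
df = pd.DataFrame(
    {
        "c1": ["A", "A", "A", "NU", "NU"],
        "c2": ["B", "B", "D", "D", "D"],
        "c3": ["C", "C", "C", "E", "E"],
        "c4": ["D", "D", "G", "G", "X"],
    },
    index=["row1", "row2", "row3", "row4", "row6"],
)

# Relative frequency of each value within its own column.
col_freq = {col: df[col].value_counts(normalize=True) for col in df.columns}

# Relative frequency of each full row combination.
combo_freq = df.value_counts(normalize=True)

# Hypothetical cutoff (an assumption): values with a smaller share are "rare".
MIN_SHARE = 0.25

# Mark each cell whose value is rarer than the cutoff in its column.
rare_cells = pd.DataFrame(
    {col: df[col].map(col_freq[col]) < MIN_SHARE for col in df.columns}
)

# A row is flagged if at least one of its cells is rare.
rare_rows = rare_cells.any(axis=1)
print(rare_rows)
```

With this data and this cutoff, only row6 is flagged, because X appears once in c4 (a share of 0.2). Whether that counts as an outlier still depends on the threshold you pick, which is the judgment call the question is asking about.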

Answered By: A. Traoré

According to the tags you assigned, I guess you want to perform one-hot encoding in a later step. In this case you can use sklearn's OneHotEncoder and specify the min_frequency parameter. If you specify min_frequency, rare categorical values are grouped into a single category named 'infrequent_sklearn'.

Answered By: Jan