How to normalize a non-normal distribution?

Question:

[histogram of the distribution]

I have the above distribution with a mean of -0.02, a standard deviation of 0.09, and a sample size of 13905.

I am just not sure why the distribution is left-skewed given the large sample size. In the bin [-2.0, -0.5] there are only 10 samples (outliers), which explains the shape.

I am wondering whether it is possible to normalize the data to make the distribution smoother and more 'normal'. The purpose is to feed it into a model while reducing the standard error of the predictor.

Asked By: Chipmunkafy


Answers:

You have two options here: the Box-Cox transform or the Yeo-Johnson transform. The issue with the Box-Cox transform is that it applies only to positive numbers. To use it, you would have to exponentiate the data, perform the Box-Cox transform, and then take the log to get the data back to the original scale. The Box-Cox transform is available in scipy.stats.
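As a minimal sketch of that workaround (the sample values here are illustrative, not from the question): exponentiating makes every finite value strictly positive, which is what `scipy.stats.boxcox` requires.

```python
import numpy as np
from scipy.stats import boxcox

# Illustrative sample containing zeros and negatives,
# which boxcox would reject if passed directly
data = np.array([-0.357, -0.286, -0.003, 0.0, 0.014, 0.021, 0.071])

positive = np.exp(data)  # exponentiate: every finite value becomes > 0
# boxcox fits the lambda parameter by maximum likelihood when none is given
transformed, fitted_lambda = boxcox(positive)
```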

You can avoid those steps and simply use the Yeo-Johnson transform; sklearn provides an API for it:

from scipy.stats import normaltest
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.35714286, -0.28571429, -0.00257143, -0.00271429,
                 -0.00142857, 0., 0., 0., 0.00142857, 0.00285714,
                 0.00714286, 0.00714286, 0.01, 0.01428571, 0.01428571,
                 0.01428571, 0.01428571, 0.01428571, 0.01428571,
                 0.02142857, 0.07142857])

# PowerTransformer expects a 2-D array of shape (n_samples, n_features)
data = data.reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
pt.fit(data)
transformed_data = pt.transform(data)
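If the model's predictions later need to be mapped back to the original scale, PowerTransformer also provides an inverse transform. A short sketch (with illustrative data, not the array above):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

data = np.array([-0.357, -0.286, 0.0, 0.014, 0.071]).reshape(-1, 1)

pt = PowerTransformer(method='yeo-johnson')
transformed = pt.fit_transform(data)  # fit_transform combines fit and transform
recovered = pt.inverse_transform(transformed)
# recovered matches the original data up to floating-point error
```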

We have transformed our data, but we need a way to check whether we have moved in the right direction. Since our goal was to move towards a normal distribution, we will use a normality test.

k2, p = normaltest(data)
transformed_k2, transformed_p = normaltest(transformed_data)

The test returns two values, k2 and p. The value of p is of interest here.
If p is greater than some threshold (e.g. 0.001 or so), we can reject the hypothesis that the data come from a normal distribution.

In the example above, you'll see that p is greater than 0.001 while transformed_p is less than this threshold, indicating that we are moving in the right direction.

Answered By: Clock Slave

I agree with the top answer, except for the last two paragraphs, because the interpretation of normaltest's output is flipped. Those paragraphs should instead read:

"The test returns two values, k2 and p. The value of p is of interest here.
If p is less than some threshold (e.g. 0.001 or so), we reject the null hypothesis that the data come from a normal distribution.

In the example above, you'll see that p is less than 0.001 while transformed_p is greater than this threshold, indicating that we are moving in the right direction."

Source: normaltest documentation.
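To see the corrected interpretation in action, here is a small check (the sample sizes and seed are arbitrary choices): normaltest should give a tiny p for a heavily skewed sample and a much larger p for data actually drawn from a normal distribution.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=5000)       # data that really is normal
skewed_sample = rng.exponential(size=5000)  # heavily right-skewed data

_, p_normal = normaltest(normal_sample)
_, p_skewed = normaltest(skewed_sample)
# Small p => reject the null hypothesis that the data are normal.
# Expect p_skewed to be far below 0.001, and p_normal well above it.
```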

Answered By: Marcel