Reducing noise on Data

Question:

I have 2 lists with data points in them.

x = ["bunch of data points"]
y = ["bunch of data points"]

I’ve generated a graph using matplotlib in python

import matplotlib.pyplot as plt

plt.plot(x, y, linewidth=2, linestyle="-", c="b")
plt.show()
plt.close()

Would I be able to reduce the noise on the data? Would a Kalman filter work here?


Asked By: PiccolMan


Answers:

It depends on how you define the "noise" and how it is caused. Since you didn't provide much information about your case, I'll take your question as "how to make the curve smooth". A Kalman filter can do this, but it's more complex than needed here; I'd prefer a simple IIR filter:

import matplotlib.pyplot as plt
import numpy as np

mu, sigma = 0, 500

x = np.arange(1, 100, 0.1)  # x axis
z = np.random.normal(mu, sigma, len(x))  # noise
y = x ** 2 + z  # data
plt.plot(x, y, linewidth=2, linestyle="-", c="b")  # it includes some noise


After filter

from scipy.signal import lfilter

n = 15  # the larger n is, the smoother the curve will be
b = [1.0 / n] * n
a = 1
yy = lfilter(b, a, y)
plt.plot(x, yy, linewidth=2, linestyle="-", c="b")  # smooth by filter


lfilter is a function from scipy.signal.
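One caveat: lfilter is a causal filter, so the smoothed curve lags the data by roughly (n - 1) / 2 samples. If that shift matters, scipy.signal.filtfilt runs the same filter forward and backward for a zero-phase result. A sketch on a clean ramp to make the lag visible (variable names are mine):

```python
import numpy as np
from scipy.signal import lfilter, filtfilt

n = 15
b = [1.0 / n] * n
a = 1

x = np.arange(1, 100, 0.1)
y = x ** 2  # clean signal, so any shift is easy to see

y_lf = lfilter(b, a, y)    # causal: lags the input by ~(n - 1) / 2 samples
y_ff = filtfilt(b, a, y)   # zero-phase: filtered forward and backward, no lag
```

Away from the edges, the filtfilt output sits much closer to the original signal than the lagged lfilter output.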

By the way, if you do want to use a Kalman filter for smoothing, scipy also provides an example. A Kalman filter would work in this case too; it's just not necessary.
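For reference, a minimal one-dimensional Kalman filter (nearly-constant-state model, plain NumPy) can be sketched as follows; the function name and the default values of `q` (process variance) and `r` (measurement variance) are my own choices, not from any library:

```python
import numpy as np

def kalman_1d(z, q=1e-4, r=1.0):
    """Minimal 1-D Kalman filter assuming a (nearly) constant state.
    z: noisy measurements, q: process variance, r: measurement variance."""
    xhat = np.zeros(len(z))  # state estimates
    p = 1.0                  # estimate variance
    xhat[0] = z[0]
    for k in range(1, len(z)):
        p_pred = p + q                     # predict: state carries over, variance grows
        K = p_pred / (p_pred + r)          # Kalman gain
        xhat[k] = xhat[k - 1] + K * (z[k] - xhat[k - 1])  # update with measurement
        p = (1 - K) * p_pred
    return xhat

rng = np.random.default_rng(0)
z = 5.0 + rng.normal(0.0, 1.0, 500)  # noisy measurements of a constant value
estimate = kalman_1d(z)
```

Tuning q relative to r controls the smoothing strength, much like the window size does for the moving-average filter above.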

Answered By: Lyken

Depending on how much noise you would like to remove, you can also use the Savitzky-Golay filter from scipy.

The following takes the example from @lyken-syu:

import matplotlib.pyplot as plt
import numpy as np
mu, sigma = 0, 500
x = np.arange(1, 100, 0.1)  # x axis
z = np.random.normal(mu, sigma, len(x))  # noise
y = x ** 2 + z  # data
plt.plot(x, y, linewidth=2, linestyle="-", c="b")  # it includes some noise


and applies the Savitzky-Golay filter

from scipy.signal import savgol_filter
w = savgol_filter(y, 101, 2)
plt.plot(x, w, 'b')  # high frequency noise removed

With window_length = 101, the high-frequency noise is removed. Increasing window_length to 501 smooths the curve further.
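The two window lengths can be compared side by side; with polyorder 2, the quadratic trend is preserved either way, the wider window just averages more of the noise (the variable names below are mine):

```python
import numpy as np
from scipy.signal import savgol_filter

np.random.seed(0)  # fixed seed for a reproducible comparison
x = np.arange(1, 100, 0.1)
y = x ** 2 + np.random.normal(0, 500, len(x))

w_narrow = savgol_filter(y, 101, 2)  # follows the data more closely
w_wide = savgol_filter(y, 501, 2)    # smoother, but can blur narrow features
```

window_length must be odd and no larger than the data length; here both results stay centred on the x ** 2 trend, with the wide window leaving less residual noise.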

Read more about the filter here

Answered By: U3.1415926

If you are dealing with time series, I suggest tsmoothie: a Python library for time-series smoothing and outlier detection in a vectorized way.

It provides different smoothing algorithms together with the possibility to compute intervals.

Here I use a ConvolutionSmoother, but you can also test the others (a KalmanSmoother is also available).

import numpy as np
import matplotlib.pyplot as plt
from tsmoothie.smoother import ConvolutionSmoother

mu, sigma = 0, 500
x = np.arange(1, 100, 0.1)  # x axis
z = np.random.normal(mu, sigma, len(x))  # noise
y = x ** 2 + z  # data

# operate smoothing
smoother = ConvolutionSmoother(window_len=30, window_type='ones')
smoother.smooth(y)

# generate intervals
low, up = smoother.get_intervals('sigma_interval', n_sigma=3)

# plot the smoothed timeseries with intervals
plt.figure(figsize=(11,6))
plt.plot(smoother.data[0], color='orange')
plt.plot(smoother.smooth_data[0], linewidth=3, color='blue')
plt.fill_between(range(len(smoother.data[0])), low[0], up[0], alpha=0.3)


Note also that tsmoothie can smooth multiple time series in a vectorized way.
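The vectorized idea, smoothing every row of a 2-D array in a single call, can also be sketched without tsmoothie, using scipy.ndimage.uniform_filter1d (a moving average along one axis):

```python
import numpy as np
from scipy.ndimage import uniform_filter1d

rng = np.random.default_rng(0)
series = rng.normal(0.0, 1.0, size=(5, 1000))  # 5 noisy series, one per row

# one call smooths all rows at once: moving average of width 30 along axis 1
smoothed = uniform_filter1d(series, size=30, axis=1, mode='nearest')
```

No Python-level loop over series is needed; the filter is applied along axis 1 for every row simultaneously.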

Answered By: Marco Cerliani

Depending on your end use, it may be worthwhile considering LOWESS (Locally Weighted Scatterplot Smoothing) to remove noise. I’ve used it successfully with repeated measures datasets.

More information on local regression methods, including LOWESS and LOESS, here.

Using the example data from @lyken-syu for consistency with other answers:

import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0, 500
x = np.arange(1, 100, 0.1)  # x axis
z = np.random.normal(mu, sigma, len(x))  # noise
y = x ** 2 + z  # signal + noise

plt.plot(x, y, linewidth = 2, linestyle = "-", c = "b")  # includes some noise
plt.show()


Here is how to apply the LOWESS technique using the statsmodels implementation:

import statsmodels.api as sm

y_lowess = sm.nonparametric.lowess(y, x, frac = 0.3)  # 30 % lowess smoothing

plt.plot(y_lowess[:, 0], y_lowess[:, 1], 'b')  # some noise removed
plt.show()


It may be necessary to vary the frac parameter, which is the fraction of the data used when estimating each y value. Increase the frac value to increase the amount of smoothing. The frac value must be between 0 and 1.

Further details on statsmodels lowess usage.


Sometimes a simple rolling mean may be all that is necessary.

For example, using pandas with a window size of 30:

import pandas as pd

df = pd.DataFrame(y, index=x)
df_mva = df.rolling(30).mean()  # moving average with a window size of 30

df_mva.plot(legend = False);


You will probably have to try several window sizes with your data.
Note that the first 29 values of df_mva will be NaN (a full window of 30 points is needed before the first mean can be computed), but these can be removed with the dropna method.
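The NaN handling can be sketched two ways: drop the warm-up rows afterwards, or let pandas average whatever points are available via min_periods (the variable names are mine):

```python
import numpy as np
import pandas as pd

y = np.random.normal(0, 1, 200)
df = pd.DataFrame(y)

df_mva = df.rolling(30).mean()       # first 29 rows are NaN (window needs 30 points)
df_clean = df_mva.dropna()           # option 1: discard the warm-up rows
df_partial = df.rolling(30, min_periods=1).mean()  # option 2: no NaNs at all
```

With min_periods=1 the early values are averages over fewer points, so they are noisier than the rest of the smoothed series.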

Usage details for the pandas rolling function.


Finally, interpolation can be used for noise reduction through smoothing.

Here is an example of radial basis function interpolation from scipy:

from scipy.interpolate import Rbf

rbf = Rbf(x, y, function = 'quintic', smooth = 10)

xnew = np.linspace(x.min(), x.max(), num = 100, endpoint = True)
ynew = rbf(xnew)

plt.plot(xnew, ynew)
plt.show()


A smoother approximation can be achieved by increasing the smooth parameter. Alternative function parameters to consider include 'cubic' and 'thin_plate'. I usually try 'thin_plate' first, followed by 'cubic'; here, 'thin_plate' gave good results but required a very high smooth value with this dataset, and 'cubic' seemed to struggle with the noise.

Check other Rbf options in the scipy docs. Scipy provides other univariate and multivariate interpolation techniques (see this tutorial).
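Note that Rbf is scipy's legacy interface; scipy 1.7+ recommends scipy.interpolate.RBFInterpolator instead, which takes its points as an (n_points, n_dims) array. A rough equivalent sketch (the smoothing value here is an untuned guess of mine, not a recommended setting):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
x = np.arange(1, 100, 0.1)
y = x ** 2 + rng.normal(0, 500, len(x))

# points must be 2-D: (n_points, n_dims); smoothing > 0 gives approximation
rbf = RBFInterpolator(x[:, None], y, kernel='thin_plate_spline', smoothing=1e3)

xnew = np.linspace(x.min(), x.max(), 100)
ynew = rbf(xnew[:, None])
```

As with Rbf, increasing smoothing trades fidelity to the data points for a smoother curve.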


Both LOWESS and rolling mean methods will give better results if your data is sampled at a regular interval.

Radial basis function interpolation may be overkill for this dataset, but it’s definitely worth your attention if your data is higher dimensional and/or not sampled on a regular grid.

Care must be taken with all these methods; it’s easy to remove too much noise and distort the underlying signal.

Answered By: makeyourownmaker