NumPy: calculate averages with NaNs removed
Question:
How can I calculate mean values along an axis of a matrix while excluding NaN values from the calculation? (For R people, think na.rm = TRUE.)
Here is my [non-]working example:
import numpy as np
dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1)) # [ 2. nan nan nan]
With NaNs removed, my expected output would be:
array([ 2., 4.5, 6., nan])
Answers:
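Assuming you’ve also got SciPy installed, there is scipy.stats.nanmean (since removed from SciPy in favour of numpy.nanmean, so this only applies to old releases):
http://www.scipy.org/doc/api_docs/SciPy.stats.stats.html#nanmean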
I think what you want is a masked array:
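import numpy as np
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
mdat = np.ma.masked_array(dat, np.isnan(dat))  # mask every NaN entry
mm = np.mean(mdat, axis=1)  # mean over the unmasked values in each row
print(mm.filled(np.nan))  # the desired answer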
Edit: Combining all of the timing data
from timeit import Timer
setupstr="""
import numpy as np
try:
    from scipy.stats import nanmean  # only available in old SciPy releases
except ImportError:
    from numpy import nanmean  # NumPy >= 1.8
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
"""
method1="""
mdat = np.ma.masked_array(dat,np.isnan(dat))
mm = np.mean(mdat,axis=1)
mm.filled(np.nan)
"""
N = 2
t1 = Timer(method1, setupstr).timeit(N)
t2 = Timer("[np.mean([l for l in d if not np.isnan(l)]) for d in dat]", setupstr).timeit(N)
t3 = Timer("np.array([r[np.isfinite(r)].mean() for r in dat])", setupstr).timeit(N)
t4 = Timer("np.ma.masked_invalid(dat).mean(axis=1)", setupstr).timeit(N)
t5 = Timer("nanmean(dat,axis=1)", setupstr).timeit(N)
print('Time: %f\tRatio: %f' % (t1, t1/t1))
print('Time: %f\tRatio: %f' % (t2, t2/t1))
print('Time: %f\tRatio: %f' % (t3, t3/t1))
print('Time: %f\tRatio: %f' % (t4, t4/t1))
print('Time: %f\tRatio: %f' % (t5, t5/t1))
Returns:
Time: 0.045454 Ratio: 1.000000
Time: 8.179479 Ratio: 179.950595
Time: 0.060988 Ratio: 1.341755
Time: 0.070955 Ratio: 1.561029
Time: 0.065152 Ratio: 1.433364
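Before comparing speed it may be worth confirming that the vectorised approaches agree with each other. The following is a minimal sketch for Python 3 and a recent NumPy, not part of the original timing run: the fixed seed, the `results` dictionary, and the use of `np.nanmean` in place of the SciPy function are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for reproducibility
dat = rng.normal(size=(1000, 1000))
# Scatter NaNs over a block of rows/columns, mirroring the setup string above
ii = np.ix_(rng.integers(0, 99, size=50), rng.integers(0, 99, size=50))
dat[ii] = np.nan

# Row means with NaNs ignored, computed four different ways
results = {
    "masked_array": np.ma.masked_array(dat, np.isnan(dat)).mean(axis=1).filled(np.nan),
    "row_isfinite": np.array([r[np.isfinite(r)].mean() for r in dat]),
    "masked_invalid": np.ma.masked_invalid(dat).mean(axis=1).filled(np.nan),
    "np.nanmean": np.nanmean(dat, axis=1),
}

reference = results["np.nanmean"]
all_agree = all(np.allclose(v, reference) for v in results.values())
print(all_agree)
```

No row in this setup is entirely NaN, so every method returns a finite value per row and a plain `np.allclose` comparison is enough.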
If performance matters, you should use bottleneck.nanmean() instead.
A masked array with the NaNs filtered out can also be created on the fly:
print(np.ma.masked_invalid(dat).mean(1))
You can always find a workaround in something like:
numpy.nansum(dat, axis=1) / numpy.sum(numpy.isfinite(dat), axis=1)
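Applied to the array from the question, this workaround can be sketched as follows. The names `valid`, `counts`, and `row_means` are illustrative, and the sketch counts entries with `~np.isnan` rather than `np.isfinite`, so infinities are counted consistently with `nansum`. A recent NumPy is assumed, where `nansum` of an all-NaN row is 0.

```python
import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

valid = ~np.isnan(dat)        # True where a value is present
counts = valid.sum(axis=1)    # number of non-NaN entries per row

# The all-NaN row gives 0 / 0, which is nan; silence the warning for that case
with np.errstate(invalid="ignore", divide="ignore"):
    row_means = np.nansum(dat, axis=1) / counts

print(row_means)
```

The last row has no valid entries, so its mean comes out as nan, which matches the expected output in the question.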
A skipna option for numpy.mean was once proposed for a future NumPy release, but numpy.mean has no such option; use numpy.nanmean instead.
This is built upon the solution suggested by JoshAdel.
Define the following function:
import numpy
def nanmean(data, **args):
    # mask the NaNs, average what is left, and turn fully masked results back into NaN
    return numpy.ma.filled(numpy.ma.masked_array(data, numpy.isnan(data)).mean(**args), fill_value=numpy.nan)
Example use:
data = [[0, 1, numpy.nan], [8, 5, 1]]
data = numpy.array(data)
print(data)
print(nanmean(data))
print(nanmean(data, axis=0))
print(nanmean(data, axis=1))
Will print out:
[[ 0. 1. nan]
[ 8. 5. 1.]]
3.0
[ 4. 3. 1.]
[ 0.5 4.66666667]
Alternatively, you can use laxarray, freshly uploaded, which is, among other things, a wrapper for masked arrays.
import laxarray as la
la.array(dat).mean(axis=1)
Following JoshAdel’s protocol, I get:
Time: 0.048791 Ratio: 1.000000
Time: 0.062242 Ratio: 1.275689 # laxarray's one-liner
So laxarray is marginally slower (I would need to check why; it may be fixable), but it is much easier to use and allows labelling dimensions with strings.
Check it out: https://github.com/perrette/laxarray
EDIT: I have checked with another module, “la”, larry, which beats all tests:
import la
la.larry(dat).mean(axis=1)
By hand, Time: 0.049013 Ratio: 1.000000
Larry, Time: 0.005467 Ratio: 0.111540
laxarray Time: 0.061751 Ratio: 1.259889
Impressive!
How about using Pandas to do this:
import numpy as np
import pandas as pd
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
print(dat)
print(dat.mean(1))  # plain NumPy mean: rows containing NaN come out as nan
df = pd.DataFrame(dat)
print(df.mean(axis=1))  # pandas skips NaN by default
Gives:
0 2.0
1 4.5
2 6.0
3 NaN
One more speed check for all proposed approaches:
Python 2.7.11 |Anaconda 2.4.1 (64-bit)| (default, Jan 19 2016, 12:08:31) [MSC v.1500 64 bit (AMD64)]
IPython 4.0.1 -- An enhanced Interactive Python.
import numpy as np
from scipy.stats.stats import nanmean
dat = np.random.normal(size=(1000,1000))
ii = np.ix_(np.random.randint(0,99,size=50),np.random.randint(0,99,size=50))
dat[ii] = np.nan
In[185]: def method1():
    mdat = np.ma.masked_array(dat,np.isnan(dat))
    mm = np.mean(mdat,axis=1)
    mm.filled(np.nan)
In[190]: %timeit method1()
100 loops, best of 3: 7.09 ms per loop
In[191]: %timeit [np.mean([l for l in d if not np.isnan(l)]) for d in dat]
1 loops, best of 3: 1.04 s per loop
In[192]: %timeit np.array([r[np.isfinite(r)].mean() for r in dat])
10 loops, best of 3: 19.6 ms per loop
In[193]: %timeit np.ma.masked_invalid(dat).mean(axis=1)
100 loops, best of 3: 11.8 ms per loop
In[194]: %timeit nanmean(dat,axis=1)
100 loops, best of 3: 6.36 ms per loop
In[195]: import bottleneck as bn
In[196]: %timeit bn.nanmean(dat,axis=1)
1000 loops, best of 3: 1.05 ms per loop
In[197]: from scipy import stats
In[198]: %timeit stats.nanmean(dat)
100 loops, best of 3: 6.19 ms per loop
So the fastest is bottleneck.nanmean(dat, axis=1).
scipy.stats.nanmean(dat) is not faster than numpy.nanmean(dat, axis=1).
From NumPy 1.8 (released 2013-10-30) onwards, nanmean does precisely what you need:
>>> import numpy as np
>>> np.nanmean(np.array([1.5, 3.5, np.nan]))
2.5
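For the 2-D array from the question, the same function can be applied along an axis. This is a small sketch assuming NumPy 1.8 or later; the warning filter is optional and only hides the RuntimeWarning that NumPy emits for a row with no valid values.

```python
import warnings

import numpy as np

dat = np.array([[1, 2, 3],
                [4, 5, np.nan],
                [np.nan, 6, np.nan],
                [np.nan, np.nan, np.nan]])

with warnings.catch_warnings():
    # An all-NaN row raises a RuntimeWarning; the result for that row is still nan
    warnings.simplefilter("ignore", category=RuntimeWarning)
    row_means = np.nanmean(dat, axis=1)

print(row_means)
```

The all-NaN row stays nan, so the result matches the expected output in the question without any masking step.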
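from numpy import array, isnan, mean, nan, nonzero, shape
# example data; replace with your own 2-D array
datMat = array([[1, 2, 3], [4, 5, nan], [nan, 6, nan]])
numFeat = shape(datMat)[1]
for i in range(numFeat):
    # mean of column i over the rows where that column is not NaN
    meanVal = mean(datMat[nonzero(~isnan(datMat[:, i]))[0], i])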
I suggest doing it this way:
import numpy as np
dat = np.array([[1, 2, 3], [4, 5, np.nan], [np.nan, 6, np.nan], [np.nan, np.nan, np.nan]])
dat2 = np.ma.masked_invalid(dat)  # mask NaN (and inf) entries
print(np.mean(dat2, axis=1))