How can I get descriptive statistics of a NumPy array?

Question:

I use the following code to create a numpy-ndarray. The file has 9 columns. I explicitly type each column:

dataset = np.genfromtxt("data.csv", delimiter=",",dtype=('|S1', float, float,float,float,float,float,float,int))

Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn’t there be an easy way to do this?

I tried this:

from scipy import stats
stats.describe(dataset)

but this returns an error: TypeError: cannot perform reduce with flexible type

How can I get descriptive statistics of the created NumPy array?

Asked By: beta

||

Answers:

This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D-array of tuples (actually np.void), which cannot be described by stats as it includes multiple different types, incl. strings.

This could be resolved by either reading it in two rounds, or using pandas with read_csv.

If you decide to stick to numpy:

import numpy as np
a = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=range(1,9))
s = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=0,dtype='|S1')

from scipy import stats
for arr in a: #do not need the loop at this point, but looks prettier
    print(stats.describe(arr))
#Output per print:
DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)

Note that in this example the final array has dtype as float, not int, but can easily (if necessary) be converted to int using arr.astype(int)

Answered By: M.T

The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2d array, and instead get a 1d that they can’t index by column. That’s because they get a structured array – with different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

But let me demonstrate how to access this kind of data

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?)

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])

In a case like this might be more useful if I choose a dtype with ‘subarrays’. This a more advanced dtype topic

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]: 
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

The character column is still loaded as S1, but the numbers are now in a 3 column array. Note that they are all float (or int).

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))
Answered By: hpaulj
import pandas as pd
import numpy as np

df_describe = pd.DataFrame(dataset)
df_describe.describe()

please note that dataset is your np.array to describe.

import pandas as pd
import numpy as np

df_describe = pd.DataFrame('your np.array')
df_describe.describe()
Answered By: INNO TECH

Official Scipy Documentation Example

#INPUT
from scipy import stats
a = np.arange(10)
stats.describe(a)

#OUTPUT
DescribeResult(nobs=10, minmax=(0, 9), mean=4.5, variance=9.166666666666666,
               skewness=0.0, kurtosis=-1.2242424242424244)

#INPUT
b = [[1, 2], [3, 4]]
stats.describe(b)

#OUTPUT
DescribeResult(nobs=2, minmax=(array([1, 2]), array([3, 4])),
               mean=array([2., 3.]), variance=array([2., 2.]),
               skewness=array([0., 0.]), kurtosis=array([-2., -2.]))
Answered By: sogu