How to get Mean and Standard deviation from a Frequency Distribution table
Question:
I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need to get measures of central tendency (mean, median) and measures of dispersion (variance, std) for this data. I would also like to plot a boxplot of the values.
I see that numpy arrays have direct methods for getting the mean / median and standard deviation (or variance) from a list of values.
Does numpy (or any other well-known library) have a direct means of operating on such a frequency distribution table?
Also: what is the best way to programmatically expand the above list of tuples into one flat list? E.g., if the frequency distribution is [(1,3), (50,2)], what is the best way to get the list [1,1,1,50,50] so that I can use np.mean([1,1,1,50,50])?
I see a custom function here, but I would like to use a standard implementation if possible.
Answers:
-
To convert the (value, frequency) list to a list of values:
freqdist = [(1,3), (50,2)]
sum(([val,]*freq for val, freq in freqdist), [])
gives
[1, 1, 1, 50, 50]
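As an aside, repeatedly summing lists is quadratic in the number of tuples; a standard-library alternative (a minimal sketch of the same expansion) uses itertools:

```python
from itertools import chain, repeat

freqdist = [(1, 3), (50, 2)]

# chain.from_iterable flattens the repeated runs without the
# quadratic cost of sum(..., []) on lists
values = list(chain.from_iterable(repeat(val, freq) for val, freq in freqdist))
print(values)  # [1, 1, 1, 50, 50]
```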
-
To compute the mean you can avoid building the list of values by using np.average, which takes a weights argument:
vals, freqs = np.array(freqdist).T
np.average(vals, weights = freqs)
gives 20.6, as you would expect. I don’t think this works for the median, variance, or standard deviation, though.
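For completeness, here is that weighted mean as a self-contained snippet; note that a weighted (population) variance can also be expressed with np.average, though it differs from the sample variance (ddof=1) used later:

```python
import numpy as np

freqdist = [(1, 3), (50, 2)]
vals, freqs = np.array(freqdist).T

mean = np.average(vals, weights=freqs)               # (1*3 + 50*2) / 5 = 20.6
var = np.average((vals - mean) ** 2, weights=freqs)  # weighted *population* variance
std = np.sqrt(var)
print(mean)  # 20.6
```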
First, I’d change that messy list into two numpy
arrays like @user8153 did:
val, freq = np.array(list_tuples).T
Then you can reconstruct the array (using np.repeat to prevent looping):
data = np.repeat(val, freq)
And use numpy statistical functions on your data array.
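Putting the two steps above together (a small sketch with the example data from the question):

```python
import numpy as np

list_tuples = [(1, 3), (50, 2)]
val, freq = np.array(list_tuples).T

# expand each value by its frequency: array([ 1,  1,  1, 50, 50])
data = np.repeat(val, freq)

print(np.mean(data))          # 20.6
print(np.median(data))        # 1.0
print(np.std(data, ddof=1))   # sample standard deviation
```

For the boxplot asked about in the question, this expanded data array can be passed directly to matplotlib's plt.boxplot(data).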
If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:
def mean_(val, freq):
    return np.average(val, weights=freq)

def median_(val, freq):
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    return val[order][np.searchsorted(cdf, cdf[-1] // 2)]

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):  # sample variance, matching np.var(..., ddof=1)
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))
import pandas as pd
import numpy as np
Frequency Distributed Data
class freq
0 60-65 3
1 65-70 150
2 70-75 335
3 75-80 135
4 80-85 4
Create a midpoint column (Xi) for the classes:
df[['Lower','Upper']]=df['class'].str.split('-',expand=True)
df['Xi']=(df['Lower'].astype(float)+df['Upper'].astype(float))/2
df.drop(['Lower','Upper'],axis=1,inplace=True)
Therefore
class freq Xi
0 60-65 3 62.5
1 65-70 150 67.5
2 70-75 335 72.5
3 75-80 135 77.5
4 80-85 4 82.5
Mean
mean = np.average(df['Xi'], weights=df['freq'])
mean
72.396331738437
Standard Deviation
std = np.sqrt(np.average((df['Xi']-mean)**2,weights=df['freq']))
std
3.5311919641103877
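The whole pandas workflow above as one runnable sketch, including construction of the DataFrame that the answer assumes already exists:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'class': ['60-65', '65-70', '70-75', '75-80', '80-85'],
    'freq':  [3, 150, 335, 135, 4],
})

# class midpoints: split '60-65' into its bounds and average them
df[['Lower', 'Upper']] = df['class'].str.split('-', expand=True)
df['Xi'] = (df['Lower'].astype(float) + df['Upper'].astype(float)) / 2
df.drop(['Lower', 'Upper'], axis=1, inplace=True)

mean = np.average(df['Xi'], weights=df['freq'])
std = np.sqrt(np.average((df['Xi'] - mean) ** 2, weights=df['freq']))
print(mean)  # ≈ 72.3963
print(std)   # ≈ 3.5312
```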