How to get Mean and Standard deviation from a Frequency Distribution table

Question:

I have a list of tuples [(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need measures of central tendency (mean, median) and measures of dispersion (variance, std) for this data. I would also like to draw a boxplot of the values.

I see that numpy arrays have direct methods for getting mean / median and standard deviation (or variance) from list of values.

Does numpy (or any other well-known library) have a direct way to operate on such a frequency distribution table?

Also: what is the best way to programmatically expand the above list of tuples into one flat list? For example, if the frequency distribution is [(1,3), (50,2)], what is the best way to get the list [1,1,1,50,50] so that I can call np.mean([1,1,1,50,50])?

I see a custom function here, but I would like to use a standard implementation if possible.

Asked By: jithu83


Answers:

  • To convert the (value, frequency) list to a list of values:

    freqdist =  [(1,3), (50,2)]
    sum(([val,]*freq for val, freq in freqdist), []) 
    

    gives

    [1, 1, 1, 50, 50]
    
  • To compute the mean, you can avoid building the list of values by using np.average, which takes a weights argument:

    vals, freqs = np.array(freqdist).T
    np.average(vals, weights = freqs)
    

    gives 20.6, as you would expect. I don’t think this works for the median, variance, or standard deviation, though.
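As a rough sketch of how the two bullets above fit together without numpy, the snippet below expands the table with itertools and uses the standard library's statistics module on the result. The variable names (values, weighted_mean, and so on) are illustrative and not from the original answer.

```python
from itertools import chain, repeat
import statistics

freqdist = [(1, 3), (50, 2)]

# Expand each (value, frequency) pair lazily, then flatten into one list.
values = list(chain.from_iterable(repeat(val, freq) for val, freq in freqdist))

# Weighted mean computed directly from the table, without expanding it.
total = sum(freq for _, freq in freqdist)
weighted_mean = sum(val * freq for val, freq in freqdist) / total

# The statistics module works on the expanded list.
median = statistics.median(values)
sample_std = statistics.stdev(values)  # divides by n - 1
```

Unlike sum(..., []), chain.from_iterable does not rebuild the list on every step, which may matter for long tables.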

Answered By: user8153

First, I’d change that messy list into two numpy arrays like @user8153 did:

val, freq = np.array(list_tuples).T

Then you can reconstruct the array (using np.repeat to avoid looping):

data = np.repeat(val, freq)

And use numpy statistical functions on your data array.
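For illustration, here is a minimal sketch of that workflow on the example from the question. It assumes the frequencies are whole numbers (np.repeat requires integer counts). The quartiles at the end are the values a boxplot is drawn from.

```python
import numpy as np

list_tuples = [(1, 3), (50, 2)]
val, freq = np.array(list_tuples).T

# np.repeat needs integer repeat counts, so freq must hold whole numbers.
data = np.repeat(val, freq)

mean = data.mean()
median = np.median(data)
sample_var = data.var(ddof=1)   # ddof=1 divides by n - 1
sample_std = data.std(ddof=1)

# Quartiles, i.e. the numbers a boxplot is built from.
q1, q2, q3 = np.percentile(data, [25, 50, 75])
```

If matplotlib is available, passing the expanded array to matplotlib.pyplot.boxplot(data) should give the boxplot asked for in the question.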


If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:

def mean_(val, freq):
    return np.average(val, weights = freq)

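def median_(val, freq):
    order = np.argsort(val)  # renamed from ord to avoid shadowing the built-in
    cdf = np.cumsum(freq[order])
    # rank of the median is (N + 1) // 2; for even N this picks the lower median
    return val[order][np.searchsorted(cdf, (cdf[-1] + 1) // 2)]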

def mode_(val, freq): #in the strictest sense, assuming unique mode
    return val[np.argmax(freq)]

def var_(val, freq):
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))
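
One way to sanity-check weighted formulas like these is to compare them against numpy's own results on the expanded data. The sketch below inlines the same mean and variance formulas so that it runs on its own; the w_ prefixed names are illustrative.

```python
import numpy as np

val = np.array([1, 50])
freq = np.array([3, 2])

# Weighted formulas, working only on the frequency table.
w_mean = np.average(val, weights=freq)
w_var = (freq * (val - w_mean) ** 2).sum() / (freq.sum() - 1)
w_std = np.sqrt(w_var)

# Reference values from the fully expanded data.
data = np.repeat(val, freq)
assert np.isclose(w_mean, data.mean())
assert np.isclose(w_var, data.var(ddof=1))   # same n - 1 divisor as var_
assert np.isclose(w_std, data.std(ddof=1))
```

Note that the variance here divides by N - 1 (sample variance), which corresponds to ddof=1 in numpy rather than numpy's default ddof=0.
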
Answered By: Daniel F
import pandas as pd
import numpy as np

Frequency Distributed Data

    class   freq
0   60-65   3
1   65-70   150
2   70-75   335
3   75-80   135
4   80-85   4
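
The answer does not show how df is created. One plausible way to build the table above, assuming the class column holds the interval labels as strings, would be:

```python
import pandas as pd

# Rebuild the table shown above; the original construction code is not given.
df = pd.DataFrame({
    "class": ["60-65", "65-70", "70-75", "75-80", "80-85"],
    "freq": [3, 150, 335, 135, 4],
})
```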

Create Middle point column for classes

# the first part of "60-65" is the lower bound, the second the upper bound
df[['Lower','Upper']]=df['class'].str.split('-',expand=True)
df['Xi']=(df['Lower'].astype(float)+df['Upper'].astype(float))/2
df.drop(['Lower','Upper'],axis=1,inplace=True)

Therefore

    class   freq  Xi
0   60-65   3     62.5
1   65-70   150   67.5
2   70-75   335   72.5
3   75-80   135   77.5
4   80-85   4     82.5

Mean

mean = np.average(df['Xi'], weights=df['freq'])
mean
72.396331738437

Standard Deviation (population form, dividing by the total frequency)

std = np.sqrt(np.average((df['Xi']-mean)**2,weights=df['freq']))
std
3.5311919641103877
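
The value above divides by the total frequency N. If a sample standard deviation (dividing by N - 1) is wanted instead, a small variation along these lines should work. The sketch uses plain numpy arrays in place of the DataFrame columns, and the names xi, sq_dev, pop_std and sample_std are illustrative.

```python
import numpy as np

xi = np.array([62.5, 67.5, 72.5, 77.5, 82.5])   # class midpoints
freq = np.array([3, 150, 335, 135, 4])

mean = np.average(xi, weights=freq)
sq_dev = freq * (xi - mean) ** 2

pop_std = np.sqrt(sq_dev.sum() / freq.sum())            # divides by N
sample_std = np.sqrt(sq_dev.sum() / (freq.sum() - 1))   # divides by N - 1
```

With 627 observations the two results differ only slightly, but the gap grows for small tables.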
Answered By: Aditya Rajgor