# How to get Mean and Standard deviation from a Frequency Distribution table

## Question:

I have a list of tuples `[(val1, freq1), (val2, freq2), ..., (valn, freqn)]`. I need measures of central tendency (mean, median) and measures of dispersion (variance, standard deviation) for this data. I would also like to draw a boxplot of the values.

I see that numpy has direct functions for getting the mean, median, and standard deviation (or variance) from a list of values.

Does numpy (or any other well-known library) have a direct means to operate on such a frequency distribution table?

Also: what is the best way to programmatically expand the above list of tuples into one list? (E.g., if the frequency distribution is `[(1,3), (50,2)]`, what is the best way to get the list `[1,1,1,50,50]` so I can call `np.mean([1,1,1,50,50])`?)

I see a custom function here, but I would like to use a standard implementation if possible.

## Answers:

• To convert the (value, frequency) list to a list of values:

``````
freqdist = [(1, 3), (50, 2)]
sum(([val] * freq for val, freq in freqdist), [])
``````

gives

``````
[1, 1, 1, 50, 50]
``````
• To compute the mean, you can avoid building the list of values by using `np.average`, which takes a `weights` argument:

``````
import numpy as np

vals, freqs = np.array(freqdist).T
np.average(vals, weights=freqs)
``````

gives 20.6 as you would expect. I don’t think this works for the median, variance, or standard deviation, though.
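A side note on the list expansion in the first bullet: `sum(..., [])` creates a new list at every step, so it can become slow for long tables. A minimal sketch of an alternative using the standard library's `itertools`, assuming the same `freqdist` as above (the name `expanded` is only illustrative):

```python
from itertools import chain, repeat

freqdist = [(1, 3), (50, 2)]

# repeat(val, freq) yields val exactly freq times; chain.from_iterable
# concatenates those runs lazily, so no intermediate lists are built.
expanded = list(chain.from_iterable(repeat(val, freq) for val, freq in freqdist))
print(expanded)  # [1, 1, 1, 50, 50]
```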

First, I’d change that messy list into two `numpy` arrays like @user8153 did:

``````
val, freq = np.array(list_tuples).T
``````

Then you can reconstruct the array (using `np.repeat` to avoid looping):

``````
data = np.repeat(val, freq)
``````

And use `numpy` statistical functions on your `data` array.
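As a rough illustration of that last step, here is a sketch using the small example from the question (`list_tuples` below is just that sample data). Keep in mind that `np.var` and `np.std` default to the population formulas (`ddof=0`); pass `ddof=1` for the sample versions.

```python
import numpy as np

list_tuples = [(1, 3), (50, 2)]  # sample data from the question
val, freq = np.array(list_tuples).T
data = np.repeat(val, freq)      # array([1, 1, 1, 50, 50])

mean = data.mean()               # 20.6
median = np.median(data)         # 1.0
sample_var = data.var(ddof=1)    # divides by n - 1
sample_std = data.std(ddof=1)
print(mean, median, sample_var, sample_std)
```

The expanded `data` array can also be passed to a plotting call such as matplotlib's `plt.boxplot(data)` for the boxplot asked about in the question.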

If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:

``````
def mean_(val, freq):
    return np.average(val, weights=freq)

def median_(val, freq):
    order = np.argsort(val)
    cdf = np.cumsum(freq[order])
    # First value whose cumulative count reaches the middle observation;
    # for an even total this is the lower of the two middle values.
    return val[order][np.searchsorted(cdf, (cdf[-1] + 1) // 2)]

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):  # sample variance (n - 1 in the denominator)
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))
``````
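One way to gain confidence in frequency-weighted formulas like these is to compare them against the expanded array on a small input. The sketch below re-implements the weighted sample variance inline (the name `weighted_var` and the test data are hypothetical, not part of numpy or the answer) and checks it against `np.repeat` plus the ordinary numpy functions:

```python
import numpy as np

def weighted_var(val, freq):
    """Sample variance of values `val`, each occurring `freq` times."""
    avg = np.average(val, weights=freq)
    return (freq * (val - avg) ** 2).sum() / (freq.sum() - 1)

val = np.array([1, 50, 7])
freq = np.array([3, 2, 4])

expanded = np.repeat(val, freq)  # the full data set, built only for comparison
assert np.isclose(weighted_var(val, freq), expanded.var(ddof=1))
assert np.isclose(np.average(val, weights=freq), expanded.mean())
```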

``````
import pandas as pd
import numpy as np
``````

Frequency Distributed Data

``````
    class   freq
0   60-65   3
1   65-70   150
2   70-75   335
3   75-80   135
4   80-85   4
``````
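The answer does not show how `df` was created. A minimal sketch, assuming the table is simply typed in by hand as a dictionary of columns:

```python
import pandas as pd

# Class intervals and their frequencies, copied from the table above.
df = pd.DataFrame({
    "class": ["60-65", "65-70", "70-75", "75-80", "80-85"],
    "freq": [3, 150, 335, 135, 4],
})
print(df)
```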

Create Middle point column for classes

``````
# Each class label is "lower-upper"; the midpoint is the average of the two bounds.
df[['Lower', 'Upper']] = df['class'].str.split('-', expand=True)
df['Xi'] = (df['Lower'].astype(float) + df['Upper'].astype(float)) / 2
df.drop(['Lower', 'Upper'], axis=1, inplace=True)
``````

Therefore

``````
    class   freq  Xi
0   60-65   3     62.5
1   65-70   150   67.5
2   70-75   335   72.5
3   75-80   135   77.5
4   80-85   4     82.5
``````

Mean

``````
mean = np.average(df['Xi'], weights=df['freq'])
mean
72.396331738437
``````

Standard Deviation

``````
std = np.sqrt(np.average((df['Xi']-mean)**2, weights=df['freq']))
std
3.5311919641103877
``````
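The standard deviation above divides by the total frequency, so it is the population figure. If the sample standard deviation (dividing by N - 1) is wanted instead, a sketch along the same lines, using plain arrays for the midpoints and frequencies from the table (`xi` and `f` are illustrative names):

```python
import numpy as np

xi = np.array([62.5, 67.5, 72.5, 77.5, 82.5])  # class midpoints
f = np.array([3, 150, 335, 135, 4])            # class frequencies

mean = np.average(xi, weights=f)
sq_dev = f * (xi - mean) ** 2                  # frequency-weighted squared deviations

pop_std = np.sqrt(sq_dev.sum() / f.sum())            # population: divide by N
sample_std = np.sqrt(sq_dev.sum() / (f.sum() - 1))   # sample: divide by N - 1
print(pop_std, sample_std)
```

With 627 observations the two figures differ only slightly; the distinction matters more for small tables.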