I have a list of tuples
[(val1, freq1), (val2, freq2), ..., (valn, freqn)]. I need to get measures of central tendency (mean, median) and measures of deviation (variance, std) for the above data. I would also like to plot a boxplot of the values.
I see that numpy arrays have direct methods for getting the mean, median, and standard deviation (or variance) from a list of values.
Does numpy (or any other well-known library) have a direct means of operating on such a frequency distribution table?
Also: what is the best way to programmatically expand the above list of tuples into one list? (E.g. if the frequency distribution is
[(1,3), (50,2)], what is the best way to get the list
[1,1,1,50,50] to use with those methods?)
I see a custom function here, but I would like to use a standard implementation if possible.
To convert the (value, frequency) list to a list of values:
freqdist = [(1,3), (50,2)]
sum(([val,]*freq for val, freq in freqdist), [])
[1, 1, 1, 50, 50]
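Note that sum() with a list as the start value copies the accumulated list at every step, so it can get slow for long tables. A minimal alternative sketch using only the standard library (itertools.chain.from_iterable and itertools.repeat); the name `expanded` is just illustrative:

```python
from itertools import chain, repeat

freqdist = [(1, 3), (50, 2)]

# Repeat each value `freq` times and flatten the pieces into one list.
expanded = list(chain.from_iterable(repeat(val, freq) for val, freq in freqdist))
print(expanded)  # [1, 1, 1, 50, 50]
```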
To compute the mean, you can avoid building the list of values by using
np.average, which takes a weights argument:
vals, freqs = np.array(freqdist).T
np.average(vals, weights = freqs)
gives 20.6, as you would expect. I don’t think this works for the median, variance, or standard deviation, though.
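As a quick sanity check, the weighted average should equal the plain mean of the fully expanded list. The sketch below assumes integer frequencies, and `expanded`, `plain_mean` and `weighted_var` are illustrative names. The last step is an addition rather than part of the answer above: the same weights argument can also give a population variance, by averaging the squared deviations.

```python
import numpy as np

freqdist = [(1, 3), (50, 2)]
vals, freqs = np.array(freqdist).T

weighted_mean = np.average(vals, weights=freqs)

# Plain mean of the fully expanded list, for comparison.
expanded = [val for val, freq in freqdist for _ in range(freq)]
plain_mean = np.mean(expanded)

# Population variance (ddof=0) as the weighted average of squared deviations.
weighted_var = np.average((vals - weighted_mean) ** 2, weights=freqs)

print(weighted_mean, plain_mean)
print(weighted_var, np.var(expanded))
```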
First, I’d change that messy list into two
numpy arrays like @user8153 did:
val, freq = np.array(list_tuples).T
Then you can reconstruct the array (using
np.repeat to prevent looping):
data = np.repeat(val, freq)
Then use the standard numpy statistical functions on your data array.
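Putting these steps together, a minimal sketch might look like the following. The sample tuples are taken from the question, and the choice of ddof=1 (sample statistics) is an assumption; use ddof=0 if you want population values.

```python
import numpy as np

list_tuples = [(1, 3), (50, 2)]
val, freq = np.array(list_tuples).T

# np.repeat needs integer counts; with all-integer tuples both arrays are ints.
data = np.repeat(val, freq)

mean = data.mean()
median = np.median(data)
var = data.var(ddof=1)  # sample variance; ddof=0 gives the population variance
std = data.std(ddof=1)
print(data, mean, median, var, std)
```

The expanded `data` array can also be handed to a plotting routine, for example matplotlib's `plt.boxplot(data)`, to get the boxplot asked for in the question.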
If that causes memory errors (or you just want to squeeze out as much performance as possible), you can also use some purpose-built functions:
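def mean_(val, freq):
    return np.average(val, weights = freq)

def median_(val, freq):
    order = np.argsort(val)  # renamed from ord to avoid shadowing the builtin
    cdf = np.cumsum(freq[order])
    # side = 'right' picks the first bin whose cumulative count exceeds n // 2;
    # for an even total count this is the upper of the two middle values
    return val[order][np.searchsorted(cdf, cdf[-1] // 2, side = 'right')]

def mode_(val, freq):  # in the strictest sense, assuming a unique mode
    return val[np.argmax(freq)]

def var_(val, freq):  # sample variance (divides by n - 1)
    avg = mean_(val, freq)
    dev = freq * (val - avg) ** 2
    return dev.sum() / (freq.sum() - 1)

def std_(val, freq):
    return np.sqrt(var_(val, freq))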
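One way to gain confidence in frequency-weighted formulas like these is to compare them against the brute-force np.repeat expansion on a small table. The sketch below writes the formulas inline so it is self-contained; the random counts are arbitrary test data, and the median here uses side="right" so that it selects the element at 0-based position n // 2 of the sorted data.

```python
import numpy as np

rng = np.random.default_rng(0)
val = np.arange(10)                        # already sorted, so no argsort needed
freq = rng.integers(1, 20, size=val.size)  # arbitrary positive integer counts
data = np.repeat(val, freq)                # brute-force expansion, for comparison

# Weighted mean and sample variance (n - 1 in the denominator).
w_mean = np.average(val, weights=freq)
w_var = (freq * (val - w_mean) ** 2).sum() / (freq.sum() - 1)

# First bin whose cumulative count exceeds n // 2.
cdf = np.cumsum(freq)
w_median = val[np.searchsorted(cdf, cdf[-1] // 2, side="right")]

print(np.isclose(w_mean, data.mean()))
print(np.isclose(w_var, data.var(ddof=1)))
print(w_median == np.sort(data)[data.size // 2])
```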
import pandas as pd
import math
import numpy as np
Frequency Distributed Data
   class  freq
0  60-65     3
1  65-70   150
2  70-75   335
3  75-80   135
4  80-85     4
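The code that builds `df` is not shown. One way it might be constructed from the table above (an assumption; the original construction may well differ):

```python
import pandas as pd

# Hypothetical construction of the frequency table shown above.
df = pd.DataFrame({
    'class': ['60-65', '65-70', '70-75', '75-80', '80-85'],
    'freq': [3, 150, 335, 135, 4],
})
print(df)
```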
Create a midpoint column for the classes
df[['Lower','Upper']] = df['class'].str.split('-', expand=True)
df['Xi'] = (df['Lower'].astype(float) + df['Upper'].astype(float)) / 2
df.drop(['Lower','Upper'], axis=1, inplace=True)
   class  freq    Xi
0  60-65     3  62.5
1  65-70   150  67.5
2  70-75   335  72.5
3  75-80   135  77.5
4  80-85     4  82.5
mean = np.average(df['Xi'], weights=df['freq'])
mean
72.396331738437
std = np.sqrt(np.average((df['Xi']-mean)**2, weights=df['freq']))
std
3.5311919641103877
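Note that this is the population standard deviation (it divides by the total count), whereas a sample estimate divides by n - 1. A small sketch of both, with the midpoints and frequencies hard-coded as plain arrays; the names `xi`, `f` and `ss` are illustrative, and using class midpoints is itself an approximation for grouped data.

```python
import numpy as np

xi = np.array([62.5, 67.5, 72.5, 77.5, 82.5])  # class midpoints
f = np.array([3, 150, 335, 135, 4])            # class frequencies
n = f.sum()

mean = np.average(xi, weights=f)
ss = (f * (xi - mean) ** 2).sum()  # frequency-weighted sum of squared deviations

pop_std = np.sqrt(ss / n)           # same quantity as the np.average-based result
sample_std = np.sqrt(ss / (n - 1))  # Bessel-corrected version
print(pop_std, sample_std)
```

With 627 observations the two values differ only slightly, but the gap grows for small tables.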