How can I calculate the variance of a list in python?
Question:
If I have a list like this:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
I want to calculate the variance of this list in Python, which is the average of the squared differences from the mean.
How can I go about this? I'm confused about how to access the elements of the list in order to compute the squared differences.
Answers:
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If, for whatever reason, you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a generator expression:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between variance n and n-1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n-1 case you can simply do:
np.var(results, ddof=1)
The “by hand” solution is given in @Serge Ballesta’s answer.
Both approaches yield 32.024849178421285.
You can also set the parameter for std:
np.std(results, ddof=1)
5.659050201086865
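If numpy is not available, the effect of ddof can be reproduced in plain Python. Below is a minimal sketch, assuming a hypothetical helper named variance_ddof (it is not part of numpy or the standard library). It divides the sum of squared deviations by n - ddof, which is what the ddof option described above controls:

```python
import math


def variance_ddof(values, ddof=0):
    """Variance of `values`, dividing by (n - ddof).

    Hypothetical helper for illustration only: ddof=0 gives the
    variance n (population), ddof=1 the variance n-1 (sample).
    """
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more than ddof data points")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(variance_ddof(results))                     # about 28.8224
print(variance_ddof(results, ddof=1))             # about 32.0248
print(math.sqrt(variance_ddof(results, ddof=1)))  # about 5.6591
```

The square root of the result gives the matching standard deviation, so the same ddof choice carries over to std.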
Well, there are two ways of defining the variance. You have the variance n, which you use when you have the full set (the whole population), and the variance n-1, which you use when you only have a sample.
The difference between the two is whether the value m = sum(xi) / n is the real average or just an approximation of what the average should be.
Example 1: you want to know the average height of the students in a class and its variance. Here the value m = sum(xi) / n is the real average, and the formulas given by Cleb are fine (variance n).
Example 2: you want to know the average time at which a bus passes the bus stop and its variance. You note the time every day for a month and get 30 values. Here the value m = sum(xi) / n is only an approximation of the real average, and that approximation becomes more accurate with more values. In that case the best approximation of the actual variance is the variance n-1:
varRes = sum([(xi - m)**2 for xi in results]) / (len(results) -1)
OK, this has nothing to do with Python, but it does have an impact on statistical analysis, and the question is tagged statistics and variance.
Note: libraries differ in which definition they use by default. numpy uses the variance n for both var and std (ddof=0), while the variance and stdev functions in Python's statistics module use the variance n-1, so check the documentation of the function you call.
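Since the two definitions differ only in the divisor, one can be converted into the other by multiplying by n / (n - 1). A short sketch using the standard library's statistics module (Python 3.4+); the bus arrival times are made-up numbers, chosen only for illustration:

```python
from statistics import pvariance, variance

# Hypothetical sample: minutes past the hour at which a bus arrived.
arrivals = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]

n = len(arrivals)
var_n = pvariance(arrivals)         # variance n: divides by n
var_n_minus_1 = variance(arrivals)  # variance n-1: divides by n - 1

print(var_n, var_n_minus_1)

# The variance n-1 is the variance n scaled by n / (n - 1).
print(abs(var_n * n / (n - 1) - var_n_minus_1) < 1e-12)
```

The scaling factor approaches 1 as n grows, so the choice matters most for small samples.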
Numpy is indeed the most elegant and fast way to do it.
I think the actual question was about how to access the individual elements of a list to do such a calculation yourself, so below is an example:
results=[-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
import numpy as np
# values are rounded to 10 decimals when printed
print('numpy variance:', round(np.var(results), 10))
# without numpy, by hand
# there are two ways of calculating the variance
# - 1. directly, as the 2nd-order central moment (https://en.wikipedia.org/wiki/Moment_(mathematics)): the sum of squared deviations divided by the length of the vector
# - 2. "mean of square minus square of mean" (see https://en.wikipedia.org/wiki/Variance)
# calculate mean
n = len(results)
total = 0  # named total so that the built-in sum() is not shadowed
for i in range(n):
    total = total + results[i]
mean = total / n
print('mean:', round(mean, 10))
# calculate the central moment
sum2 = 0
for i in range(n):
    sum2 = sum2 + (results[i] - mean)**2
myvar1 = sum2 / n
print('my variance1:', round(myvar1, 10))
# calculate the mean of square minus square of mean
sum3 = 0
for i in range(n):
    sum3 = sum3 + results[i]**2
myvar2 = sum3 / n - mean**2
print('my variance2:', round(myvar2, 10))
gives you:
numpy variance: 28.8223642606
mean: -3.731599805
my variance1: 28.8223642606
my variance2: 28.8223642606
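One caveat about the second formula: "mean of square minus square of mean" subtracts two large, nearly equal numbers, so it can lose precision in floating point when the mean is large compared to the spread. For the data in the question both formulas agree, but the sketch below shows the effect on made-up values around 1e9. The function names are hypothetical.

```python
def variance_two_pass(values):
    """Variance n as the mean of squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_naive(values):
    """Variance n as mean of squares minus square of the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(x ** 2 for x in values) / n - mean ** 2


# Made-up data: a small spread around a large offset.
shifted = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]

print(variance_two_pass(shifted))  # 22.5, the exact variance n
print(variance_naive(shifted))     # off by much more than 1 because of rounding
```

If the data may have a large mean, the first formula (or an incremental algorithm) is the safer choice.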
Starting with Python 3.4, the standard library comes with the variance function (sample variance or variance n-1) as part of the statistics module:
from statistics import variance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
variance(data)
# 32.024849178421285
The population variance (or variance n) can be obtained using the pvariance function:
from statistics import pvariance
# data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439, 0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
pvariance(data)
# 28.822364260579157
Also note that if you already know the mean of your list, the variance and pvariance functions take a second argument (xbar and mu respectively) in order to avoid recomputing the mean of the sample (which is part of the variance computation).
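As a small sketch of that second argument: the mean is computed once and then handed to both functions. Note that, as far as I know, the functions do not check that the value you pass really is the mean, so passing a wrong value silently gives a wrong result.

```python
from statistics import mean, pvariance, variance

data = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
        0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

m = mean(data)  # computed once, reused below

sample_var = variance(data, m)       # second argument is xbar
population_var = pvariance(data, m)  # second argument is mu

print(sample_var)      # about 32.0248
print(population_var)  # about 28.8224
```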
The correct answer is to use one of the packages like NumPy, but if you want to roll your own and compute the variance incrementally, there is a good algorithm that has higher accuracy. See https://www.johndcook.com/blog/standard_deviation/
I ported my Perl implementation to Python. Please point out issues in the comments.
Mklast = 0
Mk = 0
Sk = 0
k = 0
for xi in results:
    k = k + 1
    # running mean after k values
    Mk = Mklast + (xi - Mklast) / k
    # running sum of squared deviations
    Sk = Sk + (xi - Mklast) * (xi - Mk)
    Mklast = Mk
# sample variance (variance n-1)
var = Sk / (k - 1)
print(round(var, 10))
Answer is
32.0248491784
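The same update rule (the linked page attributes it to Welford) can be wrapped in a function so that it works on any iterable, including a generator that cannot be traversed twice. This is only a sketch; the name running_variance is made up for the example.

```python
def running_variance(values):
    """Sample variance (n-1) computed in a single pass.

    Uses the same recurrence as the loop above: `mean` plays the role
    of Mk and `sum_sq` the role of Sk.
    """
    mean = 0.0
    sum_sq = 0.0  # running sum of squared deviations
    count = 0
    for x in values:
        count += 1
        delta = x - mean             # deviation from the previous mean
        mean += delta / count        # updated mean
        sum_sq += delta * (x - mean)
    if count < 2:
        raise ValueError("sample variance needs at least two values")
    return sum_sq / (count - 1)


results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(running_variance(results))                 # about 32.0248
print(running_variance(x for x in range(1, 6)))  # 2.5
```

Dividing sum_sq by count instead of count - 1 would give the variance n.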
import numpy as np
def get_variance(xs):
    mean = np.mean(xs)
    summed = 0
    for x in xs:
        summed += (x - mean)**2
    return summed / (len(xs))
print(get_variance([1,2,3,4,5]))
Output: 2.0
a = [1,2,3,4,5]
variance = np.var(a, ddof=1)
print(variance)
Without imports, I would use the following python3 script:
#!/usr/bin/env python3

def createData():
    data1=[12,54,60,3,15,6,36]
    data2=[1,2,3,4,5]
    data3=[100,30000,1567,3467,20000,23457,400,1,15]
    dataset=[]
    dataset.append(data1)
    dataset.append(data2)
    dataset.append(data3)
    return dataset

def calculateMean(data):
    means=[]
    # one list of the nested list
    for oneDataset in data:
        sum=0
        mean=0
        # one datapoint in one inner list
        for number in oneDataset:
            # summing up
            sum+=number
        # mean for one inner list
        mean=sum/len(oneDataset)
        # adding a tuple of the original data and their mean to
        # a list of tuples
        item=(oneDataset, mean)
        means.append(item)
    return means

# subtract the mean from each element and square the result,
# sum up the squared results and divide by the number of elements
def calculateVariance(meanData):
    variances=[]
    # meanData is the list of tuples
    # pair is one tuple
    for pair in meanData:
        # pair[0] is the original data, pair[1] is its mean
        interResult=0
        squareSum=0
        for element in pair[0]:
            interResult=(element-pair[1])**2
            squareSum+=interResult
        variance=squareSum/len(pair[0])
        variances.append((pair[0], pair[1], variance))
    return variances

def main():
    my_data=createData()
    my_means=calculateMean(my_data)
    my_variances=calculateVariance(my_means)
    print(my_variances)

if __name__ == "__main__":
    main()
Here you get a printout of the original data, their mean and their variance. I know this approach covers a list of several datasets, yet I think you can adapt it quickly for your purpose 😉
Here's my solution:
vac_nums = [0,0,0,0,0,
            1,1,1,1,1,1,1,1,
            2,2,2,2,
            3,3,3
            ]
# your code goes here
mean = sum(vac_nums)/len(vac_nums)
count = 0
for i in range(len(vac_nums)):
    # squared difference of one element from the mean
    variance = (vac_nums[i]-mean)**2
    count += variance
print(count/len(vac_nums))
Sometimes all I want to do is shut my brain off and copy-paste:
def get_mean_var(results):
    # calculate mean
    mean = sum(results) / len(results)
    # calculate variance (variance n) using a generator expression
    var = sum((xi - mean) ** 2 for xi in results) / len(results)
    # round only at the end so that a rounded mean does not distort the variance
    return round(mean, 2), round(var, 2)
USAGE
get_mean_var([1,3,34])
(12.67, 228.22)