Chi-Squared test in Python
Question:
I’ve used the following code in R to determine how well observed values (20, 20, 0 and 0, for example) fit expected values/ratios (25% for each of the four cases, for example):
> chisq.test(c(20,20,0,0), p=c(0.25, 0.25, 0.25, 0.25))
Chi-squared test for given probabilities
data: c(20, 20, 0, 0)
X-squared = 40, df = 3, p-value = 1.066e-08
How can I replicate this in Python? I’ve tried using the chisquare function from scipy, but the results I obtained were very different; I’m not sure if this is even the correct function to use. I’ve searched through the scipy documentation, but it’s quite daunting as it runs to 1000+ pages; the numpy documentation is almost 50% longer than that.
Answers:
scipy.stats.chisquare expects observed and expected absolute frequencies, not ratios. You can obtain what you want with
>>> import numpy as np
>>> from scipy.stats import chisquare
>>> observed = np.array([20., 20., 0., 0.])
>>> expected = np.array([.25, .25, .25, .25]) * np.sum(observed)
>>> chisquare(observed, expected)
(40.0, 1.065509033425585e-08)
If the expected values are uniformly distributed over the classes, you can leave out the computation of the expected values, since chisquare assumes equal expected frequencies by default:
>>> chisquare(observed)
(40.0, 1.065509033425585e-08)
The first returned value is the χ² statistic, the second the p-value of the test.
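To make the ratio-to-count conversion explicit, here is a minimal sketch that wraps it in a helper and cross-checks the statistic against the textbook sum of (observed - expected)² / expected. The helper name chisquare_from_probs is made up for this illustration; only numpy and scipy.stats.chisquare are real library calls.

```python
import numpy as np
from scipy.stats import chisquare


def chisquare_from_probs(observed, probs):
    """Run a chi-squared goodness-of-fit test given expected ratios.

    Hypothetical helper (not part of scipy): it scales the ratios by the
    total number of observations, because scipy.stats.chisquare wants
    expected counts rather than probabilities.
    """
    observed = np.asarray(observed, dtype=float)
    probs = np.asarray(probs, dtype=float)
    expected = probs * observed.sum()
    statistic, pvalue = chisquare(observed, f_exp=expected)
    return float(statistic), float(pvalue)


observed = np.array([20.0, 20.0, 0.0, 0.0])
probs = np.array([0.25, 0.25, 0.25, 0.25])

stat, pvalue = chisquare_from_probs(observed, probs)

# Cross-check: the statistic is the sum of (O - E)^2 / E over all categories.
expected = probs * observed.sum()
manual_stat = float(((observed - expected) ** 2 / expected).sum())

print(stat, manual_stat, pvalue)
```

Both the library call and the manual sum should agree with the X-squared = 40 that R reports for the same data.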
An alternative would be to call your R code from Python. You can do this:
- by making an R script run as a command-line tool. See this link for more information on running R scripts from the command line using Rscript. From Python you can then run the R script by executing a system call using either subprocess or os.system. Any data exchange is done through text or binary files. I like this approach because it is very simple, and it is easy to debug the R script separately from the Python code. The downside is that all data goes through the hard drive, which could prove to be very slow.
- by using rpy or rpy2 to run R code directly from within Python. This way the integration is tighter, but it also introduces its own little quirks. For example, in my experience R code called through rpy is a little harder to debug.
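As a rough sketch of the first option, the snippet below calls Rscript through subprocess. It assumes R is installed and Rscript is on your PATH; the function names build_rscript_command and run_chisq_in_r are made up for this example. Note that for such a small input it passes the data inline with Rscript's -e option instead of exchanging files, which is a simplification of the file-based approach described above.

```python
import shutil
import subprocess


def build_rscript_command(observed, probs):
    """Build the argument list for an Rscript call (hypothetical helper).

    The R expression is passed inline with -e, so no script file or
    data file is needed for a handful of numbers.
    """
    obs = ", ".join(str(x) for x in observed)
    p = ", ".join(str(x) for x in probs)
    expression = f"print(chisq.test(c({obs}), p=c({p})))"
    return ["Rscript", "-e", expression]


def run_chisq_in_r(observed, probs):
    """Run chisq.test in R and return its printed output as text.

    Returns None when Rscript cannot be found on the PATH.
    """
    if shutil.which("Rscript") is None:
        return None
    completed = subprocess.run(
        build_rscript_command(observed, probs),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return completed.stdout


command = build_rscript_command([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25])
```

Calling run_chisq_in_r([20, 20, 0, 0], [0.25, 0.25, 0.25, 0.25]) then returns whatever R prints, which you would still have to parse if you need the numbers rather than the text.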
Just wanted to point out that while the answer appears to be syntactically correct, you should not be using a chi-squared test with your example, because some of your observed frequencies are too small for the test to be accurate.
“This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5.” see:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html#scipy.stats.chisquare
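If you want to check that rule of thumb before running the test, something like the following would do. It is only a sketch: the function name small_cells and the default threshold of 5 come from the quoted rule, not from any scipy API.

```python
def small_cells(observed, expected, minimum=5):
    """Return the indices of categories that break the rule of thumb.

    A category is flagged when either its observed or its expected
    count is below `minimum` (hypothetical helper, not part of scipy).
    """
    return [
        index
        for index, (obs, exp) in enumerate(zip(observed, expected))
        if obs < minimum or exp < minimum
    ]


observed = [20, 20, 0, 0]
expected = [0.25 * sum(observed)] * 4  # 10 expected in each category

flagged = small_cells(observed, expected)
print(flagged)  # [2, 3]: the two categories with zero observations
```

For the data in the question the expected counts are fine (10 each), but the last two observed counts are 0, so those two categories are flagged.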