p value generated by scipy.stats.chi2_contingency for independence testing

Question:

For testing if two features are independent or not,
H0: A and B are independent
H1: A and B are dependent

If p < 0.05, we reject H0 and conclude that A and B are dependent.

When I try the following code, where it is very clear that the two arrays are dependent (they are the same array):

import numpy as np
import scipy.stats
obs = np.array([[10, 10, 10], [10, 10, 10]])
scipy.stats.chi2_contingency(obs)

I get the following result:

(0.0, 1.0, 2, array([[10., 10., 10.],
        [10., 10., 10.]]))

i.e. the p-value is 1.0 > 0.05, so we accept the null hypothesis that the two arrays are independent of each other.

Is there an assumption I got wrong or is it generating 1-p values?

Asked By: Phoenix


Answers:

The result you get is correct. It means the test found no evidence of an association between the row variable and the column variable of your table. Independence of events means that the occurrence of one event does not affect or influence the probability of the other.

In your example, all cell counts are the same, so each row has the same distribution across the columns. In terms of probability, the occurrence of event A does not depend on event B.

  P(A|B) = P(A)  or P(B|A) = P(B)

which reads: the probability of event A given event B is the same as the probability of A, since A and B are independent. Both equalities hold exactly for the counts in your table, which is why the chi-square statistic is 0 and the p-value is 1.
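
To make this concrete, here is a minimal sketch in plain Python that reproduces the numbers from the question by hand. It assumes the usual Pearson formula (expected count = row total * column total / grand total) with no continuity correction. The helper names `expected_counts` and `pearson_chi2` are made up for illustration and are not part of SciPy.

```python
import math

def expected_counts(table):
    """Expected counts under independence: row_total * col_total / grand_total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def pearson_chi2(table):
    """Pearson chi-square statistic, without continuity correction."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

obs = [[10, 10, 10], [10, 10, 10]]
expected = expected_counts(obs)
stat = pearson_chi2(obs)
dof = (len(obs) - 1) * (len(obs[0]) - 1)

# For dof == 2 the chi-square survival function has the closed form exp(-x / 2).
p_value = math.exp(-stat / 2)

print(expected, stat, dof, p_value)
```

The observed counts equal the expected counts in every cell, so there is no deviation from independence for the test to detect.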

Answered By: jose_bacoy

My opinion…

The name “independence test” can be misleading. It is easier to think of it as a “dependence test”, where:

H0: no dependence –> not rejected if p_value > threshold

H1: dependence –> supported if p_value < threshold

where threshold is the “level of significance”, usually alpha = 0.05

Therefore a table with a strong association such as [[1, 50], [50, 50]] gives a p-value close to 0, whereas a table whose rows are proportional to each other, like the one in the question, gives a p-value close to 1.

Answered By: slawek

First of all, according to the link
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
I think you made a mistake in using chi2_contingency.
This function computes the chi-square statistic and p-value for the hypothesis test of independence of the observed frequencies in the contingency table observed.

So, if you treat your two arrays as raw data for two variables and create the contingency table from them, it has just one row and one column (every value is 10), which does not make sense.

Finally, note that the chi-squared test is used for two categorical variables.
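
As a rough sketch of what "creating the contingency table" means, the snippet below counts the co-occurrences of two categorical sequences using only the standard library. The helper `contingency_table` and the sample data `a` and `b` are hypothetical and only serve the illustration.

```python
from collections import Counter

def contingency_table(a, b):
    """Cross-tabulate two equally long sequences of category labels."""
    rows = sorted(set(a))
    cols = sorted(set(b))
    counts = Counter(zip(a, b))
    # One row per category of a, one column per category of b.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

# Hypothetical raw data: one (a, b) pair per observation.
a = ["x", "x", "y", "y", "x", "y", "x", "y"]
b = ["u", "v", "u", "v", "u", "v", "v", "u"]
rows, cols, table = contingency_table(a, b)
print(rows, cols, table)

# Two identical arrays that contain a single value collapse to a 1x1 table.
_, _, degenerate = contingency_table([10, 10, 10], [10, 10, 10])
print(degenerate)
```

A table like `table` is the kind of input `chi2_contingency` expects, while the 1x1 result shows why cross-tabulating the arrays from the question leaves nothing to test.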

About the p-value: yes, you are right.
If the p-value is greater than 0.05, then we cannot reject the null hypothesis that the two arrays are independent of each other.
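
To illustrate that decision rule, here is a short sketch that runs `chi2_contingency` on the table from the question and on a second table with a visible association. The second table, the `decide` helper, and the 0.05 threshold are assumptions made for this example.

```python
import numpy as np
from scipy.stats import chi2_contingency

ALPHA = 0.05  # significance level (assumed for this example)

def decide(table, alpha=ALPHA):
    """Return (p_value, decision) for a chi-square test of independence."""
    stat, p_value, dof, expected = chi2_contingency(table)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return p_value, decision

question_table = np.array([[10, 10, 10], [10, 10, 10]])
associated_table = np.array([[30, 10], [10, 30]])  # made-up table with a clear association

p_question, decision_question = decide(question_table)
p_assoc, decision_assoc = decide(associated_table)

print("question table:  ", p_question, decision_question)
print("associated table:", p_assoc, decision_assoc)
```

A large p-value only means the data give no evidence against independence. It does not prove that the variables are independent.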

Answered By: yaser gholizade