ANOVA from OLS model with DataFrame returning incorrect degrees of freedom (and other values)

Question:

I load an example dataset into a DataFrame, fit a statsmodels OLS model of Texture as a function of Mix, and then use that model to produce an ANOVA table.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('contrastExampleData.csv')

mod = ols(formula = 'Texture ~ Mix', data = df).fit()
aov_table = sm.stats.anova_lm(mod, typ = 1)
print(aov_table)

If it’s preferred that I upload the csv and link it, please let me know.
The dataframe:

    Mix  Blend Flour  SPI  Texture
0     1    0.5   KSS  1.1    107.3
1     1    0.5   KSS  1.1    110.1
2     1    0.5   KSS  1.1    112.6
3     2    0.5   KSS  2.2     97.9
4     2    0.5   KSS  2.2    100.1
5     2    0.5   KSS  2.2    102.0
6     3    0.5   KSS  3.3     86.8
7     3    0.5   KSS  3.3     88.1
8     3    0.5   KSS  3.3     89.1
9     4    0.5   KNC  1.1    108.1
10    4    0.5   KNC  1.1    110.1
11    4    0.5   KNC  1.1    111.8
12    5    0.5   KNC  2.2    108.6
13    5    0.5   KNC  2.2    110.2
14    5    0.5   KNC  2.2    111.2
15    6    0.5   KNC  3.3     95.0
16    6    0.5   KNC  3.3     95.4
17    6    0.5   KNC  3.3     95.5
18    7    1.0   KSS  1.1     97.3
19    7    1.0   KSS  1.1     99.1
20    7    1.0   KSS  1.1    100.6
21    8    1.0   KSS  2.2     92.8
22    8    1.0   KSS  2.2     94.6
23    8    1.0   KSS  2.2     96.7
24    9    1.0   KSS  3.3     86.8
25    9    1.0   KSS  3.3     88.1
26    9    1.0   KSS  3.3     89.1
27   10    1.0   KNC  1.1     94.1
28   10    1.0   KNC  1.1     96.1
29   10    1.0   KNC  1.1     97.8
30   11    1.0   KNC  2.2     95.7
31   11    1.0   KNC  2.2     97.6
32   11    1.0   KNC  2.2     99.8
33   12    1.0   KNC  3.3     90.2
34   12    1.0   KNC  3.3     92.1
35   12    1.0   KNC  3.3     93.7

Resulting in output:


            df       sum_sq     mean_sq          F    PR(>F)
Mix        1.0   520.080472  520.080472  10.828726  0.002334
Residual  34.0  1632.947028   48.027854        NaN       NaN

However, this is incorrect; the correct ANOVA table can be seen here. The first thing I notice is that the degrees of freedom for Mix should be 11 rather than 1, since Mix has 12 levels, but I cannot figure out why this happened. I've done similar analyses on simpler two-column datasets without any issue. I've also tried sm.OLS and other approaches, without success. What is causing the incorrect ANOVA?
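For reference, the values I expect can be checked by hand with a one-way ANOVA. This is only a minimal sketch in plain Python: it uses the Texture values listed above and assumes three replicates per Mix level in the order shown. The variable names are my own.

```python
from collections import defaultdict

# Texture values from the table above, three replicates per Mix level.
texture = [
    107.3, 110.1, 112.6,  # Mix 1
    97.9, 100.1, 102.0,   # Mix 2
    86.8, 88.1, 89.1,     # Mix 3
    108.1, 110.1, 111.8,  # Mix 4
    108.6, 110.2, 111.2,  # Mix 5
    95.0, 95.4, 95.5,     # Mix 6
    97.3, 99.1, 100.6,    # Mix 7
    92.8, 94.6, 96.7,     # Mix 8
    86.8, 88.1, 89.1,     # Mix 9
    94.1, 96.1, 97.8,     # Mix 10
    95.7, 97.6, 99.8,     # Mix 11
    90.2, 92.1, 93.7,     # Mix 12
]
mix = [i // 3 + 1 for i in range(len(texture))]  # labels 1..12

# Collect the observations belonging to each Mix level.
groups = defaultdict(list)
for level, value in zip(mix, texture):
    groups[level].append(value)

n = len(texture)   # total observations
k = len(groups)    # number of Mix levels
grand_mean = sum(texture) / n

# Between-group and within-group sums of squares.
ss_between = sum(
    len(ys) * (sum(ys) / len(ys) - grand_mean) ** 2 for ys in groups.values()
)
ss_within = sum(
    (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
)

df_between = k - 1  # 12 levels give 11 degrees of freedom
df_within = n - k   # 36 observations minus 12 levels
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(df_between, df_within, round(ss_between, 4), round(ss_within, 4), round(f_stat, 4))
```

With 12 levels and 36 observations this gives 11 and 24 degrees of freedom, which is what I expected statsmodels to report.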

Asked By: Farooq Zahid

Answers:

This is effectively answered by this R question, since statsmodels uses R-style formulae. I found it just after posting and wanted to record the answer for others with the same problem in Python.

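The solution is to treat the independent variable as categorical rather than numeric: "Mix" is not a continuous numerical variable but a set of 12 discrete labels. Because the column holds integers, the formula interprets it as a single continuous regressor and fits one slope, which is why the table shows 1 degree of freedom. Wrapping the variable in C() marks it as categorical: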

mod = ols(formula = 'Texture ~ C(Mix)', data = df).fit()

which results in the correct ANOVA table:

            df     sum_sq     mean_sq          F        PR(>F)
C(Mix)    11.0  2080.2875  189.117045  62.397705  6.550053e-15
Residual  24.0    72.7400    3.030833        NaN           NaN
Answered By: Farooq Zahid