pingouin.anova

pingouin.anova(data=None, dv=None, between=None, ss_type=2, detailed=False, effsize='np2')

One-way and N-way ANOVA.

Parameters:
data : pandas.DataFrame

Input DataFrame. This function can also be called directly as a pandas DataFrame method, in which case this argument is not needed.

dv : string

Name of column in data containing the dependent variable.

between : string or list with N elements

Name of column(s) in data containing the between-subject factor(s). If between is a single string, a one-way ANOVA is computed. If between is a list with two or more elements, an N-way ANOVA is performed. Note that Pingouin internally calls statsmodels to compute ANOVA with three or more factors, or an unbalanced two-way ANOVA.

ss_type : int

Specify how the sums of squares are calculated for unbalanced designs with 2 or more factors. Can be 1, 2 (default), or 3. This has no impact on one-way designs or on N-way ANOVA with balanced data.

detailed : boolean

If True, return a detailed ANOVA table (default True for N-way ANOVA).

effsize : str

Effect size. Must be ‘np2’ (partial eta-squared) or ‘n2’ (eta-squared). Note that for one-way ANOVA partial eta-squared is the same as eta-squared.

Returns:
aov : pandas.DataFrame

ANOVA summary:

  • 'Source': Factor names

  • 'SS': Sums of squares

  • 'DF': Degrees of freedom

  • 'MS': Mean squares

  • 'F': F-values

  • 'p-unc': uncorrected p-values

  • 'np2': Partial eta-squared effect sizes

See also

rm_anova

One-way and two-way repeated measures ANOVA

mixed_anova

Two-way mixed ANOVA

welch_anova

One-way Welch ANOVA

kruskal

Non-parametric one-way ANOVA

Notes

The classic ANOVA is very powerful when the groups are normally distributed and have equal variances. However, when the groups have unequal variances, it is best to use the Welch ANOVA (pingouin.welch_anova()), which better controls the type I error rate (Liu 2015). The homogeneity of variances can be tested with the pingouin.homoscedasticity() function.

The main idea of ANOVA is to partition the variance (sums of squares) into several components. For example, in one-way ANOVA:

\[ \begin{align}\begin{aligned}SS_{\text{total}} = SS_{\text{effect}} + SS_{\text{error}}\\SS_{\text{total}} = \sum_i \sum_j (Y_{ij} - \overline{Y})^2\\SS_{\text{effect}} = \sum_i n_i (\overline{Y}_i - \overline{Y})^2\\SS_{\text{error}} = \sum_i \sum_j (Y_{ij} - \overline{Y}_i)^2\end{aligned}\end{align} \]

where \(i=1,...,r; j=1,...,n_i\), \(r\) is the number of groups, and \(n_i\) the number of observations for the \(i\)-th group.

The F-statistic is then defined as:

\[F^* = \frac{MS_{\text{effect}}}{MS_{\text{error}}} = \frac{SS_{\text{effect}} / (r - 1)}{SS_{\text{error}} / (n_t - r)}\]

and the p-value can be calculated using an F-distribution with \(r-1, n_t-r\) degrees of freedom, where \(n_t = \sum_i n_i\) is the total number of observations.
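The partition of the sums of squares and the F-statistic can be checked by hand. The following is a minimal sketch in plain Python, not Pingouin's implementation; the toy data and all variable names are made up for illustration.

```python
from statistics import mean

# Hypothetical toy data: three groups of unequal size (not a Pingouin dataset).
groups = {
    "a": [4.0, 5.0, 6.0],
    "b": [7.0, 8.0, 9.0, 10.0],
    "c": [1.0, 2.0, 3.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = mean(all_values)
r = len(groups)        # number of groups
n_t = len(all_values)  # total number of observations

# Sums of squares, following the one-way definitions above.
ss_total = sum((y - grand_mean) ** 2 for y in all_values)
ss_effect = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
ss_error = sum((y - mean(ys)) ** 2 for ys in groups.values() for y in ys)

# Degrees of freedom and F-statistic.
df_effect = r - 1
df_error = n_t - r
f_stat = (ss_effect / df_effect) / (ss_error / df_error)

print(ss_total, ss_effect + ss_error)  # the two values should agree
print(df_effect, df_error, round(f_stat, 3))
```

If SciPy is available, the p-value could then be obtained from the survival function of the F-distribution, for example scipy.stats.f.sf(f_stat, df_effect, df_error).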

When the groups are balanced and have equal variances, the optimal post-hoc test is the Tukey-HSD test (pingouin.pairwise_tukey()). If the groups have unequal variances, the Games-Howell test is more appropriate (pingouin.pairwise_gameshowell()).

The default effect size reported in Pingouin is the partial eta-squared, which for one-way ANOVA is the same as eta-squared and generalized eta-squared.

\[\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}\]
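As a quick check of this formula, the effect size and F-value of the one-way example in the Examples section can be recomputed from the rounded sums of squares shown there. This is plain arithmetic on the printed table values, not a call into Pingouin.

```python
# Values copied from the detailed one-way ANOVA table in the Examples section.
ss_effect, df_effect = 1360.726, 3   # 'Hair color' row
ss_error, df_error = 1001.800, 15    # 'Within' row

# Partial eta-squared; for one-way ANOVA this equals eta-squared,
# because SS_total = SS_effect + SS_error.
np2 = ss_effect / (ss_effect + ss_error)

# F-statistic as the ratio of mean squares.
f_stat = (ss_effect / df_effect) / (ss_error / df_error)

print(round(np2, 3))     # 0.576
print(round(f_stat, 3))  # 6.791
```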

Missing values are automatically removed. Results have been tested against R, Matlab and JASP.

Examples

One-way ANOVA

>>> import pingouin as pg
>>> df = pg.read_dataset('anova')
>>> aov = pg.anova(dv='Pain threshold', between='Hair color', data=df,
...                detailed=True)
>>> aov.round(3)
       Source        SS  DF       MS      F  p-unc    np2
0  Hair color  1360.726   3  453.575  6.791  0.004  0.576
1      Within  1001.800  15   66.787    NaN    NaN    NaN

Same analysis, but reporting eta-squared instead of partial eta-squared as the effect size. Note that here the anova function is called directly as a method of the pandas DataFrame, so data no longer needs to be specified.

>>> df.anova(dv='Pain threshold', between='Hair color', detailed=False,
...          effsize='n2')
       Source  ddof1  ddof2         F     p-unc        n2
0  Hair color      3     15  6.791407  0.004114  0.575962

Two-way ANOVA with balanced design

>>> data = pg.read_dataset('anova2')
>>> data.anova(dv="Yield", between=["Blend", "Crop"]).round(3)
         Source        SS  DF        MS      F  p-unc    np2
0         Blend     2.042   1     2.042  0.004  0.952  0.000
1          Crop  2736.583   2  1368.292  2.525  0.108  0.219
2  Blend * Crop  2360.083   2  1180.042  2.178  0.142  0.195
3      Residual  9753.250  18   541.847    NaN    NaN    NaN

Two-way ANOVA with unbalanced design (requires statsmodels)

>>> data = pg.read_dataset('anova2_unbalanced')
>>> data.anova(dv="Scores", between=["Diet", "Exercise"],
...            effsize="n2").round(3)
            Source       SS   DF       MS      F  p-unc     n2
0             Diet  390.625  1.0  390.625  7.423  0.034  0.433
1         Exercise  180.625  1.0  180.625  3.432  0.113  0.200
2  Diet * Exercise   15.625  1.0   15.625  0.297  0.605  0.017
3         Residual  315.750  6.0   52.625    NaN    NaN    NaN

Three-way ANOVA, type 3 sums of squares (requires statsmodels)

>>> data = pg.read_dataset('anova3')
>>> data.anova(dv='Cholesterol', between=['Sex', 'Risk', 'Drug'],
...            ss_type=3).round(3)
              Source      SS    DF      MS       F  p-unc    np2
0                Sex   2.075   1.0   2.075   2.462  0.123  0.049
1               Risk  11.332   1.0  11.332  13.449  0.001  0.219
2               Drug   0.816   2.0   0.408   0.484  0.619  0.020
3         Sex * Risk   0.117   1.0   0.117   0.139  0.711  0.003
4         Sex * Drug   2.564   2.0   1.282   1.522  0.229  0.060
5        Risk * Drug   2.438   2.0   1.219   1.446  0.245  0.057
6  Sex * Risk * Drug   1.844   2.0   0.922   1.094  0.343  0.044
7           Residual  40.445  48.0   0.843     NaN    NaN    NaN