pingouin.anova
- pingouin.anova(data=None, dv=None, between=None, ss_type=2, detailed=False, effsize='np2')
One-way and N-way ANOVA.
- Parameters:
- data
pandas.DataFrame
DataFrame. Note that this function can also directly be used as a Pandas method, in which case this argument is no longer needed.
- dv
string
Name of column in data containing the dependent variable.
- between
string or list with N elements
Name of column(s) in data containing the between-subject factor(s). If between is a single string, a one-way ANOVA is computed. If between is a list with two or more elements, an N-way ANOVA is performed. Note that Pingouin internally calls statsmodels to compute ANOVAs with three or more factors, as well as unbalanced two-way ANOVAs.
- ss_type
int
Specify how the sums of squares are calculated for unbalanced designs with 2 or more factors. Can be 1, 2 (default), or 3. This has no impact on one-way designs or on N-way ANOVA with balanced data.
- detailed
boolean
If True, return a detailed ANOVA table (default True for N-way ANOVA).
- effsize
str
Effect size. Must be 'np2' (partial eta-squared) or 'n2' (eta-squared). Note that for one-way ANOVA partial eta-squared is the same as eta-squared.
- Returns:
- aov
pandas.DataFrame
ANOVA summary:
'Source': Factor names
'SS': Sums of squares
'DF': Degrees of freedom
'MS': Mean squares
'F': F-values
'p-unc': Uncorrected p-values
'np2': Partial eta-squared effect sizes
See also
rm_anova
One-way and two-way repeated measures ANOVA
mixed_anova
Two way mixed ANOVA
welch_anova
One-way Welch ANOVA
kruskal
Non-parametric one-way ANOVA
Notes
The classic ANOVA is very powerful when the groups are normally distributed and have equal variances. However, when the groups have unequal variances, it is best to use the Welch ANOVA (pingouin.welch_anova()), which better controls the type I error rate (Liu 2015). The homogeneity of variances can be assessed with the pingouin.homoscedasticity() function.

The main idea of ANOVA is to partition the variance (sums of squares) into several components. For example, in one-way ANOVA:
\[ \begin{aligned} SS_{\text{total}} &= SS_{\text{effect}} + SS_{\text{error}} \\ SS_{\text{total}} &= \sum_i \sum_j (Y_{ij} - \overline{Y})^2 \\ SS_{\text{effect}} &= \sum_i n_i (\overline{Y}_i - \overline{Y})^2 \\ SS_{\text{error}} &= \sum_i \sum_j (Y_{ij} - \overline{Y}_i)^2 \end{aligned} \]

where \(i=1,...,r; j=1,...,n_i\), \(r\) is the number of groups, and \(n_i\) the number of observations for the \(i\)th group.
The F-statistic is then defined as:

\[F^* = \frac{MS_{\text{effect}}}{MS_{\text{error}}} = \frac{SS_{\text{effect}} / (r - 1)}{SS_{\text{error}} / (n_t - r)}\]

where \(n_t = \sum_i n_i\) is the total number of observations. The p-value can then be calculated using an F-distribution with \(r-1, n_t-r\) degrees of freedom.
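The partition and the F-ratio above can be reproduced by hand. The following is a minimal pure-Python sketch, not Pingouin's implementation: the helper name one_way_anova and the toy data are hypothetical and chosen only for illustration.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom and F for a one-way design.

    `groups` is a list of lists, one inner list of observations per group.
    Hypothetical helper for illustration only.
    """
    all_values = [y for g in groups for y in g]
    n_t = len(all_values)                # total number of observations
    r = len(groups)                      # number of groups
    grand_mean = sum(all_values) / n_t
    group_means = [sum(g) / len(g) for g in groups]

    # SS_effect: squared distance of each group mean from the grand mean,
    # weighted by the group size n_i
    ss_effect = sum(len(g) * (m - grand_mean) ** 2
                    for g, m in zip(groups, group_means))
    # SS_error: squared deviations of observations from their own group mean
    ss_error = sum((y - m) ** 2
                   for g, m in zip(groups, group_means) for y in g)
    # SS_total: squared deviations of observations from the grand mean
    ss_total = sum((y - grand_mean) ** 2 for y in all_values)

    df_effect, df_error = r - 1, n_t - r
    f_value = (ss_effect / df_effect) / (ss_error / df_error)
    return {"ss_effect": ss_effect, "ss_error": ss_error,
            "ss_total": ss_total, "df": (df_effect, df_error), "F": f_value}


# Made-up toy data: three groups of three observations each
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
result = one_way_anova(groups)
# Here SS_effect = 26, SS_error = 6, SS_total = 32 and F = 13
# (up to floating-point rounding), with (2, 6) degrees of freedom.
print(result)
```

The sketch stops at the F-value; turning it into a p-value requires the survival function of the F-distribution with the two degrees of freedom returned in "df".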
When the groups are balanced and have equal variances, the optimal post-hoc test is the Tukey-HSD test (pingouin.pairwise_tukey()). If the groups have unequal variances, the Games-Howell test is more appropriate (pingouin.pairwise_gameshowell()).

The default effect size reported in Pingouin is the partial eta-squared, which, for one-way ANOVA, is the same as eta-squared and generalized eta-squared.

\[\eta_p^2 = \frac{SS_{\text{effect}}}{SS_{\text{effect}} + SS_{\text{error}}}\]

Missing values are automatically removed. Results have been tested against R, Matlab and JASP.
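As a quick sanity check of the effect size and F-ratio formulas, the sketch below plugs in the sums of squares and degrees of freedom reported in the one-way example of the Examples section. Only plain arithmetic is used; the variable names are illustrative.

```python
# Values copied from the one-way ANOVA example output on this page
ss_effect = 1360.726   # 'Hair color' row, SS column
ss_error = 1001.800    # 'Within' row, SS column
df_effect, df_error = 3, 15

# Partial eta-squared: share of (effect + error) variance due to the effect
np2 = ss_effect / (ss_effect + ss_error)

# F-ratio: mean square of the effect over mean square of the error
f_value = (ss_effect / df_effect) / (ss_error / df_error)

print(round(np2, 3), round(f_value, 3))  # 0.576 6.791
```

Both values agree with the np2 and F columns of that example table.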
Examples
One-way ANOVA
>>> import pingouin as pg
>>> df = pg.read_dataset('anova')
>>> aov = pg.anova(dv='Pain threshold', between='Hair color', data=df,
...                detailed=True)
>>> aov.round(3)
       Source        SS  DF       MS      F  p-unc    np2
0  Hair color  1360.726   3  453.575  6.791  0.004  0.576
1      Within  1001.800  15   66.787    NaN    NaN    NaN
Same, but using a standard eta-squared instead of a partial eta-squared effect size. Also note how here we're using the anova function directly as a method (= built-in function) of our pandas DataFrame. In that case, we don't have to specify data anymore.

>>> df.anova(dv='Pain threshold', between='Hair color', detailed=False,
...          effsize='n2')
       Source  ddof1  ddof2         F     p-unc        n2
0  Hair color      3     15  6.791407  0.004114  0.575962
Two-way ANOVA with balanced design
>>> data = pg.read_dataset('anova2')
>>> data.anova(dv="Yield", between=["Blend", "Crop"]).round(3)
         Source        SS  DF        MS      F  p-unc    np2
0         Blend     2.042   1     2.042  0.004  0.952  0.000
1          Crop  2736.583   2  1368.292  2.525  0.108  0.219
2  Blend * Crop  2360.083   2  1180.042  2.178  0.142  0.195
3      Residual  9753.250  18   541.847    NaN    NaN    NaN
Two-way ANOVA with unbalanced design (requires statsmodels)
>>> data = pg.read_dataset('anova2_unbalanced')
>>> data.anova(dv="Scores", between=["Diet", "Exercise"],
...            effsize="n2").round(3)
            Source       SS   DF       MS      F  p-unc     n2
0             Diet  390.625  1.0  390.625  7.423  0.034  0.433
1         Exercise  180.625  1.0  180.625  3.432  0.113  0.200
2  Diet * Exercise   15.625  1.0   15.625  0.297  0.605  0.017
3         Residual  315.750  6.0   52.625    NaN    NaN    NaN
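Whether ss_type matters depends on whether the design is balanced, that is, whether every combination of factor levels has the same number of observations. The sketch below shows one plausible way to check this in plain Python. It is an assumption-laden illustration, not how Pingouin decides internally: the helper is_balanced, the row format (a list of dicts) and the factor levels are all hypothetical.

```python
from collections import Counter
from itertools import product


def is_balanced(rows, factors):
    """Return True if every combination of factor levels has the same,
    non-zero number of rows. Hypothetical helper for illustration only."""
    # Count observations in each cell that actually occurs
    cell_counts = Counter(tuple(row[f] for f in factors) for row in rows)
    # Enumerate every cell of the full factorial design, including empty ones
    levels = [sorted({row[f] for row in rows}) for f in factors]
    counts = {cell_counts[cell] for cell in product(*levels)}
    return len(counts) == 1 and 0 not in counts


# Made-up 2 x 2 design with two observations per cell
balanced = [{"Diet": d, "Exercise": e}
            for d in ("No", "Yes") for e in ("No", "Yes") for _ in range(2)]
# Dropping one observation leaves one cell with a single row
unbalanced = balanced[:-1]

print(is_balanced(balanced, ["Diet", "Exercise"]))    # True
print(is_balanced(unbalanced, ["Diet", "Exercise"]))  # False
```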
Three-way ANOVA, type 3 sums of squares (requires statsmodels)
>>> data = pg.read_dataset('anova3')
>>> data.anova(dv='Cholesterol', between=['Sex', 'Risk', 'Drug'],
...            ss_type=3).round(3)
              Source      SS    DF      MS       F  p-unc    np2
0                Sex   2.075   1.0   2.075   2.462  0.123  0.049
1               Risk  11.332   1.0  11.332  13.449  0.001  0.219
2               Drug   0.816   2.0   0.408   0.484  0.619  0.020
3         Sex * Risk   0.117   1.0   0.117   0.139  0.711  0.003
4         Sex * Drug   2.564   2.0   1.282   1.522  0.229  0.060
5        Risk * Drug   2.438   2.0   1.219   1.446  0.245  0.057
6  Sex * Risk * Drug   1.844   2.0   0.922   1.094  0.343  0.044
7           Residual  40.445  48.0   0.843     NaN    NaN    NaN