pingouin.normality

pingouin.normality(data, dv=None, group=None, method='shapiro', alpha=0.05)

Univariate normality test.

Parameters:
data : pandas.DataFrame, series, list or 1D np.array

Iterable. Can be a single list, a 1D numpy array, a pandas Series, or a wide- or long-format pandas dataframe.

dv : str

Dependent variable (only when data is a long-format dataframe).

group : str

Grouping variable (only when data is a long-format dataframe).

method : str

Normality test. ‘shapiro’ (default) performs the Shapiro-Wilk test using scipy.stats.shapiro(), ‘normaltest’ performs the omnibus test of normality using scipy.stats.normaltest(), ‘jarque_bera’ performs the Jarque-Bera test using scipy.stats.jarque_bera(). The Omnibus and Jarque-Bera tests are more suitable than the Shapiro test for large samples.

alpha : float

Significance level.

Returns:
stats : pandas.DataFrame
  • 'W': Test statistic.

  • 'pval': p-value.

  • 'normal': True if the null hypothesis of normality is not rejected at the chosen alpha (i.e. the p-value is greater than alpha).

See also

homoscedasticity

Test equality of variance.

sphericity

Mauchly’s test for sphericity.

Notes

The Shapiro-Wilk test calculates a \(W\) statistic that tests whether a random sample \(x_1, x_2, ..., x_n\) comes from a normal distribution.

The \(W\) statistic is calculated as follows:

\[W = \frac{(\sum_{i=1}^n a_i x_{i})^2} {\sum_{i=1}^n (x_i - \overline{x})^2}\]

where the \(x_i\) are the ordered sample values (in ascending order) and the \(a_i\) are constants generated from the means, variances and covariances of the order statistics of a sample of size \(n\) from a standard normal distribution. Specifically:

\[(a_1, ..., a_n) = \frac{m^TV^{-1}}{(m^TV^{-1}V^{-1}m)^{1/2}}\]

where \(m = (m_1, ..., m_n)^T\) is the vector of expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and \(V\) is the covariance matrix of those order statistics.
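The formulas above can be checked numerically. The following is a rough sketch only: it estimates \(m\) and \(V\) by Monte Carlo simulation (an assumption made here for illustration, not how SciPy or pingouin obtain the coefficients), builds \(a\) from them, and compares the resulting \(W\) with the value returned by scipy.stats.shapiro(). The sample size `n` and the number of simulations `n_sims` are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10            # sample size (illustrative choice)
n_sims = 500_000  # number of simulated samples (illustrative choice)

# Simulate sorted standard-normal samples to estimate the order-statistic moments
sims = np.sort(rng.standard_normal(size=(n_sims, n)), axis=1)
m = sims.mean(axis=0)           # estimate of the expected order statistics
V = np.cov(sims, rowvar=False)  # estimate of their covariance matrix

# a = m^T V^{-1} / (m^T V^{-1} V^{-1} m)^{1/2}; V is symmetric, so solve V z = m
Vinv_m = np.linalg.solve(V, m)
a = Vinv_m / np.sqrt(Vinv_m @ Vinv_m)

# W for one sample, using the ordered values in the numerator
x = rng.standard_normal(size=n)
x_sorted = np.sort(x)
W_manual = (a @ x_sorted) ** 2 / np.sum((x - x.mean()) ** 2)

W_scipy, _ = stats.shapiro(x)
print(f"W (Monte Carlo coefficients): {W_manual:.4f}")
print(f"W (scipy.stats.shapiro):      {W_scipy:.4f}")
```

The two values should be close but not identical: the coefficients here carry simulation noise, and SciPy computes its coefficients with an approximation algorithm rather than by simulation.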

The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.

The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes. Indeed, quoting from Wikipedia:

“Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case.”
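One way to follow this advice without plotting is to look at the correlation coefficient of the normal Q–Q fit, which scipy.stats.probplot() returns alongside the quantiles. The sketch below is illustrative only: the Student t distribution with 30 degrees of freedom is an assumed stand-in for a "trivial departure" from normality, and the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Large sample from a distribution that is close to, but not exactly, normal
x = rng.standard_t(df=30, size=4000)

# Formal test: with many observations, small departures may yield a small p-value
w, p = stats.shapiro(x)

# Q-Q fit against the normal distribution; r near 1 means the points lie
# close to a straight line, i.e. the departure is small in practical terms
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

print(f"Shapiro-Wilk: W={w:.4f}, p={p:.4g}")
print(f"Q-Q correlation: r={r:.4f}")
```

Whether the test rejects here depends on the draw. The point is that the p-value and the Q–Q correlation answer different questions, so both are worth inspecting for large samples.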

Note that missing values are automatically removed (casewise deletion).

Examples

  1. Shapiro-Wilk test on a 1D array.

>>> import numpy as np
>>> import pingouin as pg
>>> np.random.seed(123)
>>> x = np.random.normal(size=100)
>>> pg.normality(x)
         W      pval  normal
0  0.98414  0.274886    True

  2. Omnibus test on a wide-format dataframe with missing values

>>> data = pg.read_dataset('mediation')
>>> data.loc[1, 'X'] = np.nan
>>> pg.normality(data, method='normaltest').round(3)
            W   pval  normal
X       1.792  0.408    True
M       0.492  0.782    True
Y       0.349  0.840    True
Mbin  839.716  0.000   False
Ybin  814.468  0.000   False
W1     24.816  0.000   False
W2     43.400  0.000   False

  3. Pandas Series

>>> pg.normality(data['X'], method='normaltest')
          W      pval  normal
X  1.791839  0.408232    True

  4. Long-format dataframe

>>> data = pg.read_dataset('rm_anova2')
>>> pg.normality(data, dv='Performance', group='Time')
             W      pval  normal
Time
Pre   0.967718  0.478773    True
Post  0.940728  0.095157    True

  5. Same but using the Jarque-Bera test

>>> pg.normality(data, dv='Performance', group='Time', method="jarque_bera")
             W      pval  normal
Time
Pre   0.304021  0.858979    True
Post  1.265656  0.531088    True
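
As described for the method parameter, each option corresponds to a SciPy function of the same name. The sketch below calls those three functions directly on one sample and collects the results in a table that mimics the 'W', 'pval' and 'normal' columns. It is an illustration of the mapping only: the `summary` table is built by hand here and does not show any additional processing pingouin may apply.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=200)
alpha = 0.05

rows = {}
for name in ("shapiro", "normaltest", "jarque_bera"):
    # Each SciPy function returns a (statistic, p-value) pair
    stat, pval = getattr(stats, name)(x)
    rows[name] = {
        "W": float(stat),
        "pval": float(pval),
        "normal": bool(pval > alpha),
    }

# One row per test, with column names mirroring the pingouin output
summary = pd.DataFrame.from_dict(rows, orient="index")
print(summary)
```

Note that the statistic is labelled 'W' for all three tests, as in the examples above, even though only the Shapiro-Wilk statistic is conventionally called \(W\).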