pingouin.normality
- pingouin.normality(data, dv=None, group=None, method='shapiro', alpha=0.05)
Univariate normality test.
- Parameters:
  - data : pandas.DataFrame, series, list or 1D np.array
    Iterable. Can be either a single list, 1D numpy array, or a wide- or long-format pandas dataframe.
  - dv : str
    Dependent variable (only when data is a long-format dataframe).
  - group : str
    Grouping variable (only when data is a long-format dataframe).
  - method : str
    Normality test. 'shapiro' (default) performs the Shapiro-Wilk test using scipy.stats.shapiro(), 'normaltest' performs the omnibus test of normality using scipy.stats.normaltest(), 'jarque_bera' performs the Jarque-Bera test using scipy.stats.jarque_bera(). The Omnibus and Jarque-Bera tests are more suitable than the Shapiro test for large samples.
  - alpha : float
    Significance level.
- Returns:
  - stats : pandas.DataFrame
    'W': Test statistic.
    'pval': p-value.
    'normal': True if the null hypothesis of normality is not rejected at the alpha level.
See also
homoscedasticity
Test equality of variance.
sphericity
Mauchly’s test for sphericity.
Notes
The Shapiro-Wilk test calculates a \(W\) statistic that tests whether a random sample \(x_1, x_2, ..., x_n\) comes from a normal distribution.
The \(W\) statistic is calculated as follows:
\[W = \frac{(\sum_{i=1}^n a_i x_{i})^2} {\sum_{i=1}^n (x_i - \overline{x})^2}\]
where the \(x_i\) are the ordered sample values (in ascending order) and the \(a_i\) are constants generated from the means, variances and covariances of the order statistics of a sample of size \(n\) from a standard normal distribution. Specifically:
\[(a_1, ..., a_n) = \frac{m^TV^{-1}}{(m^TV^{-1}V^{-1}m)^{1/2}}\]
where \(m = (m_1, ..., m_n)^T\) is the vector of expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and \(V\) is the covariance matrix of those order statistics.
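To illustrate the definitions of \(a\) and \(W\), the sketch below estimates \(m\) and \(V\) by Monte Carlo simulation for a small sample size and plugs them into the two formulas. This is not how pingouin or SciPy compute the statistic (scipy.stats.shapiro uses an approximation algorithm for the coefficients), so the two values should agree only approximately. The helper name `shapiro_w_monte_carlo`, the sample size and the number of simulations are assumptions chosen for this example.

```python
import numpy as np
from scipy import stats


def shapiro_w_monte_carlo(x, n_sim=200_000, seed=0):
    """Estimate W from its definition (hypothetical helper, illustration only).

    m and V are estimated by simulating standard normal order statistics.
    """
    x = np.sort(np.asarray(x, dtype=float))  # ordered sample values
    n = x.size
    rng = np.random.default_rng(seed)
    # Each row is one simulated sample of size n, sorted in ascending order.
    order_stats = np.sort(rng.standard_normal((n_sim, n)), axis=1)
    m = order_stats.mean(axis=0)             # expected order statistics
    V = np.cov(order_stats, rowvar=False)    # covariance of the order statistics
    v_inv_m = np.linalg.solve(V, m)          # V^{-1} m
    a = v_inv_m / np.sqrt(v_inv_m @ v_inv_m)  # coefficients a_1, ..., a_n
    return (a @ x) ** 2 / np.sum((x - x.mean()) ** 2)


rng = np.random.default_rng(42)
sample = rng.normal(size=8)
w_mc = shapiro_w_monte_carlo(sample)
w_scipy, _ = stats.shapiro(sample)
print(f"Monte Carlo W = {w_mc:.3f}, scipy.stats.shapiro W = {w_scipy:.3f}")
```

The simulation is only practical for small \(n\), since the estimate of \(V\) gets noisier and more expensive as the sample size grows.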
The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
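The decision rule can be sketched with scipy.stats.shapiro, which the description of the method parameter names as the function behind 'shapiro'. The simulated samples and the `decisions` dictionary are assumptions for illustration.

```python
import numpy as np
from scipy import stats

alpha = 0.05
rng = np.random.default_rng(0)
samples = {
    "normal": rng.normal(size=200),
    "exponential": rng.exponential(size=200),  # strongly right-skewed
}

decisions = {}
for name, values in samples.items():
    w, pval = stats.shapiro(values)
    # Reject the null hypothesis of normality when the p-value is below alpha.
    decisions[name] = bool(pval < alpha)
    print(f"{name}: W={w:.3f}, p={pval:.3g}, reject normality: {decisions[name]}")
```

Failing to reject the null hypothesis does not prove normality; it only means the sample gives no evidence against it at the chosen alpha.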
The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes. Indeed, quoting from Wikipedia:
“Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case.”
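One way to carry out the additional investigation suggested in the quote is to compute the points of a normal Q-Q plot with scipy.stats.probplot, which returns the theoretical quantiles, the ordered sample values and a least-squares fit. The sketch below skips the plotting step; the simulated sample and variable names are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=2000)

# osm: theoretical normal quantiles, osr: ordered sample values.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# For normally distributed data the points lie close to a straight line
# whose slope and intercept approximate the scale and location.
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

Passing `plot=matplotlib.pyplot` (or a Matplotlib Axes object) to probplot draws the figure, so that departures from the line can be judged visually instead of relying on the p-value alone.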
Note that missing values are automatically removed (casewise deletion).
References
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.
https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
Examples
Shapiro-Wilk test on a 1D array.
>>> import numpy as np
>>> import pingouin as pg
>>> np.random.seed(123)
>>> x = np.random.normal(size=100)
>>> pg.normality(x)
         W      pval  normal
0  0.98414  0.274886    True
Omnibus test on a wide-format dataframe with missing values
>>> data = pg.read_dataset('mediation')
>>> data.loc[1, 'X'] = np.nan
>>> pg.normality(data, method='normaltest').round(3)
            W   pval  normal
X       1.792  0.408    True
M       0.492  0.782    True
Y       0.349  0.840    True
Mbin  839.716  0.000   False
Ybin  814.468  0.000   False
W1     24.816  0.000   False
W2     43.400  0.000   False
Pandas Series
>>> pg.normality(data['X'], method='normaltest')
          W      pval  normal
X  1.791839  0.408232    True
Long-format dataframe
>>> data = pg.read_dataset('rm_anova2')
>>> pg.normality(data, dv='Performance', group='Time')
             W      pval  normal
Time
Pre   0.967718  0.478773    True
Post  0.940728  0.095157    True
Same but using the Jarque-Bera test
>>> pg.normality(data, dv='Performance', group='Time', method="jarque_bera")
             W      pval  normal
Time
Pre   0.304021  0.858979    True
Post  1.265656  0.531088    True
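Comparing the three tests on the same sample

The three method options correspond to the SciPy functions listed under Parameters. As a rough sketch of how the underlying tests compare on identical data, the snippet below calls those SciPy functions directly instead of going through pingouin. The simulated heavy-tailed sample and the `results` dictionary are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.standard_t(df=3, size=1000)  # heavy-tailed, not normal

tests = {
    "shapiro": stats.shapiro,
    "normaltest": stats.normaltest,
    "jarque_bera": stats.jarque_bera,
}

results = {}
for name, func in tests.items():
    statistic, pval = func(x)
    results[name] = (float(statistic), float(pval))
    print(f"{name}: statistic={statistic:.3f}, p={pval:.3g}")
```

The test statistics are not on a common scale (the Shapiro-Wilk W lies between 0 and 1, while the other two are asymptotically chi-squared), so compare the p-values rather than the statistics. As the examples above show, pingouin reports the statistic in the 'W' column whichever method is selected.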