pingouin.corr
- pingouin.corr(x, y, alternative='two-sided', method='pearson', **kwargs)
(Robust) correlation between two variables.
- Parameters:
- x, y : array_like
First and second set of observations. x and y must be independent.
- alternative : string
Defines the alternative hypothesis, or tail of the correlation. Must be one of “two-sided” (default), “greater” or “less”. Both “greater” and “less” return a one-sided p-value. “greater” tests against the alternative hypothesis that the correlation is positive (greater than zero); “less” tests against the hypothesis that the correlation is negative.
- method : string
Correlation type:
- 'pearson' : Pearson \(r\) product-moment correlation
- 'spearman' : Spearman \(\rho\) rank-order correlation
- 'kendall' : Kendall’s \(\tau_B\) correlation (for ordinal data)
- 'bicor' : Biweight midcorrelation (robust)
- 'percbend' : Percentage bend correlation (robust)
- 'shepherd' : Shepherd’s pi correlation (robust)
- 'skipped' : Skipped correlation (robust)
- **kwargs : optional
Optional argument(s) passed to the lower-level correlation functions.
- Returns:
- stats : pandas.DataFrame
- 'n' : Sample size (after removal of missing values)
- 'outliers' : Number of outliers (only if a robust method was used)
- 'r' : Correlation coefficient
- 'CI95%' : 95% parametric confidence intervals around \(r\)
- 'p-val' : p-value
- 'BF10' : Bayes Factor of the alternative hypothesis (only for Pearson correlation)
- 'power' : Achieved power of the test with an alpha of 0.05
See also
pairwise_corr
Pairwise correlation between columns of a pandas DataFrame
partial_corr
Partial correlation
rm_corr
Repeated measures correlation
Notes
The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Correlations of -1 or +1 imply a perfect negative and positive linear relationship, respectively, with 0 indicating the absence of association.
\[r_{xy} = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_i(x_i - \bar{x})^2} \sqrt{\sum_i(y_i - \bar{y})^2}} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}\]

where \(\text{cov}\) is the sample covariance and \(\sigma\) is the sample standard deviation.
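The formula above can be checked directly against NumPy. A minimal sketch (the helper name `pearson_r` is introduced here for illustration only):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r computed term-by-term from the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()   # deviations from the means
    return np.sum(xm * ym) / np.sqrt(np.sum(xm**2) * np.sum(ym**2))

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)
r = pearson_r(x, y)
# Agrees with NumPy's built-in correlation up to floating-point error
assert np.isclose(r, np.corrcoef(x, y)[0, 1])
```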
If method='pearson', the Bayes Factor is calculated using the pingouin.bayesfactor_pearson() function.

The Spearman correlation coefficient is a non-parametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Correlations of -1 or +1 imply an exact negative and positive monotonic relationship, respectively. Mathematically, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.
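This rank-based definition can be verified directly with SciPy; a sketch assuming a monotonic, non-linear relationship:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=40)
y = np.exp(x) + rng.normal(scale=0.1, size=40)  # monotonic but non-linear

# Spearman's rho = Pearson correlation of the rank-transformed data
rho_manual = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
rho_scipy = stats.spearmanr(x, y)[0]
assert np.isclose(rho_manual, rho_scipy)
```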
The Kendall correlation coefficient is a measure of the correspondence between two rankings. Values also range from -1 (perfect disagreement) to 1 (perfect agreement), with 0 indicating the absence of association. Consistent with scipy.stats.kendalltau(), Pingouin returns the Tau-b coefficient, which adjusts for ties:

\[\tau_B = \frac{(P - Q)}{\sqrt{(P + Q + T) (P + Q + U)}}\]

where \(P\) is the number of concordant pairs, \(Q\) the number of discordant pairs, \(T\) the number of ties in x, and \(U\) the number of ties in y.
The biweight midcorrelation and percentage bend correlation [1] are both robust methods that protect against univariate outliers by down-weighting observations that deviate too much from the median.
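The down-weighting idea can be sketched for the biweight midcorrelation, following the standard definition (median-centered deviations, weights (1 - u²)² that vanish once |u| ≥ 1, with the usual tuning constant 9). This is an illustration, not Pingouin's exact implementation:

```python
import numpy as np

def bicor(x, y, c=9.0):
    """Biweight midcorrelation sketch: observations far from the
    median get small (or zero) weight before correlating."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)

    def weighted_dev(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))       # median absolute deviation
        u = (v - med) / (c * mad)              # c = 9 is the usual tuning constant
        w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
        return (v - med) * w

    a, b = weighted_dev(x), weighted_dev(y)
    return np.sum(a * b) / np.sqrt(np.sum(a**2) * np.sum(b**2))

x = np.arange(1.0, 21.0)
assert np.isclose(bicor(x, x), 1.0)    # perfect positive relationship
assert np.isclose(bicor(x, -x), -1.0)  # perfect negative relationship
```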
The Shepherd pi [2] correlation and skipped [3], [4] correlation are both robust methods that return the Spearman correlation coefficient after removing bivariate outliers. Briefly, the Shepherd pi uses a bootstrap of the Mahalanobis distance to identify outliers, while the skipped correlation is based on the minimum covariance determinant (which requires scikit-learn). Note that these two methods are significantly slower than the previous ones.
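A much-simplified sketch of the bivariate-outlier idea (not Pingouin's implementation: Shepherd's pi bootstraps the distance and the skipped correlation uses the minimum covariance determinant; here a single plain-covariance Mahalanobis cutoff stands in for both):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=100)
xy[0] = [8, -8]  # plant one bivariate outlier

# Flag points whose squared Mahalanobis distance exceeds a chi-square cutoff
center = xy.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(xy.T))
d2 = np.einsum('ij,jk,ik->i', xy - center, cov_inv, xy - center)
keep = d2 < stats.chi2.ppf(0.975, df=2)

# Spearman correlation before and after removing the flagged points
rho_all = stats.spearmanr(xy[:, 0], xy[:, 1])[0]
rho_clean = stats.spearmanr(xy[keep, 0], xy[keep, 1])[0]
```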
The confidence intervals for the correlation coefficient are estimated using the Fisher transformation.
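A sketch of the Fisher-transformation interval (the helper `pearson_ci` is a name introduced here; it reproduces the CI from the first Pearson example below):

```python
import numpy as np
from scipy import stats

def pearson_ci(r, n, alpha=0.05):
    """Two-sided CI for r via the Fisher z-transformation."""
    z = np.arctanh(r)                      # Fisher transform of r
    se = 1.0 / np.sqrt(n - 3)              # approximate standard error of z
    crit = stats.norm.ppf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    lo, hi = z - crit * se, z + crit * se
    return np.tanh(lo), np.tanh(hi)        # back-transform to the r scale

lo, hi = pearson_ci(r=0.491, n=30)
print(np.round([lo, hi], 2))  # → [0.16 0.72]
```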
Important
Rows with missing values (NaN) are automatically removed.
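In effect this is listwise deletion over the pair of inputs; a minimal sketch of the equivalent masking:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 3.0, np.nan, 6.0, 7.0])

# A NaN in either column removes that row pair, so n drops from 6 to 4
mask = ~(np.isnan(x) | np.isnan(y))
x_clean, y_clean = x[mask], y[mask]
```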
References
[1] Wilcox, R.R., 1994. The percentage bend correlation coefficient. Psychometrika 59, 601–616. https://doi.org/10.1007/BF02294395
[2] Schwarzkopf, D.S., De Haas, B., Rees, G., 2012. Better ways to improve standards in brain-behavior correlation analysis. Front. Hum. Neurosci. 6, 200. https://doi.org/10.3389/fnhum.2012.00200
[3] Rousselet, G.A., Pernet, C.R., 2012. Improving standards in brain-behavior correlation analyses. Front. Hum. Neurosci. 6, 119. https://doi.org/10.3389/fnhum.2012.00119
[4] Pernet, C.R., Wilcox, R., Rousselet, G.A., 2012. Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front. Psychol. 3, 606. https://doi.org/10.3389/fpsyg.2012.00606
Examples
Pearson correlation
>>> import numpy as np
>>> import pingouin as pg
>>> # Generate random correlated samples
>>> np.random.seed(123)
>>> mean, cov = [4, 6], [(1, .5), (.5, 1)]
>>> x, y = np.random.multivariate_normal(mean, cov, 30).T
>>> # Compute Pearson correlation
>>> pg.corr(x, y).round(3)
          n      r         CI95%  p-val  BF10  power
pearson  30  0.491  [0.16, 0.72]  0.006  8.55  0.809
Pearson correlation with two outliers
>>> x[3], y[5] = 12, -8
>>> pg.corr(x, y).round(3)
          n      r          CI95%  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.439  0.302  0.121
Spearman correlation (robust to outliers)
>>> pg.corr(x, y, method="spearman").round(3)
           n      r         CI95%  p-val  power
spearman  30  0.401  [0.05, 0.67]  0.028   0.61
Biweight midcorrelation (robust)
>>> pg.corr(x, y, method="bicor").round(3)
        n      r         CI95%  p-val  power
bicor  30  0.393  [0.04, 0.66]  0.031  0.592
Percentage bend correlation (robust)
>>> pg.corr(x, y, method='percbend').round(3)
           n      r         CI95%  p-val  power
percbend  30  0.389  [0.03, 0.66]  0.034  0.581
Shepherd’s pi correlation (robust)
>>> pg.corr(x, y, method='shepherd').round(3)
           n  outliers      r        CI95%  p-val  power
shepherd  30         2  0.437  [0.08, 0.7]   0.02  0.662
Skipped Spearman correlation (robust)
>>> pg.corr(x, y, method='skipped').round(3)
          n  outliers      r        CI95%  p-val  power
skipped  30         2  0.437  [0.08, 0.7]   0.02  0.662
One-tailed Pearson correlation
>>> pg.corr(x, y, alternative="greater", method='pearson').round(3)
          n      r         CI95%  p-val   BF10  power
pearson  30  0.147  [-0.17, 1.0]   0.22  0.467  0.194

>>> pg.corr(x, y, alternative="less", method='pearson').round(3)
          n      r         CI95%  p-val   BF10  power
pearson  30  0.147  [-1.0, 0.43]   0.78  0.137  0.008
Perfect correlation
>>> pg.corr(x, -x).round(3)
          n    r         CI95%  p-val BF10  power
pearson  30 -1.0  [-1.0, -1.0]    0.0  inf      1
Using columns of a pandas dataframe
>>> import pandas as pd
>>> data = pd.DataFrame({'x': x, 'y': y})
>>> pg.corr(data['x'], data['y']).round(3)
          n      r          CI95%  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.439  0.302  0.121