pingouin.corr

pingouin.corr(x, y, alternative='two-sided', method='pearson', **kwargs)

(Robust) correlation between two variables.

Parameters:
x, y : array_like

First and second set of observations. x and y must be independent.

alternative : string

Defines the alternative hypothesis, or tail of the correlation. Must be one of “two-sided” (default), “greater” or “less”. Both “greater” and “less” return a one-sided p-value. “greater” tests against the alternative hypothesis that the correlation is positive (greater than zero); “less” tests against the alternative hypothesis that the correlation is negative.

method : string

Correlation type:

  • 'pearson': Pearson \(r\) product-moment correlation

  • 'spearman': Spearman \(\rho\) rank-order correlation

  • 'kendall': Kendall’s \(\tau_B\) correlation (for ordinal data)

  • 'bicor': Biweight midcorrelation (robust)

  • 'percbend': Percentage bend correlation (robust)

  • 'shepherd': Shepherd’s pi correlation (robust)

  • 'skipped': Skipped correlation (robust)

**kwargs : optional

Optional argument(s) passed to the lower-level correlation functions.

Returns:
stats : pandas.DataFrame
  • 'n': Sample size (after removal of missing values)

  • 'outliers': Number of outliers (only if a robust method was used)

  • 'r': Correlation coefficient

  • 'CI95%': 95% parametric confidence intervals around \(r\)

  • 'p-val': p-value

  • 'BF10': Bayes Factor of the alternative hypothesis (only for Pearson correlation)

  • 'power': Achieved power of the test with an alpha of 0.05

See also

pairwise_corr

Pairwise correlation between columns of a pandas DataFrame

partial_corr

Partial correlation

rm_corr

Repeated measures correlation

Notes

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Correlations of -1 or +1 imply a perfect negative and positive linear relationship, respectively, with 0 indicating the absence of association.

\[r_{xy} = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_i(x_i - \bar{x})^2} \sqrt{\sum_i(y_i - \bar{y})^2}} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}\]

where \(\text{cov}\) is the sample covariance and \(\sigma\) is the sample standard deviation.
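The formula can be checked numerically against NumPy's built-in estimate; the data below are synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Pearson r computed term by term from the formula above
xm, ym = x - x.mean(), y - y.mean()
r_manual = (xm * ym).sum() / np.sqrt((xm**2).sum() * (ym**2).sum())

# Agrees with NumPy's built-in estimate
r_np = np.corrcoef(x, y)[0, 1]
```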

If method='pearson', the Bayes Factor is calculated using the pingouin.bayesfactor_pearson() function.

The Spearman correlation coefficient is a non-parametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Correlations of -1 or +1 imply an exact negative and positive monotonic relationship, respectively. Mathematically, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.
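The rank-variable definition can be verified directly: computing the Pearson correlation of the rank-transformed data reproduces scipy.stats.spearmanr (which Pingouin relies on). The data here are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=40)
y = x**3 + rng.normal(scale=0.5, size=40)   # monotonic but nonlinear link

# Spearman's rho is the Pearson correlation of the rank-transformed data
rho, _ = stats.spearmanr(x, y)
r_of_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
```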

The Kendall correlation coefficient is a measure of the correspondence between two rankings. Values also range from -1 (perfect disagreement) to 1 (perfect agreement), with 0 indicating the absence of association. Consistent with scipy.stats.kendalltau(), Pingouin returns the Tau-b coefficient, which adjusts for ties:

\[\tau_B = \frac{(P - Q)}{\sqrt{(P + Q + T) (P + Q + U)}}\]

where \(P\) is the number of concordant pairs, \(Q\) the number of discordant pairs, \(T\) the number of ties only in x, and \(U\) the number of ties only in y. A pair tied in both x and y is added to neither \(T\) nor \(U\).
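The Tau-b formula can be sketched by counting the pair types explicitly and comparing against scipy.stats.kendalltau; the toy data below contain ties in both variables:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 2, 3, 4, 5, 5, 6])
y = np.array([2, 1, 3, 3, 5, 4, 6, 6])

P = Q = T = U = 0
n = len(x)
for i in range(n):
    for j in range(i + 1, n):
        dx, dy = x[j] - x[i], y[j] - y[i]
        if dx == 0 and dy == 0:
            continue             # tied in both: counted in neither T nor U
        elif dx == 0:
            T += 1               # tie only in x
        elif dy == 0:
            U += 1               # tie only in y
        elif dx * dy > 0:
            P += 1               # concordant pair
        else:
            Q += 1               # discordant pair

tau_b = (P - Q) / np.sqrt((P + Q + T) * (P + Q + U))
tau_scipy, _ = stats.kendalltau(x, y)   # Tau-b by default
```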

The biweight midcorrelation and percentage bend correlation [1] are both robust methods that protect against univariate outliers by down-weighting observations that deviate too much from the median.
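The down-weighting idea can be illustrated with a minimal biweight midcorrelation sketch. This is not Pingouin's implementation; it follows the usual bicor definition (Tukey biweight on median/MAD-scaled deviations, tuning constant 9) and assumes a non-zero MAD:

```python
import numpy as np

def bicor(x, y, c=9.0):
    """Minimal biweight midcorrelation sketch (assumes non-zero MAD)."""
    def weighted_dev(v):
        med = np.median(v)
        mad = np.median(np.abs(v - med))   # median absolute deviation
        u = (v - med) / (c * mad)
        w = (1.0 - u**2) ** 2              # Tukey biweight
        w[np.abs(u) >= 1] = 0.0            # points beyond c MADs get zero weight
        return (v - med) * w
    a = weighted_dev(np.asarray(x, dtype=float))
    b = weighted_dev(np.asarray(y, dtype=float))
    return (a * b).sum() / np.sqrt((a**2).sum() * (b**2).sum())

rng = np.random.default_rng(0)
s = rng.normal(size=100)
t = 0.8 * s + rng.normal(scale=0.6, size=100)
r_bicor = bicor(s, t)
```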

The Shepherd pi [2] correlation and skipped [3], [4] correlation are both robust methods that return the Spearman correlation coefficient after removing bivariate outliers. Briefly, the Shepherd pi uses bootstrapping of the Mahalanobis distance to identify outliers, while the skipped correlation is based on the minimum covariance determinant (which requires scikit-learn). Note that these two methods are significantly slower than the previous ones.

The confidence intervals for the correlation coefficient are estimated using the Fisher transformation.
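A minimal sketch of the standard Fisher-transform interval (\(z = \text{arctanh}(r)\), approximate standard error \(1/\sqrt{n-3}\), then back-transform); Pingouin's exact implementation may differ in details such as one-sided alternatives. Plugging in the rounded values from the first example below (r = 0.491, n = 30) reproduces its reported interval to two decimals:

```python
import numpy as np

def fisher_ci(r, n, alpha=0.05):
    """Two-sided CI for a correlation via the Fisher z-transform."""
    z = np.arctanh(r)                 # Fisher transformation
    se = 1.0 / np.sqrt(n - 3)         # approximate standard error of z
    zcrit = 1.959963984540054         # two-sided 97.5th standard normal quantile
    lo, hi = z - zcrit * se, z + zcrit * se
    return np.tanh(lo), np.tanh(hi)   # back-transform to the r scale

lo, hi = fisher_ci(0.491, 30)
```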

Important

Rows with missing values (NaN) are automatically removed.
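This listwise deletion can be mimicked with a NumPy mask; the values below are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0, 5.0])
y = np.array([2.0, np.nan, 3.0, 4.0, 6.0])

# Listwise deletion: keep only rows where both variables are observed
mask = ~(np.isnan(x) | np.isnan(y))
n = int(mask.sum())                        # effective sample size ('n' column)
r = np.corrcoef(x[mask], y[mask])[0, 1]    # correlation on the complete rows
```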

References

[1]

Wilcox, R.R., 1994. The percentage bend correlation coefficient. Psychometrika 59, 601–616. https://doi.org/10.1007/BF02294395

[2]

Schwarzkopf, D.S., De Haas, B., Rees, G., 2012. Better ways to improve standards in brain-behavior correlation analysis. Front. Hum. Neurosci. 6, 200. https://doi.org/10.3389/fnhum.2012.00200

[3]

Rousselet, G.A., Pernet, C.R., 2012. Improving standards in brain-behavior correlation analyses. Front. Hum. Neurosci. 6, 119. https://doi.org/10.3389/fnhum.2012.00119

[4]

Pernet, C.R., Wilcox, R., Rousselet, G.A., 2012. Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front. Psychol. 3, 606. https://doi.org/10.3389/fpsyg.2012.00606

Examples

  1. Pearson correlation

>>> import numpy as np
>>> import pingouin as pg
>>> # Generate random correlated samples
>>> np.random.seed(123)
>>> mean, cov = [4, 6], [(1, .5), (.5, 1)]
>>> x, y = np.random.multivariate_normal(mean, cov, 30).T
>>> # Compute Pearson correlation
>>> pg.corr(x, y).round(3)
          n      r         CI95%  p-val  BF10  power
pearson  30  0.491  [0.16, 0.72]  0.006  8.55  0.809
  2. Pearson correlation with two outliers

>>> x[3], y[5] = 12, -8
>>> pg.corr(x, y).round(3)
          n      r          CI95%  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.439  0.302  0.121
  3. Spearman correlation (robust to outliers)

>>> pg.corr(x, y, method="spearman").round(3)
           n      r         CI95%  p-val  power
spearman  30  0.401  [0.05, 0.67]  0.028   0.61
  4. Biweight midcorrelation (robust)

>>> pg.corr(x, y, method="bicor").round(3)
        n      r         CI95%  p-val  power
bicor  30  0.393  [0.04, 0.66]  0.031  0.592
  5. Percentage bend correlation (robust)

>>> pg.corr(x, y, method='percbend').round(3)
           n      r         CI95%  p-val  power
percbend  30  0.389  [0.03, 0.66]  0.034  0.581
  6. Shepherd’s pi correlation (robust)

>>> pg.corr(x, y, method='shepherd').round(3)
           n  outliers      r        CI95%  p-val  power
shepherd  30         2  0.437  [0.08, 0.7]   0.02  0.662
  7. Skipped Spearman correlation (robust)

>>> pg.corr(x, y, method='skipped').round(3)
          n  outliers      r        CI95%  p-val  power
skipped  30         2  0.437  [0.08, 0.7]   0.02  0.662
  8. One-tailed Pearson correlation

>>> pg.corr(x, y, alternative="greater", method='pearson').round(3)
        n      r           CI95%  p-val   BF10  power
pearson  30  0.147  [-0.17, 1.0]   0.22  0.467  0.194
>>> pg.corr(x, y, alternative="less", method='pearson').round(3)
        n        r         CI95%  p-val   BF10  power
pearson  30  0.147  [-1.0, 0.43]   0.78  0.137  0.008
  9. Perfect correlation

>>> pg.corr(x, -x).round(3)
          n    r         CI95%  p-val BF10  power
pearson  30 -1.0  [-1.0, -1.0]    0.0  inf      1
  10. Using columns of a pandas DataFrame

>>> import pandas as pd
>>> data = pd.DataFrame({'x': x, 'y': y})
>>> pg.corr(data['x'], data['y']).round(3)
          n      r          CI95%  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.439  0.302  0.121