pingouin.chi2_independence
- pingouin.chi2_independence(data, x, y, correction=True)
Chi-squared independence tests between two categorical variables.
The test is computed for different values of \(\lambda\): 1, 2/3, 0, -1/2, -1 and -2 (Cressie and Read, 1984).
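Each of these \(\lambda\) values indexes one member of the power divergence family. As a reminder (paraphrasing Cressie and Read, 1984; this formula is not part of the original docstring), for observed cell counts \(O_i\) and expected counts \(E_i\) the statistic is:

```latex
% Power divergence statistic; lambda = 0 and lambda = -1 are defined
% as limits (log-likelihood and modified log-likelihood ratios).
2 I^{\lambda} = \frac{2}{\lambda(\lambda + 1)}
    \sum_{i} O_i \left[ \left( \frac{O_i}{E_i} \right)^{\lambda} - 1 \right]
```

Setting \(\lambda = 1\) recovers the Pearson chi-squared statistic, and \(\lambda = -1/2\) the Freeman-Tukey statistic.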
- Parameters:
- data : pandas.DataFrame
The dataframe containing the occurrences for the test.
- x, y : string
The variable names for the Chi-squared test. Must be names of columns in data.
- correction : bool
Whether to apply Yates’ correction when the degree of freedom of the observed contingency table is 1 (Yates, 1934).
- Returns:
- expected : pandas.DataFrame
The expected contingency table of frequencies.
- observed : pandas.DataFrame
The (corrected or not) observed contingency table of frequencies.
- stats : pandas.DataFrame
The test summary, containing the following columns:
- 'test': the statistic name
- 'lambda': the \(\lambda\) value used for the power divergence statistic
- 'chi2': the test statistic
- 'dof': the degrees of freedom
- 'pval': the p-value of the test
- 'cramer': the Cramer’s V effect size
- 'power': the statistical power of the test
Notes
From Wikipedia:
The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
As application examples, this test can be used to i) evaluate the quality of a categorical variable in a classification problem or to ii) check the similarity between two categorical variables. In the first example, a good categorical predictor and the class column should present high \(\chi^2\) and low p-value. In the second example, similar categorical variables should present low \(\chi^2\) and high p-value.
This function is a wrapper around the scipy.stats.power_divergence() function.
Warning
As a general guideline for the consistency of this test, the observed and the expected contingency tables should not have cells with frequencies lower than 5.
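This guideline can be checked with NumPy alone before running the test. The sketch below is illustrative, not part of pingouin; the helper names are hypothetical:

```python
import numpy as np

def expected_frequencies(observed):
    """Expected counts under independence: outer product of the
    row and column totals divided by the grand total."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    return row_totals * col_totals / observed.sum()

def frequencies_ok(observed, min_count=5):
    """True when every observed and expected cell is at least min_count."""
    expected = expected_frequencies(observed)
    return bool((np.asarray(observed) >= min_count).all()
                and (expected >= min_count).all())

# A 2x2 table whose cells are all comfortably above 5.
table = np.array([[25, 71], [113, 94]])
print(frequencies_ok(table))
```

A table failing this check does not invalidate the test outright, but the chi-squared approximation becomes unreliable.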
References
Cressie, N., & Read, T. R. (1984). Multinomial goodness‐of‐fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440-464.
Yates, F. (1934). Contingency Tables Involving Small Numbers and the \(\chi^2\) Test. Supplement to the Journal of the Royal Statistical Society, 1, 217-235.
Examples
Let’s see if gender is a good categorical predictor for the presence of heart disease.
>>> import pingouin as pg
>>> data = pg.read_dataset('chi2_independence')
>>> data['sex'].value_counts(ascending=True)
sex
0     96
1    207
Name: count, dtype: int64
If gender is not a good predictor for heart disease, we should expect the same 96:207 ratio across the target classes.
>>> expected, observed, stats = pg.chi2_independence(data, x='sex',
...                                                  y='target')
>>> expected
target          0           1
sex
0       43.722772   52.277228
1       94.277228  112.722772
Let’s see what the data tells us.
>>> observed
target      0     1
sex
0        24.5  71.5
1       113.5  93.5
The proportion is lower on the class 0 and higher on the class 1. The tests should be sensitive to this difference.
>>> stats.round(3)
                 test  lambda    chi2  dof  pval  cramer  power
0             pearson   1.000  22.717  1.0   0.0   0.274  0.997
1        cressie-read   0.667  22.931  1.0   0.0   0.275  0.998
2      log-likelihood   0.000  23.557  1.0   0.0   0.279  0.998
3       freeman-tukey  -0.500  24.220  1.0   0.0   0.283  0.998
4  mod-log-likelihood  -1.000  25.071  1.0   0.0   0.288  0.999
5              neyman  -2.000  27.458  1.0   0.0   0.301  0.999
Very low p-values indeed. On this dataset, gender qualifies as a good predictor of the presence of heart disease.
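The 'chi2' and 'cramer' values in the 'pearson' row can be reproduced by hand with SciPy, the library this function wraps. This is a hedged sketch: the raw counts below are inferred from the corrected contingency table above (Yates' correction shifts each cell by 0.5 toward its expected value), and Cramer's V follows the standard formula \(V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}\):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Raw sex x target counts, inferred from the corrected table above
# (rows: sex 0/1, columns: target 0/1).
observed = np.array([[24, 72], [114, 93]])

# Yates-corrected Pearson chi-squared (scipy applies the correction
# by default for 2x2 tables), matching the 'pearson' row.
chi2, pval, dof, expected = chi2_contingency(observed, correction=True)

# Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1))).
n = observed.sum()
k = min(observed.shape) - 1
cramer_v = float(np.sqrt(chi2 / (n * k)))
print(round(chi2, 3), round(cramer_v, 3))  # 22.717 0.274
```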