pingouin.linear_regression
pingouin.linear_regression(X, y, add_intercept=True, weights=None, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, relimp=False)
(Multiple) Linear regression.
- Parameters:
- X : array_like
Predictor(s), of shape (n_samples, n_features) or (n_samples).
- y : array_like
Dependent variable, of shape (n_samples).
- add_intercept : bool
If False, assume that the data are already centered. If True, add a constant term to the model. In this case, the first value in the output dict is the intercept of the model.
Note
It is generally recommended to include a constant term (intercept) in the model to limit the bias and force the residual mean to equal zero. The intercept coefficient and p-values are, however, rarely meaningful.
- weights : array_like
An optional vector of sample weights to be used in the fitting process, of shape (n_samples). Missing or negative weights are not allowed. If not null, a weighted least squares is calculated.
Added in version 0.3.5.
- coef_only : bool
If True, return only the regression coefficients.
- alpha : float
Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)
- as_dataframe : bool
If True, returns a pandas DataFrame. If False, returns a dictionary.
- remove_na : bool
If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.
- relimp : bool
If True, returns the relative importance (= contribution) of predictors. This is irrelevant when the predictors are uncorrelated: the total \(R^2\) of the model is simply the sum of the \(R^2\)-values of each univariate regression. However, this does not apply when predictors are correlated. Instead, the total \(R^2\) of the model is partitioned by averaging over all combinations of predictors, as done in the relaimpo R package (calc.relimp(type="lmg")).
Warning
The computation time roughly doubles for each additional predictor, and this can therefore be extremely slow for models with more than 12-15 predictors.
Added in version 0.3.0.
- Returns:
- stats : pandas.DataFrame or dict
Linear regression summary:
- 'names' : name of variable(s) in the model (e.g. x1, x2…)
- 'coef' : regression coefficients
- 'se' : standard errors
- 'T' : T-values
- 'pval' : p-values
- 'r2' : coefficient of determination (\(R^2\))
- 'adj_r2' : adjusted \(R^2\)
- 'CI[2.5%]' : lower confidence intervals
- 'CI[97.5%]' : upper confidence intervals
- 'relimp' : relative contribution of each predictor to the final \(R^2\) (only if relimp=True)
- 'relimp_perc' : percent relative contribution (only if relimp=True)
In addition, the output dataframe comes with hidden attributes, such as the residuals and the degrees of freedom of the model and of the residuals, which can be accessed as follows, respectively:
>>> lm = pg.linear_regression()
>>> lm.residuals_, lm.df_model_, lm.df_resid_
Note that, to follow scikit-learn convention, these hidden attributes end with an “_”. When as_dataframe=False, however, these attributes are no longer hidden and can be accessed like any other key of the output dictionary:
>>> lm = pg.linear_regression()
>>> lm['residuals'], lm['df_model'], lm['df_resid']
When as_dataframe=False, the dictionary also contains the processed X and y arrays (i.e., with NaNs removed if remove_na=True) and the model's predicted values pred:
>>> lm['X'], lm['y'], lm['pred']
For a weighted least squares fit, the weighted Xw and yw arrays are included in the dictionary:
>>> lm['Xw'], lm['yw']
Notes
The \(\beta\) coefficients are estimated using an ordinary least squares (OLS) regression, as implemented in the scipy.linalg.lstsq() function. The OLS method minimizes the sum of squared residuals and leads to a closed-form expression for the estimated \(\beta\):
\[\hat{\beta} = (X^TX)^{-1} X^Ty\]
It is generally recommended to include a constant term (intercept) in the model to limit the bias and force the residual mean to equal zero. Note that the intercept coefficient and p-values are, however, rarely meaningful.
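As a purely illustrative sketch (not pingouin's internal implementation), the closed-form estimate above can be written out with NumPy, using the tips dataset from the Examples below; the rounded coefficients match the first example:
>>> import numpy as np
>>> import pingouin as pg
>>> df = pg.read_dataset('tips')
>>> X = np.column_stack([np.ones(len(df)), df['total_bill']])  # design matrix with an intercept column
>>> y = df['tip'].to_numpy()
>>> beta = np.linalg.solve(X.T @ X, X.T @ y)  # (X^T X)^{-1} X^T y
>>> np.round(beta, 2)
array([0.92, 0.11])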
The standard error of the estimates is a measure of the accuracy of the prediction defined as:
\[\sigma = \sqrt{\text{MSE} \cdot (X^TX)^{-1}}\]
where \(\text{MSE}\) is the mean squared error,
\[\text{MSE} = \frac{SS_{\text{resid}}}{n - p - 1} = \frac{\sum{(\text{true} - \text{pred})^2}}{n - p - 1}\]
\(p\) is the total number of predictor variables in the model (excluding the intercept) and \(n\) is the sample size.
Using the \(\beta\) coefficients and the standard errors, the T-values can be obtained:
\[T = \frac{\beta}{\sigma}\]
and the p-values approximated using a T-distribution with \(n - p - 1\) degrees of freedom.
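Continuing the illustrative sketch above (again, the formulas written out with NumPy/SciPy, not pingouin's internal code; the variable names are only for demonstration):
>>> from scipy import stats
>>> n, p = X.shape[0], X.shape[1] - 1          # p excludes the intercept
>>> resid = y - X @ beta
>>> mse = np.sum(resid**2) / (n - p - 1)       # mean squared error
>>> se = np.sqrt(np.diag(mse * np.linalg.inv(X.T @ X)))
>>> tvals = beta / se
>>> pvals = 2 * stats.t.sf(np.abs(tvals), df=n - p - 1)
>>> crit = stats.t.ppf(1 - 0.05 / 2, df=n - p - 1)       # alpha = 0.05
>>> ci_lower, ci_upper = beta - crit * se, beta + crit * se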
The coefficient of determination (\(R^2\)) is defined as:
\[R^2 = 1 - (\frac{SS_{\text{resid}}}{SS_{\text{total}}})\]
The adjusted \(R^2\) is defined as:
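Still continuing the same sketch, both quantities follow directly from the residuals; the rounded values match the first example below:
>>> ss_resid = np.sum(resid**2)
>>> ss_total = np.sum((y - y.mean())**2)
>>> r2 = 1 - ss_resid / ss_total
>>> adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
>>> print(round(r2, 2), round(adj_r2, 2))
0.46 0.45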
\[\overline{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]
The relative importance (relimp) column is a partitioning of the total \(R^2\) of the model into individual \(R^2\) contributions. This is calculated by taking the average over average contributions in models of different sizes. For more details, please refer to Groemping et al. 2006 and the R package relaimpo.
Note that Pingouin will automatically remove any duplicate columns from \(X\), as well as any column with only one unique value (constant), excluding the intercept.
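As a rough, illustrative sketch of this averaging-over-orderings idea for two correlated predictors (not pingouin's actual implementation; the helper name _r2 and the variable names are made up here), each predictor's contribution is the average of its sequential \(R^2\) increase over the two possible orders of entry, using the tips data from the Examples below:
>>> import numpy as np
>>> import pingouin as pg
>>> from itertools import permutations
>>> df = pg.read_dataset('tips')
>>> X = df[['total_bill', 'size']].to_numpy()
>>> y = df['tip'].to_numpy()
>>> def _r2(X, y):
...     # R-squared of an OLS fit with an intercept
...     Xc = np.column_stack([np.ones(len(y)), X])
...     b = np.linalg.lstsq(Xc, y, rcond=None)[0]
...     resid = y - Xc @ b
...     return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)
>>> relimp = np.zeros(2)
>>> for order in permutations(range(2)):
...     prev = 0.0
...     for i in range(2):
...         cur = _r2(X[:, list(order[:i + 1])], y)
...         relimp[order[i]] += (cur - prev) / 2  # average over the 2 orderings
...         prev = cur
By construction, relimp sums to the \(R^2\) of the full two-predictor model.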
Results have been compared against sklearn, R, statsmodels and JASP.
Examples
Simple linear regression using columns of a pandas dataframe
In this first example, we’ll use the tips dataset to see how well we can predict the waiter’s tip (in dollars) based on the total bill (also in dollars).
>>> import numpy as np
>>> import pingouin as pg
>>> df = pg.read_dataset('tips')
>>> # Let's predict the tip ($) based on the total bill (also in $)
>>> lm = pg.linear_regression(df['total_bill'], df['tip'])
>>> lm.round(2)
        names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0   Intercept  0.92  0.16   5.76   0.0  0.46    0.45      0.61       1.23
1  total_bill  0.11  0.01  14.26   0.0  0.46    0.45      0.09       0.12
It comes as no surprise that total bill is indeed a significant predictor of the waiter's tip (T=14.26, p<0.05). The \(R^2\) of the model is 0.46 and the adjusted \(R^2\) is 0.45, which means that our model explains roughly 45% of the total variance in the tip amount.
Multiple linear regression
We can also have more than one predictor and run a multiple linear regression. Below, we add the party size as a second predictor of tip.
>>> # We'll add a second predictor: the party size
>>> lm = pg.linear_regression(df[['total_bill', 'size']], df['tip'])
>>> lm.round(2)
        names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0   Intercept  0.67  0.19   3.46  0.00  0.47    0.46      0.29       1.05
1  total_bill  0.09  0.01  10.17  0.00  0.47    0.46      0.07       0.11
2        size  0.19  0.09   2.26  0.02  0.47    0.46      0.02       0.36
The party size is also a significant predictor of tip (T=2.26, p=0.02). Note that adding this new predictor however only improved the \(R^2\) of our model by ~1%.
This function also works with numpy arrays:
>>> X = df[['total_bill', 'size']].to_numpy()
>>> y = df['tip'].to_numpy()
>>> pg.linear_regression(X, y).round(2)
       names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  0.67  0.19   3.46  0.00  0.47    0.46      0.29       1.05
1         x1  0.09  0.01  10.17  0.00  0.47    0.46      0.07       0.11
2         x2  0.19  0.09   2.26  0.02  0.47    0.46      0.02       0.36
Get the residuals
>>> # For clarity, only display the first 9 values
>>> np.round(lm.residuals_, 2)[:9]
array([-1.62, -0.55,  0.31,  0.06, -0.11,  0.93,  0.13, -0.81, -0.49])
Using pandas, we can show a summary of the distribution of the residuals:
>>> import pandas as pd
>>> pd.Series(lm.residuals_).describe().round(2)
count    244.00
mean      -0.00
std        1.01
min       -2.93
25%       -0.55
50%       -0.09
75%        0.51
max        4.04
dtype: float64
No intercept and return only the regression coefficients
Sometimes it may be useful to remove the constant term from the regression, or to only return the regression coefficients without calculating the standard errors or p-values. The latter can potentially save you a lot of time if you need to calculate hundreds of regressions and only care about the coefficients!
>>> pg.linear_regression(X, y, add_intercept=False, coef_only=True)
array([0.1007119 , 0.36209717])
Return a dictionary instead of a dataframe
>>> lm_dict = pg.linear_regression(X, y, as_dataframe=False)
>>> lm_dict.keys()
dict_keys(['names', 'coef', 'se', 'T', 'pval', 'r2', 'adj_r2', 'CI[2.5%]', 'CI[97.5%]', 'df_model', 'df_resid', 'residuals', 'X', 'y', 'pred'])
Remove missing values
>>> X[4, 1] = np.nan
>>> y[7] = np.nan
>>> pg.linear_regression(X, y, remove_na=True, coef_only=True)
array([0.65749955, 0.09262059, 0.19927529])
Get the relative importance of predictors
>>> lm = pg.linear_regression(X, y, remove_na=True, relimp=True)
>>> lm[['names', 'relimp', 'relimp_perc']]
       names    relimp  relimp_perc
0  Intercept       NaN          NaN
1         x1  0.342503    73.045583
2         x2  0.126386    26.954417
The relimp column is a partitioning of the total \(R^2\) of the model into individual contributions. Therefore, it sums to the \(R^2\) of the full model. The relimp_perc column is normalized to sum to 100%. See Groemping 2006 for more details.
>>> lm[['relimp', 'relimp_perc']].sum()
relimp           0.468889
relimp_perc    100.000000
dtype: float64
Weighted linear regression
>>> X = [1, 2, 3, 4, 5, 6]
>>> y = [10, 22, 11, 13, 13, 16]
>>> w = [1, 0.1, 1, 1, 0.5, 1]  # Array of weights. Must be >= 0.
>>> lm = pg.linear_regression(X, y, weights=w)
>>> lm.round(2)
       names  coef    se     T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  9.00  2.03  4.42  0.01  0.51    0.39      3.35      14.64
1         x1  1.04  0.50  2.06  0.11  0.51    0.39     -0.36       2.44
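For reference, the same coefficients can be recovered by hand with the standard weighted least squares identity: rescale the design matrix and the response by the square root of the weights and run an ordinary least squares fit on the rescaled arrays. This mirrors the Xw and yw arrays mentioned in the Returns section; it is a sketch of the math, not necessarily pingouin's exact internals, and the variable names below are illustrative only:
>>> import numpy as np
>>> Xc = np.column_stack([np.ones(6), X])      # design matrix with an intercept column
>>> sw = np.sqrt(np.asarray(w, dtype=float))
>>> Xw, yw = Xc * sw[:, None], np.asarray(y, dtype=float) * sw  # rescale by sqrt of the weights
>>> np.round(np.linalg.lstsq(Xw, yw, rcond=None)[0], 2)
array([9.  , 1.04])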