In this article, we discuss 3 ways to test for multicollinearity in a multiple regression model built in R.
Linear regression assumes that the independent variables are not strongly multicollinear. If multicollinearity exists, the standard errors of the regression coefficients become inflated, so coefficients might appear statistically insignificant even when the predictors matter. Therefore, it is crucial to check this assumption.
In R, the easiest way to test for multicollinearity among the independent variables is with the Tolerance and Variance Inflation Factors (VIF). The ols_vif_tol() function calculates the Tolerance and VIFs, and makes the detection of multicollinearity straightforward. Alternatively, you can use a Correlation Matrix or the Condition Index.
In this article, we show how to use the ols_vif_tol() function, as well as how to create a Correlation Matrix and how to calculate the Condition Index. We support our discussion with examples and R code that you can use directly in your projects.
What is Multicollinearity?
Multicollinearity is a statistical concept where two or more independent (predictor) variables of a multiple regression model are highly correlated. That is to say, one independent variable is, exactly or approximately, a linear combination of the other independent variables.
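Consider a multiple linear regression model in which the independent variables x1 … xk predict the value of the dependent variable y:
y = β0 + β1·x1 + β2·x2 + … + βk·xk + ε
Multicollinearity exists if one of the independent variables x1 … xk can be expressed, exactly or approximately, as a linear combination of the other independent variables. In other words, for some variable xi:
xi ≈ α0 + α1·x1 + … + αk·xk (with xi itself left out of the right-hand side)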
Why is Multicollinearity a Problem?
Multicollinearity does not influence the precision of the predictions, nor the goodness-of-fit statistics (e.g., R-squared). However, multicollinearity does become problematic when you want to draw conclusions about the independent variables.
As a result of multicollinearity, the coefficients of a multiple regression model cannot be interpreted meaningfully: the coefficient estimates become unstable and their standard errors grow, so the model is not able to clearly separate the effect of each predictor variable. Moreover, the p-values that indicate the statistical significance of the independent variables might not be reliable.
Hence, it is crucial to check for multicollinearity in your regression model.
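To make the effect on the coefficients concrete, here is a minimal simulation sketch in base R. It is not part of the mtcars example: all variable names (x1, x2_indep, x2_coll, and so on) and the simulated coefficients are made up for illustration. The sketch fits the same kind of model twice, once with two unrelated predictors and once with a second predictor that is almost a copy of the first, and compares the standard error of the x1 coefficient.

```r
# Simulated illustration (hypothetical data, base R only)
set.seed(42)
n <- 200

x1       <- rnorm(n)
x2_indep <- rnorm(n)                  # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)  # almost a copy of x1

# Same true coefficients in both settings
y_indep <- 1 + 2 * x1 + 3 * x2_indep + rnorm(n)
y_coll  <- 1 + 2 * x1 + 3 * x2_coll  + rnorm(n)

fit_indep <- lm(y_indep ~ x1 + x2_indep)
fit_coll  <- lm(y_coll  ~ x1 + x2_coll)

# Standard error of the x1 coefficient in each model
se_indep <- coef(summary(fit_indep))["x1", "Std. Error"]
se_coll  <- coef(summary(fit_coll))["x1", "Std. Error"]

# Goodness-of-fit of each model
r2_indep <- summary(fit_indep)$r.squared
r2_coll  <- summary(fit_coll)$r.squared

print(c(se_indep = se_indep, se_coll = se_coll))
print(c(r2_indep = r2_indep, r2_coll = r2_coll))
```

In this setup, the standard error of x1 should be many times larger in the collinear model, while the R-squared stays high in both models. This is the pattern described above: the fit is fine, but the individual coefficients are estimated imprecisely.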
3 Ways to Check for Multicollinearity in R
In this section, we will discuss 3 ways to detect multicollinearity in a multiple regression model built in R.
First, we create a regression model using 5 variables from the well-known mtcars dataset, namely:
- mpg: Number of miles per US gallon.
- cyl: Number of cylinders.
- disp: Displacement in cubic inches.
- hp: Gross horsepower.
- wt: Weight in 1000 pounds.
We use the tidyverse library to select these variables and create a new dataset my_data.
## CREATE A DATASET
library("tidyverse")
my_data <- mtcars %>% select(mpg, cyl, disp, hp, wt)
head(my_data)
The goal of this multiple regression model is to predict the Miles per Gallon (mpg) based on a combination of the other four variables. Therefore, we use the lm() function.
## CREATE A LINEAR REGRESSION MODEL
my_model <- lm(mpg ~ ., data = my_data)
Note: Common sense already suggests multicollinearity in this model because of the variables disp and cyl. Engines with more cylinders (cyl) tend to have a larger displacement (disp). Hence, we expect a high correlation between these variables.
1. Test for Multicollinearity with a Correlation Matrix
The first way to test for multicollinearity in R is by creating a correlation matrix.
A correlation matrix shows the pairwise correlations between multiple continuous variables; a plot of this matrix is called a correlogram. Correlations always range between -1 and +1, where -1 represents perfect negative correlation and +1 perfect positive correlation.
Correlations close to -1 or +1 might indicate the existence of multicollinearity. As a rule of thumb, one might suspect multicollinearity when the correlation between two (predictor) variables is below -0.9 or above +0.9. Note that a correlation matrix only reveals pairwise relationships; multicollinearity that involves three or more variables can go unnoticed.
In R, you can use the cor() function to create a standard correlation matrix. However, the corrplot() function creates a plot of the correlation matrix which is visually appealing and easy to interpret.
## 1. CORRELATION MATRIX
library("corrplot")
corrplot(cor(my_data), method = "number")
The image above shows the correlation matrix of the variables that are included in our regression model. The high correlation between disp and cyl (0.90) might indicate multicollinearity.
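If you prefer a list over a plot, the rule of thumb can also be applied programmatically. The following is a minimal base-R sketch that lists every pair of predictors whose absolute correlation exceeds 0.9. It reads the four predictors directly from mtcars so that it runs on its own; the names predictors, cor_mat, and flagged are chosen for illustration only.

```r
# Correlations among the four predictors only (mpg is the response)
predictors <- mtcars[, c("cyl", "disp", "hp", "wt")]
cor_mat    <- cor(predictors)

# Look at the upper triangle only, so that each pair appears once
idx <- which(upper.tri(cor_mat) & abs(cor_mat) > 0.9, arr.ind = TRUE)

# One row per flagged pair, with its correlation
flagged <- data.frame(
  var1 = rownames(cor_mat)[idx[, "row"]],
  var2 = colnames(cor_mat)[idx[, "col"]],
  r    = cor_mat[idx]
)
print(flagged)
```

For this dataset, only the pair cyl and disp crosses the threshold, which agrees with the correlation matrix discussed above.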
2. Test for Multicollinearity with Variance Inflation Factors (VIF)
The second method to test for multicollinearity in R is by looking at the Tolerance and the Variance Inflation Factor (VIF) values.
The Tolerance measures the proportion of the variance in an independent variable that cannot be accounted for by the other independent variables. So, if the Tolerance is low, then the other independent variables explain most of the variation in the independent variable of interest. Hence, the variables are correlated and therefore multicollinearity might exist.
These are the steps to calculate the Tolerance:
- Create a regression model for the i-th independent variable using the rest of the independent variables (see image above).
- Compute the R² for this regression model.
- Calculate the Tolerance = 1 – R².
As a general guideline, a Tolerance of <0.1 might indicate multicollinearity.
Variance Inflation Factor
The Variance Inflation Factor (VIF) measures how much the variance of an estimated regression coefficient is inflated due to collinearity between its independent variable and the other independent variables. A VIF of 1 means that the variance of the regression coefficient is not inflated by the presence of the other predictors, and hence multicollinearity does not exist for that variable.
As a rule of thumb, a VIF exceeding 5 requires further investigation, whereas VIFs above 10 indicate multicollinearity. Ideally, the Variance Inflation Factors are below 3.
These are the steps to calculate the Variance Inflation Factors:
- Regress the i-th independent variable on the other independent variables (see image above).
- Calculate the R² for this regression model.
- Compute the Variance Inflation Factor = 1 / (1 – R²) = 1 / Tolerance.
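The steps for the Tolerance and the VIF can be carried out by hand with a few lines of base R. The sketch below is only meant to make the calculation transparent: it regresses each predictor on the remaining predictors and derives both measures from the resulting R². It uses mtcars directly so that it runs on its own, and the names tolerance, vif, and manual_vif_tol are illustrative.

```r
predictors <- c("cyl", "disp", "hp", "wt")

# Steps 1-3: auxiliary regression of each predictor on the others,
# then Tolerance = 1 - R^2
tolerance <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux    <- lm(reformulate(others, response = v), data = mtcars)
  1 - summary(aux)$r.squared
})

# VIF is the reciprocal of the Tolerance
vif <- 1 / tolerance

manual_vif_tol <- data.frame(
  Variables = predictors,
  Tolerance = unname(tolerance),
  VIF       = unname(vif)
)
print(manual_vif_tol)
```

Because the VIF is simply 1 / Tolerance, the two columns always lead to the same conclusion; reporting both is a matter of convention.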
Tolerance and VIFs in R
In R, you can use the ols_vif_tol() function from the olsrr package to calculate the Tolerance and the Variance Inflation Factor values. The ols_vif_tol() function requires a fitted multiple regression model and returns a data frame with the variable name, its Tolerance, and its VIF.
## 2. TOLERANCE & VARIANCE INFLATION FACTOR (VIF)
library("olsrr")
ols_vif_tol(my_model)
As the image above shows, the variable disp has a Tolerance < 0.1 and a VIF above 10. Therefore, multicollinearity is highly likely.
3. Test for Multicollinearity with Eigenvalues and Condition Index
The third method to detect multicollinearity in R is by looking at the eigenvalues and the condition index.
The Condition Index (CI) is an alternative to the Variance Inflation Factors (VIF) to check for multicollinearity. The theory behind the Condition Index (and eigenvalues) is based on linear algebra and is too complex to discuss in this article. However, the way to use them is straightforward.
To detect multicollinearity, we use the Condition Number (i.e., the largest Condition Index). As a rule of thumb, a Condition Number between 10 and 30 indicates the presence of multicollinearity. Values above 30 indicate severe multicollinearity.
A high Condition Number in combination with large variance proportions (0.50 or more) for two or more coefficients is a strong indicator of multicollinearity.
In R, you use the ols_eigen_cindex() function from the olsrr library to calculate the Eigen Values and the Condition Index. As its argument, this function only requires a fitted regression model.
## 3. EIGEN VALUES AND CONDITION INDICES
library("olsrr")
ols_eigen_cindex(my_model)
Because of the relatively high Condition Number (26.875) and the large variance proportions of the variables cyl and disp, multicollinearity is very likely.
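For readers who want to see where these numbers come from, the following base-R sketch computes eigenvalues and condition indices by hand. It assumes the common convention of scaling every column of the design matrix (intercept included) to unit length without centering. Whether this matches the olsrr output digit for digit is an assumption, not something verified here. The names fit, X_scaled, and condition_index are illustrative.

```r
# Same model as in the article, fitted directly on mtcars
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Design matrix, including the intercept column
X <- model.matrix(fit)

# Scale every column to unit length (no centering)
X_scaled <- sweep(X, 2, sqrt(colSums(X^2)), "/")

# Singular values of the scaled matrix, sorted from largest to smallest;
# their squares are the eigenvalues of t(X_scaled) %*% X_scaled
d <- svd(X_scaled)$d

eigenvalues      <- d^2
condition_index  <- max(d) / d
condition_number <- max(condition_index)

print(round(cbind(eigenvalues, condition_index), 3))
```

The first condition index is always 1, and the last one is the Condition Number. Small eigenvalues (and therefore large condition indices) point to directions in which the predictors are nearly linearly dependent.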
All in all, based on the 3 tests above, we conclude that multicollinearity exists in our regression model.
How to Deal with Multicollinearity in R?
If you detect multicollinearity in your regression model, then the natural question is how to deal with it.
In general, there are 3 options to handle multicollinearity:
- Do nothing. As multicollinearity only affects the regression coefficients and not the actual predictions, it is not always necessary to do something. If you are only interested in the predictions of your model and not in the significance of the coefficients, you do not have to do anything.
- Remove one or more variables. However, if you are interested in the regression coefficients, multicollinearity is a problem. Therefore, the easiest way to deal with it is by removing the variable(s) that cause this issue.
- Combine variables. Instead of removing variables, you could also combine the variables that cause multicollinearity into a new variable.
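As an illustration of the second option, the sketch below removes disp from the set of predictors and recomputes the VIFs in base R. Dropping disp rather than cyl is an assumption made for this example; in practice, the choice depends on which variable matters most for your analysis. The helper vif_manual() is a hypothetical function written for this sketch, not part of any package.

```r
# Hypothetical helper: VIF of each predictor via auxiliary regressions
vif_manual <- function(predictors, data) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    aux    <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# VIFs with all four predictors, and after removing disp
vif_full    <- vif_manual(c("cyl", "disp", "hp", "wt"), mtcars)
vif_reduced <- vif_manual(c("cyl", "hp", "wt"), mtcars)

print(round(vif_full, 2))
print(round(vif_reduced, 2))
```

Removing a predictor can never increase the VIF of the remaining ones, so the reduced model shows equal or lower values for cyl, hp, and wt. You would then refit the regression with the reduced set of predictors and check whether the coefficients are easier to interpret.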