5 Ways to Check the Normality of Residuals in R [Examples]

In this article, we discuss 5 ways to check the normality of residuals in R.

The normality of the residuals is one of the main assumptions of a linear regression model. If the residuals are not normally distributed, then model inference (e.g., hypothesis tests, confidence intervals, and prediction intervals) might be invalid. Therefore, it is crucial to check this assumption.

In R, the best way to check the normality of the regression residuals is with a statistical test, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test. Alternatively, you can use the “Residuals vs. Fitted”-plot, a Q-Q plot, a histogram, or a boxplot.

In this article, we use basic R code and functions from the “olsrr” package to perform the checks mentioned above.

The “olsrr” package provides tools to quickly build regression models in R, as well as tests to check their assumptions, such as the normality of the residuals. It also contains functions to test for multicollinearity and heteroscedasticity.

You install and load the “olsrr” package with the following commands.

# Install and load the "olsrr" package
install.packages("olsrr")
library("olsrr")

5 Ways to Check that Regression Residuals are Normally Distributed in R

Before discussing the 5 methods to check the normality of the residuals, we first create a linear regression model. For this purpose, we use the auto dataset from the “olsrr” package.

When you load the “olsrr” package, the auto data set is loaded automatically as well. This data set contains 74 rows and 11 columns with information about cars. You can see the first 6 rows of the auto data set with the head() function.

# View the "auto" dataset
head(auto)
An R dataset

We will create a simple linear regression model that tries to predict the miles per gallon (mpg) of a car given its weight. You can use the lm() function to create such a regression model.

# Create a Linear Regression Model
my_model <- lm(mpg ~ weight, data = auto)
summary(my_model)
A simple linear regression model in R

As the image shows, both the intercept and the variable weight are significant in this model. Moreover, the miles per gallon decrease when a car’s weight increases (note the negative regression coefficient of -0.006).
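
Because this coefficient is reported per unit of weight, you can translate it to a more intuitive scale by extracting it from the fitted model. A minimal sketch, assuming that weight is recorded in pounds (as in the classic auto data):

# Extract the fitted coefficients
coef(my_model)

# Approximate change in mpg for an extra 1,000 lbs of weight
# (assumes weight is measured in pounds)
coef(my_model)["weight"] * 1000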

Now, we continue by checking whether the residuals (i.e., the difference between the observed value and the predicted value) of our regression model are normally distributed.
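
As a quick sanity check, you can verify this definition directly on the fitted model; a minimal sketch:

# Residuals are the observed values minus the fitted (predicted) values
head(residuals(my_model))
head(auto$mpg - fitted(my_model))   # should give the same numbers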

1. Check the Normality of Residuals with the “Residuals vs. Fitted”-Plot

The first method to check the normality assumption is by creating a “Residuals vs. Fitted”-plot.

A “Residuals vs. Fitted”-plot is a scatter plot of the residuals on the y-axis and the fitted (i.e., predicted) value on the x-axis. For the normality assumption to hold, the residuals should spread randomly around 0 and form a horizontal band.

You can use the basic plot() function to create the “Residuals vs. Fitted”-plot. It requires a fitted regression model (or lm-object) as its argument. Because plot() produces several diagnostic plots for an lm-object, the which = 1 argument selects the “Residuals vs. Fitted”-plot. See the example below.

# 1. Check the Residuals vs Fitted Plot I
plot(my_model, which = 1)   # which = 1 selects the Residuals vs. Fitted plot
Residuals vs. Fitted plot to check the normality of the residuals in R

An advantage of the plot generated by the plot() function is that it shows both a dashed horizontal line at 0 and a red trend line. If the red trend line is approximately flat and close to zero, then one can assume that the residuals are normally distributed.

Alternatively, you can use the ols_plot_resid_fit() function from the “olsrr” package. This function also creates a “Residuals vs. Fitted”-plot, but it doesn’t show a trend line. It only shows a horizontal line at zero.

# 1. Check the Residuals vs Fitted Plot II
ols_plot_resid_fit(my_model)
Check the normality assumption of the residuals in R with a "Residuals vs. Fitted"-plot

Because the residuals are close to zero and the trend line is relatively flat (first plot), we can assume that the residuals follow a normal distribution.

Note: You can also use the “Residual vs. Fitted”-plot to check for heteroscedasticity among the residuals.
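
For reference, you can also build the “Residuals vs. Fitted”-plot yourself from the fitted values and residuals; a minimal base-R sketch:

# Manual "Residuals vs. Fitted"-plot
plot(fitted(my_model), residuals(my_model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs. Fitted")
abline(h = 0, lty = 2)                                              # reference line at zero
lines(lowess(fitted(my_model), residuals(my_model)), col = "red")   # smoothed trend line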

2. Check the Normality of Residuals with a Q-Q Plot

The second method to check the normality assumption is by creating a Q-Q plot.

A Q-Q plot (or quantile-quantile plot) is a scatterplot that plots two sets of quantiles against one another. To check the normality of the residuals, you plot the theoretical quantiles of the normal distribution on the x-axis and the quantiles of the residual distribution on the y-axis. If the points fall along the diagonal line, you can assume that the residuals follow a normal distribution.
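
If you want to build this Q-Q plot directly from the residuals, base R offers the qqnorm() and qqline() functions; a minimal sketch:

# Q-Q plot of the residuals against the normal distribution
qqnorm(residuals(my_model))
qqline(residuals(my_model))   # reference line through the quartiles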

Like the “Residuals vs. Fitted”-plot, you can create a Q-Q plot with the plot() function. Again, you only need to provide a fitted regression model (i.e., an lm-object) as input; this time, the which = 2 argument selects the Normal Q-Q plot.

For example:

# 2. Check the Normal Q-Q Plot I
plot(my_model, which = 2)   # which = 2 selects the Normal Q-Q plot
A Q-Q plot in R

Alternatively, you can use the ols_plot_resid_qq() function from the “olsrr” package. This function provides the same Q-Q plot; however, it is visually a bit more appealing.

# 2. Check the Normal Q-Q Plot II
ols_plot_resid_qq(my_model)
Test the Normality of Residuals in R with a Q-Q plot.

Looking at the Q-Q plot above, we see that most of the points are located on the diagonal line. However, the extremes deviate from the line, and you might observe an “S”-shape. Therefore, this Q-Q plot is not conclusive regarding the normality of the residuals.

3. Create a Histogram of the Residuals

The third method to check the normality of the residuals in R is by creating a histogram.

A histogram counts the number of observations that fall within a set of ranges (bins). For the normality assumption to hold, the histogram should be centered around zero and show a bell-shaped curve. A high frequency at the extremes of the histogram could indicate that the residuals are not normally distributed.

You create a simple histogram of the residuals with the hist() function. To do so, you first need to extract the residuals from the model, which you can do with the $-sign. For example:

# 3. Create a Histogram of the Residuals I
hist(my_model$residuals, main = "Residual Histogram")
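
If you want to compare the histogram with the normal distribution yourself, you can plot it on a density scale and overlay the corresponding normal curve; a minimal sketch:

# Histogram on a density scale with a fitted normal curve on top
res <- my_model$residuals
hist(res, freq = FALSE, main = "Residual Histogram")
curve(dnorm(x, mean = mean(res), sd = sd(res)), add = TRUE, col = "red")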

Instead of the hist() function, you can also create a histogram with the ols_plot_resid_hist() function. An advantage of this function is that it adds a curve that shows the theoretical normal distribution.

# 3. Create a Histogram of the Residuals II
ols_plot_resid_hist(my_model)
Use a histogram to test the normality of residuals in R

The histogram above shows that most of the residuals fall around zero. Moreover, the number of observations in the tails (i.e., extremes) of the histogram is low. Therefore, we conclude that the residuals of our regression model follow a normal distribution.

4. Create a Boxplot of the Residuals

The fourth method to check the normality of the residuals in R is by creating a boxplot.

A boxplot is a graph that shows the locality, spread, and skewness of a set of observations, and it can be used to examine whether the residuals are normally distributed. If the residuals follow a normal distribution, the observations are located around zero and the whiskers have roughly the same length. Additionally, the number of outliers is typically low.
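
A quick numeric companion to the boxplot is to look at the quartiles of the residuals; if the quartiles are roughly symmetric around the median, the box and whiskers will be too. A minimal sketch:

# Quartiles of the residuals; rough symmetry around the median (50%)
# supports the normality assumption
quantile(my_model$residuals)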

You create a simple boxplot with the boxplot() function. This function only requires the residuals as mandatory input. Optionally, you can give the boxplot a title with the main argument.

# 4. Create a Boxplot I
boxplot(my_model$residuals, main="Residual Box Plot")

Like before, the “olsrr” package also offers a way to create a boxplot, namely the ols_plot_resid_box() function. As its argument, this function requires a fitted model (or lm-object). See the example below.

# 4. Create a Boxplot II
ols_plot_resid_box(my_model)

Looking at the boxplot above, we see that the residuals are located around zero and the whiskers are almost of the same size. However, there are some outliers. Nevertheless, we might assume that the residuals are normally distributed.

5. Perform a Normality Test

The fifth, and most conclusive, way to check the normality of the residuals in R is by using a formal normality test.

The two best-known tests to check the normality assumption are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Both test the null hypothesis that a set of observations (e.g., the residuals) follows a normal distribution. If we reject this null hypothesis, we conclude that the residuals are not normally distributed.

Two questions might arise:

  1. Which test should I use? The Shapiro-Wilk test or the Kolmogorov-Smirnov test?
  2. When do I reject the null hypothesis?

The type of test to use depends on the number of observations. In general, if you have fewer than 50 observations, you should use the Shapiro-Wilk test. Otherwise, the Kolmogorov-Smirnov test is better.

The criterion to reject the null hypothesis, and therefore conclude that the data is not normally distributed, is based on the p-value. Typically, if the p-value is below 0.05, we reject the null hypothesis.

# 5. Perform a Normality Test
ols_test_normality(my_model)
Use the Shapiro-Wilk test or the Kolmogorov-Smirnov test to check the normality of residuals in R.

Considering the number of observations in our dataset (74), we should use the Kolmogorov-Smirnov test to examine the normality of the residuals. Because the p-value is below 0.05, we reject the null hypothesis and conclude that the residuals do not follow a normal distribution.
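
If you prefer base R over the “olsrr” package, the shapiro.test() and ks.test() functions report the same kind of p-values; a minimal sketch (note that estimating the mean and standard deviation from the residuals makes the Kolmogorov-Smirnov p-value approximate):

# Shapiro-Wilk test on the residuals
shapiro.test(my_model$residuals)

# Kolmogorov-Smirnov test against a normal distribution whose mean and
# standard deviation are estimated from the residuals (approximate)
ks.test(my_model$residuals, "pnorm",
        mean = mean(my_model$residuals),
        sd   = sd(my_model$residuals))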