In this article, we discuss **5 ways to check the normality of residuals in R**.

The normality of the residuals is one of the main assumptions of a linear regression model. If the residuals are not normally distributed, model inference (i.e., hypothesis tests, confidence intervals, and prediction intervals) might be invalid. Therefore, it is crucial to check this assumption.

**In R, the best way to check the normality of the regression residuals is by using a statistical test, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test. Alternatively, you can use the “Residuals vs. Fitted”-plot, a Q-Q plot, a histogram, or a boxplot.**

In this article, we use basic R code and functions from the “olsrr” package to perform the checks mentioned above.

The “*olsrr*” package provides tools to quickly build regression models in R, as well as tests to check the model assumptions, such as the normality of the residuals. It also contains functions to test for multicollinearity and heteroscedasticity.

You install and load the “*olsrr*” package with the following commands.

```
# Install and load the "olsrr" package
install.packages("olsrr")
library("olsrr")
```

**5 Ways to Check that Regression Residuals are Normally Distributed in R**

Before discussing the 5 methods to check the normality of the residuals, we first create a linear regression model. For this purpose, we use the *auto* dataset from the “*olsrr*” package.

When you load the “*olsrr*” package, the *auto* data set is loaded automatically. This data set contains 74 rows and 11 columns with information about cars. You can see the first 6 rows of the *auto* data set with the *head()* function.

```
# View the "auto" dataset
head(auto)
```

We will create a simple linear regression model that tries to predict the miles per gallon (mpg) of a car given its weight. You can use the *lm()* function to create such a regression model.

```
# Create a Linear Regression Model
my_model <- lm(mpg ~ weight, data = auto)
summary(my_model)
```

As the summary output shows, both the intercept and the variable *weight* are significant in this model. Moreover, the miles per gallon decrease when a car’s weight increases (note the negative regression coefficient of -0.006).

Now, we continue with checking that the residuals (i.e., the difference between the observed value and the predicted value) of our regression model are normally distributed.
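To make this definition concrete, you can reproduce the residuals by hand as the observed values minus the fitted values (a quick sketch, assuming the *my_model* object created above):

```
# Residuals are the observed values minus the fitted (predicted) values
res_manual <- auto$mpg - fitted(my_model)

# They match the residuals stored in the model object
all.equal(unname(res_manual), unname(residuals(my_model)))
```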

**1. Check the Normality of Residuals with the “Residuals vs. Fitted”-Plot**

The first method to check the normality assumption is by creating a “Residuals vs. Fitted”-plot.

A “Residuals vs. Fitted”-plot is a scatter plot of the residuals on the y-axis and the fitted (i.e., predicted) values on the x-axis. **For the normality assumption to hold, the residuals should spread randomly around 0 and form a horizontal band.**

You can use the basic *plot()* function to create the “Residuals vs. Fitted”-plot. It requires a fitted regression model (i.e., an *lm*-object) as its argument. Because *plot()* cycles through several diagnostic plots for an *lm*-object, you can request the “Residuals vs. Fitted”-plot directly with the *which = 1* argument. See the example below.

```
# 1. Check the Residuals vs Fitted Plot I
plot(my_model, which = 1)
```

An advantage of the plot generated by the *plot()* function is that it shows both a horizontal line at 0 and a red trend line. **If the red trend line is approximately flat and close to zero, then one can assume that the residuals are normally distributed.**

Alternatively, you can use the *ols_plot_resid_fit()* function from the “*olsrr*” package. This function also creates a “Residual vs. Fitted”-plot, but it doesn’t show a trend line. It only highlights a horizontal line that indicates zero.

```
# 1. Check the Residuals vs Fitted Plot II
ols_plot_resid_fit(my_model)
```

Because the residuals are close to zero and the trend line is relatively flat (first plot), we can assume that the residuals follow a normal distribution.

**Note**: You can also use the “Residual vs. Fitted”-plot to check for heteroscedasticity among the residuals.

**2. Check the Normality of Residuals with a Q-Q Plot**

The second method to check the normality assumption is by creating a Q-Q plot.

A Q-Q plot (or quantile-quantile plot) is a scatterplot that plots two sets of quantiles against one another. To check the normality of the residuals, you plot the theoretical quantiles of the normal distribution on the x-axis and the quantiles of the residual distribution on the y-axis. **If the Q-Q plot forms a diagonal line, you can assume that the residuals follow a normal distribution.**

Like the “Residuals vs. Fitted”-plot, you create a Q-Q plot with the *plot()* function by providing a fitted regression model (i.e., an *lm*-object) as input. The *which = 2* argument selects the Q-Q plot.

For example:

```
# 2. Check the Normal Q-Q Plot I
plot(my_model, which = 2)
```
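Base R also offers the *qqnorm()* and *qqline()* functions, which work directly on the extracted residuals rather than on the fitted model. A minimal sketch:

```
# Q-Q plot of the residuals against the theoretical normal quantiles
qqnorm(my_model$residuals, main = "Normal Q-Q Plot of Residuals")

# Add the reference line the points should follow under normality
qqline(my_model$residuals)
```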

Alternatively, you can use the *ols_plot_resid_qq()* function from the “*olsrr*” package. This function produces the same Q-Q plot but is visually a bit more appealing.

```
# 2. Check the Normal Q-Q Plot II
ols_plot_resid_qq(my_model)
```

Looking at the Q-Q plot above, we see that most of the points lie on the diagonal line. However, the extremes deviate from the line, and you might observe an “S”-shape. Therefore, this Q-Q plot is not conclusive regarding the normality of the residuals.

**3. Create a Histogram of the Residuals**

The third method to check the normality of the residuals in R is by creating a histogram.

A histogram counts the number of observations that fall within a set of ranges (bins). **For the normality assumption to hold, the histogram should be centered around zero and show a bell-shaped curve.** A high frequency at the extremes of the histogram could indicate that the residuals are not normally distributed.

You create a simple histogram of the residuals with the *hist()* function. To do so, you first need to extract the residuals from the model, which you can do with the $-sign. For example:

```
# 3. Create a Histogram of the Residuals I
hist(my_model$residuals, main = "Residual Histogram")
```
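If you want the comparison with the theoretical normal curve without “*olsrr*”, you can overlay a density curve on a base R histogram. A sketch (the number of bins is an arbitrary choice):

```
# Histogram on the density scale so the normal curve is comparable
hist(my_model$residuals, freq = FALSE, breaks = 10,
     main = "Residual Histogram", xlab = "Residuals")

# Overlay the normal density with the residuals' mean and standard deviation
curve(dnorm(x, mean = mean(my_model$residuals), sd = sd(my_model$residuals)),
      add = TRUE, col = "red")
```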

Instead of the *hist()* function, you can also create a histogram with the *ols_plot_resid_hist()* function. An advantage of this function is that it overlays a line showing the theoretical normal density.

```
# 3. Create a Histogram of the Residuals II
ols_plot_resid_hist(my_model)
```

The histogram above shows that most of the residuals fall around zero. Moreover, the number of observations in the tails (i.e., extremes) of the histogram is low. **Therefore, the histogram suggests that the residuals of our regression model follow a normal distribution**.

**4. Create a Boxplot of the Residuals**

The fourth method to check the normality of the residuals in R is by creating a boxplot.

A boxplot is a graph that shows the locality, spread, and skewness of a set of observations and can be used to examine if residuals are normally distributed. **If the residuals follow a normal distribution, the observations are centered around zero and the whiskers are of roughly equal length.** Additionally, the number of outliers is typically low.

You create a simple boxplot with the *boxplot()* function. This function only requires the residuals as mandatory input. Optionally, you can give the boxplot a title with the *main=*-argument.

```
# 4. Create a Boxplot I
boxplot(my_model$residuals, main="Residual Box Plot")
```

Like before, the “*olsrr*” package also offers a way to create a boxplot, namely the *ols_plot_resid_box()* function. As its argument, this function requires a fitted model (or *lm*-object). See the example below.

```
# 4. Create a Boxplot II
ols_plot_resid_box(my_model)
```

Looking at the boxplot above, we see that the residuals are centered around zero and the whiskers are almost of the same length. However, there are some outliers. Nevertheless, we might assume that the residuals are normally distributed.
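You can back up the visual impression with a quick numeric check: under symmetry, the quartiles lie at roughly equal distances from the median. This is a rough heuristic, not a formal test:

```
# Distances of the quartiles from the median; similar values suggest symmetry
q <- quantile(my_model$residuals, probs = c(0.25, 0.50, 0.75))
c(lower = unname(q[2] - q[1]), upper = unname(q[3] - q[2]))
```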

**5. Perform a Normality Test**

The fifth, and **most conclusive way** to check the normality of the residuals in R is by using a formal normality test.

**The two best-known tests for checking the normality assumption are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Both test the null hypothesis that a set of observations (e.g., the residuals) follows a normal distribution. If we reject this null hypothesis, we conclude that the residuals are not normally distributed.**

Two questions might arise:

- Which test should I use? The Shapiro-Wilk test or the Kolmogorov-Smirnov test?
- When do I reject the null hypothesis?

The type of test to use depends on the number of observations. In general, if you have fewer than 50 observations, you should use the Shapiro-Wilk test; otherwise, the Kolmogorov-Smirnov test is the better choice.

The criterion for rejecting the null hypothesis, and therefore concluding that the data is not normally distributed, depends on the p-value. Typically, we reject the null hypothesis if the p-value is below 0.05.

```
# 5. Perform a Normality Test
ols_test_normality(my_model)
```
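The same tests are also available in base R. The *shapiro.test()* function works directly on the residuals, while the one-sample *ks.test()* compares them against a standard normal, so the residuals should be standardized first. Note that using the estimated mean and standard deviation in the Kolmogorov-Smirnov test makes the reported p-value approximate (strictly speaking, this calls for the Lilliefors variant):

```
# Shapiro-Wilk test on the residuals
shapiro.test(my_model$residuals)

# Kolmogorov-Smirnov test against a standard normal, after standardizing
ks.test(as.numeric(scale(my_model$residuals)), "pnorm")
```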

Considering the number of observations in our dataset (74), we should use the Kolmogorov-Smirnov test to examine the normality of the residuals. **Because the p-value is below 0.05, we reject the null hypothesis and conclude that the residuals do not follow a normal distribution.**