Working with data means dealing with missing values. In many analyses, missing values can be a problem. That is why we discuss in this article how to replace missing values with the median in R.
In R, you can replace missing values with the column median using either base R or the tidyverse package. In both cases, you first identify the NA’s with the is.na() function and then replace them with the column median, for example inside the ifelse() function. The median() function calculates the median.
In the remainder of this article, we show the exact R code to substitute NA’s with the median. However, if you want to replace missing values with the group’s median, we recommend reading this article.
How to Replace Missing Values with the Median
In this section, we demonstrate how you can replace the missing values in a particular column with the median.
We will use the data frame below to support the examples in this section. This data frame has 10 observations of which 2 are NA’s. The goal is to replace them with the median (in this case 8.5).
If you want to try the examples by yourself, you can use the following R code to reproduce the data frame above.
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_values)
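Before replacing anything, it can help to confirm the numbers quoted above. The following is a minimal sketch using only base R; the variable names n_missing and col_median are ours and serve only as illustration.

```r
# Recreate the example data frame
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_values)

# Count the missing values in the column
n_missing <- sum(is.na(my_df$my_values))

# Median of the non-missing values; without na.rm = TRUE the result would be NA
col_median <- median(my_df$my_values, na.rm = TRUE)

n_missing   # 2
col_median  # 8.5
```

The eight observed values sorted are 4, 5, 7, 8, 9, 10, 11, 12, so the median is the average of 8 and 9.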
How to Replace Missing Values with the Median with Base R
The first method to replace NA’s in R with the median uses only base R. Hence, it isn’t necessary to install or load any packages.
The first step to replace the missing values with the median is identifying the NA’s. You can identify them with the $ operator and the is.na() function. The $ operator selects the column, and the is.na() function checks all values in that column and returns TRUE if the value is missing. If the value is not missing, it returns FALSE.
The second step is to calculate the median (ignoring the missing values) and assign the outcome to all rows where the values of the specified column were NA.
The R code below shows an example.
my_df$my_values[is.na(my_df$my_values)] <- median(my_df$my_values, na.rm = TRUE)
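To make the two steps easier to follow, the one-liner can also be split into separate statements. This is a sketch of the same technique, not a different method; the intermediate names missing_rows and col_median are ours.

```r
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_values)

# Step 1: logical vector that is TRUE where the value is missing
missing_rows <- is.na(my_df$my_values)

# Step 2: median of the observed values, ignoring the NA's
col_median <- median(my_df$my_values, na.rm = TRUE)

# Assign the median only to the positions flagged as missing
my_df$my_values[missing_rows] <- col_median

which(missing_rows)        # 4 7
my_df$my_values[c(4, 7)]   # 8.5 8.5
```

Note that this approach overwrites the column in my_df. If you need the original values later, work on a copy of the data frame.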
How to Replace Missing Values with the Median with tidyverse
The second method to replace missing values in R with the median uses the tidyverse package.
This method is more user-friendly than the first method, mainly because of its readability. So, how do you replace missing values in R with the median?
These are the steps:
- Start the mutate() function.
- Specify the column in which you want to replace the missing values.
- Use the ifelse() function to identify NA’s, and once found, replace them with the median.
- Finish the mutate() function.
Most of the work is done by the ifelse() function. So, let’s discuss in more detail how this function works.
This is how you use the ifelse() function to replace missing values:
- Evaluate whether a value is missing. You can do this with the is.na() function.
- If the value is missing, replace it with the median. You can calculate the median with the median() function. Note that you must ignore NA’s when calculating the median. You can do this with the na.rm = TRUE argument.
- If the value is not missing, then use the original value.
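The three steps above can be tried on a small vector before using them inside mutate(). This is a minimal base R sketch; the vector x and the name filled are made up for the illustration.

```r
x <- c(4, NA, 10)

# ifelse(test, yes, no) works element by element:
# where test is TRUE it takes the value from yes, otherwise from no
filled <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

filled  # 4 7 10
```

The median of the observed values 4 and 10 is 7, so only the second element changes.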
The R code below shows how to replace NA’s with the median using the tidyverse package.
library(tidyverse)
my_df %>% mutate(my_values = ifelse(is.na(my_values), median(my_values, na.rm = TRUE), my_values))
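Keep in mind that mutate() returns a new data frame and leaves my_df unchanged unless you assign the result. Below is a minimal sketch of saving the result, assuming the dplyr package (part of the tidyverse) is installed; the name my_df_filled is ours.

```r
library(dplyr)

my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_values)

# mutate() returns a modified copy, so assign it to keep the change
my_df_filled <- my_df %>%
  mutate(my_values = ifelse(is.na(my_values),
                            median(my_values, na.rm = TRUE),
                            my_values))

anyNA(my_df)         # TRUE: the original is untouched
anyNA(my_df_filled)  # FALSE: the copy has no missing values
```

Assigning to a new name keeps the original data available for comparison. Assigning back to my_df would overwrite it instead.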
How to Replace Missing Values with the Median in All Numeric Columns
So far, we’ve demonstrated how to substitute missing values in a specific column.
However, if you have many numeric columns with missing values, then repeating the code above for each column can be a tedious task. So, how do you replace missing values with the median in all numeric columns?
The data frame in the image below has two numeric columns, both containing NA’s. The goal is to create generic R code that replaces the NA’s in all columns without repeating the same code multiple times.
To replace the missing values with the median in all the numeric columns of a data frame, you can use the tidyverse package. First, you use the mutate_if() function with is.numeric to select the numeric columns. Then, with a user-written function, you replace the missing values in each selected column with that column’s median.
The following R code shows how to impute the missing values with the median for all numeric columns.
library(tidyverse)
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_values_1, my_values_2)
view(my_df)
my_df %>% mutate_if(is.numeric, function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x))
As you can see in the image above, the R code substituted all NA’s with the median of the column. In the first column, the missing values were replaced with 8.5, and in the second column with 9.5.
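In dplyr 1.0.0 and later, scoped verbs such as mutate_if() are marked as superseded in favour of across(). The following sketch is an equivalent formulation, assuming dplyr 1.0.0 or newer is installed. The helper name impute_median is hypothetical and not part of any package.

```r
library(dplyr)

my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_values_1, my_values_2)

# Hypothetical helper: replace NA's in a vector with that vector's median
impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)

# Apply the helper to every numeric column
my_df_filled <- my_df %>%
  mutate(across(where(is.numeric), impute_median))

my_df_filled$my_values_1[4]  # 8.5
my_df_filled$my_values_2[2]  # 9.5
```

Both versions produce the same values here. mutate_if() still works, so switching is optional for existing code.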