3 Ways to Replace Missing Values with the Median per Group in R

In this article, we explain how to replace missing values in R with the group’s median.

The easiest way to replace missing values with the group’s median in R is with the dplyr package from the tidyverse. First, you define the groups with the group_by function. Second, you use the mutate function to modify the column that contains the missing values. Finally, you apply the ifelse and median functions to replace the NA’s.

If you want to know how to replace NA’s with the median of a column (i.e., not per group), please read this article.

Replace Missing Values in R with the Median by Group

There are different ways to replace missing values in R. Here we discuss three ways to replace NA’s with the group’s median.

In this section, we’ll use the data frame below to demonstrate how each method works.

The data frame has two columns and ten rows. You can separate the data into two groups, namely “A” and “B”. Each group has one missing value.

This is the R code to create this data frame.

my_groups <- c(rep("A", 5), rep("B",5))
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_groups, my_values)
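
Before replacing anything, it can help to check how many values are missing in each group and which median each group will receive. The sketch below uses only base R; the names na_per_group and median_per_group are illustrative and not part of the original example.

```r
# Recreate the example data frame so that this block runs on its own
my_groups <- c(rep("A", 5), rep("B", 5))
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_groups, my_values)

# Count the missing values in each group
na_per_group <- tapply(is.na(my_df$my_values), my_df$my_groups, sum)

# Median of the non-missing values in each group
median_per_group <- tapply(my_df$my_values, my_df$my_groups, median, na.rm = TRUE)

print(na_per_group)
print(median_per_group)
```

With this data, group A has the median 7 and group B has the median 9.5, so these are the values that should replace the two NA’s.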

How to Replace Missing Values with the Group’s Median with data.table

The first way to replace missing values in R with the median by group uses the data.table package.

The data.table package provides an improved version of the R base data.frame with syntax and feature enhancements for ease of use, convenience, and programming speed.

This is how to replace missing values with the group’s median with data.table:

1. Use the setDT() function to transform a data frame into a data.table.

2. Specify the column that contains the missing values.

3. Use the := operator to calculate the new column value per group.

4. Use the ifelse() function to identify missing values and replace them with the median.

5. Specify the column that defines the groups with the by option.

The R code below shows an example of the steps above.

library(data.table)
setDT(my_df)

my_df[, my_values := ifelse(is.na(my_values), 
                            median(my_values, na.rm = TRUE), 
                            my_values), 
      by = my_groups]
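
Note that setDT() converts the data frame in place and that := modifies the column by reference, so the original NA’s are gone after running the code. If you would rather keep the original data frame untouched, one possible variation is to work on a copy created with as.data.table(). The name my_dt is an assumption made for this sketch.

```r
library(data.table)

# Recreate the example data frame so that this block runs on its own
my_df <- data.frame(
  my_groups = c(rep("A", 5), rep("B", 5)),
  my_values = c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
)

# as.data.table() returns a copy, so my_df itself stays unchanged
# (setDT() would convert my_df in place instead)
my_dt <- as.data.table(my_df)

# Replace the NA's in the copy with the median of their group
my_dt[, my_values := ifelse(is.na(my_values),
                            median(my_values, na.rm = TRUE),
                            my_values),
      by = my_groups]

print(my_dt)
```

After this, my_dt contains the filled values, while my_df still contains its two missing values.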

How to Replace Missing Values with the Group’s Median with plyr

The second way to replace NA’s with the group’s median uses the plyr package.

The plyr package provides a set of tools to split data into homogeneous sets, apply a function to each piece, and combine the results back together.

In our example, we split the data frame into two sets based on the my_groups column, replace NA’s with the median of each set, and combine the results.

This is how you replace NA’s with the median per group with plyr:

1. Start the ddply() function.

2. Specify the data frame that contains the missing values.

3. Specify the column that defines the groups.

4. Use the transform option.

5. Specify the column that contains the missing values.

6. Use the ifelse() function to identify and replace missing values with the median.

7. Finish the ddply() function.

The R code below provides an example of the steps mentioned above.

library(plyr)
ddply(my_df, ~ my_groups, transform,
      my_values = ifelse(is.na(my_values), 
                         median(my_values, na.rm = TRUE), 
                         my_values))
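
Keep in mind that ddply() returns a new data frame and does not change my_df, so you need to store the result if you want to use it later. The sketch below does this and calls ddply() with the plyr:: prefix instead of library(plyr), which avoids plyr masking dplyr functions such as mutate() when both packages are used in one session. The name my_df_plyr is an assumption made for this sketch.

```r
# Recreate the example data frame so that this block runs on its own
my_df <- data.frame(
  my_groups = c(rep("A", 5), rep("B", 5)),
  my_values = c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
)

# Split by my_groups, replace the NA's within each piece, and
# store the combined result in a new data frame
my_df_plyr <- plyr::ddply(my_df, "my_groups", transform,
                          my_values = ifelse(is.na(my_values),
                                             median(my_values, na.rm = TRUE),
                                             my_values))

print(my_df_plyr)
```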

How to Replace Missing Values with the Group’s Median with tidyverse

The third way to replace missing values in R with the median per group uses the tidyverse, specifically its dplyr package. This is probably the most convenient way because the code is easy to read.

These are the steps to replace missing values in R with the group’s median:

1. Use the group_by() function to specify the column that defines the group.

2. Use the mutate() function to modify the values in the column with the missing values.

3. Apply the ifelse() function to identify and replace NA’s with the median.

The code below provides an example.

library(tidyverse)

my_df %>% 
  group_by(my_groups) %>% 
  mutate(my_values = ifelse(is.na(my_values), 
                            median(my_values, na.rm = TRUE), 
                            my_values))

Note that you can easily modify the R code above when not one, but multiple columns define the groups in your data frame. For example, use group_by(column1, column2).
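
As a minimal sketch of grouping by more than one column, the example below uses a hypothetical data frame called sales, in which the combination of region and product defines the groups. All names and values are made up for illustration. The final ungroup() is optional and only removes the grouping from the result.

```r
library(dplyr)

# Hypothetical data: two grouping columns and one column with missing values
sales <- data.frame(
  region  = rep(c("North", "South"), each = 6),
  product = rep(rep(c("X", "Y"), each = 3), times = 2),
  amount  = c(10, NA, 14, 3, 5, NA, NA, 8, 6, 20, NA, 30)
)

# Each NA is replaced with the median of its region/product combination
sales_filled <- sales %>%
  group_by(region, product) %>%
  mutate(amount = ifelse(is.na(amount),
                         median(amount, na.rm = TRUE),
                         amount)) %>%
  ungroup()

print(sales_filled)
```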

Replace Missing Values in R with the Median by Group in All Numeric Columns

So far, we’ve demonstrated how to handle missing values in one column. However, you might have a data frame with many columns. Applying the techniques mentioned before to each column separately can be a tedious task.

So, how do you replace missing values in all numeric columns with the group’s median?

These are the steps:

  1. Load the tidyverse library.
  2. Use the group_by() function to specify the column (or columns) that defines the groups.
  3. Apply the mutate_if() function to modify only the numeric columns.
  4. Use the ifelse() function to identify NA’s and replace them with the median.

The R code below provides an example of how to easily replace missing values in all numeric columns with the group’s median.

library(tidyverse)
my_groups <- c(rep("A", 5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(3, NA, 4, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_groups, my_values_1, my_values_2)

my_df %>% 
  group_by(my_groups) %>% 
  mutate_if(is.numeric, 
            function(x) ifelse(is.na(x), 
                               median(x, na.rm = TRUE), 
                               x))

If you run the code above, the missing values in my_values_1 and my_values_2 are replaced with the median of their group: 7 (group A) and 9.5 (group B) in my_values_1, and 3.5 (group A) and 10.5 (group B) in my_values_2.
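
In recent versions of dplyr (1.0.0 and later), mutate_if() is superseded by across(). The sketch below shows what an equivalent call might look like with across() and where(). It is an alternative spelling of the same idea, not a different method, and the name my_df_filled is an assumption made for this sketch.

```r
library(dplyr)

# Recreate the example data frame so that this block runs on its own
my_df <- data.frame(
  my_groups   = c(rep("A", 5), rep("B", 5)),
  my_values_1 = c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8),
  my_values_2 = c(3, NA, 4, 8, 2, 11, 15, NA, 9, 10)
)

# Apply the same replacement to every numeric column, per group
my_df_filled <- my_df %>%
  group_by(my_groups) %>%
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x))) %>%
  ungroup()

print(my_df_filled)
```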
