How to Replace Missing Values with the Minimum by Group in R

Working with data means working with missing values. You can use many values to substitute NA’s, e.g., the mean, a zero, or the minimum. In this article, we discuss how to replace missing values in R with the minimum per group.

The examples in this article assume that you’ve installed the tidyverse package. The tidyverse package is a powerful tool to, for example, manipulate data.

If you don’t have installed or loaded this package, you can run the code below.

if (!require("tidyverse")) install.packages("tidyverse")
library(tidyverse)

How to Replace Missing Values by Group in a Single Column

We start this article by explaining how to replace missing values in a single column with the lowest value of a group.

The R data frame below has two columns and ten rows. We can split the data into two groups based on the column my_groups. The goal is to substitute the NA’s with the minimum of each group.

R data frame with missing values

Replacing missing values with the minimum per group is a two-step process. Firstly, you use the group_by() function to identify the variable that defines the groups. Then, you use the mutate() function to specify in which variable you want to replace the NA’s.

Once you have specified the grouping variable and the variable to modify, you can use the replace_na() function and the min() function to replace the missing values with the lowest value.

Remember to add the na.rm = TRUE option to the min() function. Otherwise, the function doesn’t calculate the minimum and returns null.

The R code below shows how to impute the NA’s with the minimum considering groups.

my_groups <- c(rep("A",5), rep("B",5))
my_values <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_df <- data.frame(my_groups, my_values)
my_df

my_df <- my_df %>% 
  group_by(my_groups) %>% 
  mutate(my_values = replace_na(my_values, min(my_values, na.rm = TRUE)))
my_df
Replace missing values by group in R with minimum

In this example, we used one variable to specify the groups. However, if your groups are based on multiple variables, you can add more than one variable to the group_by() function.

How to Replace Missing Values by Group in Multiple Columns

Above we’ve demonstrated how you can replace missing values in one column. However, if your data frame has many columns, then replacing NA’s in one column at a time can be a tedious task.

So, in this section, we show how to replace missing values in multiple columns in an efficient way that doesn’t require much R code.

Replace Missing Values by Group in All Numeric Columns

The easiest way to replace missing values in multiple columns is to replace them in all numeric columns.

For example, the data frame below has two numeric columns and the goal is to replace all the missing values without explicitly specifying each column name.

Data frame with multiple columns

The first step is to specify the variable(s) that define the groups. You can use the group_by() function for this.

The second step is to use the mutate() and across() function. While the mutate() function modifies the values in one or more variables, the across() function lets you apply an operation to these variables.

You can use the everything() function as the first argument of the across() function to apply an operation to all columns. With the replace() function you replace the missing values with, for example, the minimum.

The R code below shows an example of how to replace the missing values in all numeric columns with the minimum of each group.

my_groups <- c(rep("A",5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
my_df <- data.frame(my_groups, my_values_1, my_values_2)
my_df

my_df <- my_df %>% 
  group_by(my_groups) %>% 
    mutate(
      across(everything(), ~replace_na(.x, min(.x, na.rm = TRUE)))
    )
my_df
Impute all NA's in R with the minimum per group

Replace Missing Values by Group in Multiple Columns Based on Their Names

Instead of replacing the missing values is all numeric columns, you can also explicitly specify in which columns you want to replace them. For example, based on the column names.

The data frame in the image below has several numeric columns with missing values. The goal is to impute the NA’s only in the columns my_values_1 and your_values_2.

Data frame with multiple columns

You can use the across() function to specify explicitly in which columns you want to replace missing values. If you want to specify the columns based on their names, you must create a vector of the names enclosed quotes and separated them by a comma.

The code below shows an example of how to use the across() function to impute NA’s with the lowest value per group.

The first argument of the across() function specifies in which column you want to replace the missing values. The second argument contains an operation. In this case, we use the replace_na() function to replace missing values with the minimum.

my_groups <- c(rep("A",5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
your_values_1 <- c(NA, 7, NA, 1, 2, 6, NA, NA, 9, 13)
your_values_2 <- c(8, 3, 2, NA, NA, NA, 6, 7, 9, NA)
my_df <- data.frame(my_groups, my_values_1, my_values_2, your_values_1, your_values_2)
my_df

my_df <- my_df %>% 
  group_by(my_groups) %>% 
  mutate(
    across(c("my_values_1", "your_values_2"), ~replace_na(.x, min(.x, na.rm = TRUE)))
  )
my_df

By default, the min() function doesn’t ignore missing values. So, it’s important to add the na.rm = TRUE option to remove the NA’s before calculating the lowest value.

Replace Missing values based on column names

As the image above demonstrates, we’ve replaced the missing values with each group’s minimum, only for the columns my_values_1 and your_values_2.

Replace Missing Values by Group in Multiple Columns Based on Their Position

Besides specifying the columns based on their name, you can only replace missing values in columns based on their position. That is to say, the first column, the second column, etc.

But, how do you replace missing values in columns based on their position?

The easiest way to replace missing values in R with the lowest value per group for some columns based on their position is with the across() function. The first argument of this function lets you define the column numbers, i.e. position, in which you want to replace the NA’s.

The example below shows the R code to replace the missing values in the third and fourth column.

Important: If you use the group_by() function in combination with the across() function, the across() function ignores the grouping column. In other words, if your grouping variable is the first column and you want to replace the missing values in the third and fourth column, the first argument of the across() function is c(2:3) (instead of c(3:4)).

my_groups <- c(rep("A",5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
your_values_1 <- c(NA, 7, NA, 1, 2, 6, NA, NA, 9, 13)
your_values_2 <- c(8, 3, 2, NA, NA, NA, 6, 7, 9, NA)
my_df <- data.frame(my_groups, my_values_1, my_values_2, your_values_1, your_values_2)
my_df

my_df <- my_df %>% 
  group_by(my_groups) %>% 
  mutate(
    across(c(2:3), ~replace_na(.x, min(.x, na.rm = TRUE)))
  )
my_df
Replace missing values by group for a range of columns

Replace Missing Values by Group in Columns with a Common Prefix/Suffix

Lastly, we explain how to replace missing values in columns that have a common prefix or suffix. For example, all columns starting with “var_” or end with “_2021“.

The data frame below has 5 columns and 10 rows. The rows can be separated into 2 groups by the my_groups variable. The goal is to replace the missing values in the numeric columns that start with “my_“. We want to do this without explicitly specifying the complete column names.

These are the steps to replace missing values in columns sharing a common prefix/suffix with the group’s minimum:

The first step is to create a variable with the positions of the columns with the prefix/suffix. You can do this with the grep() function. Remember to exclude any grouping variables while determining the positions.

For example, if you need the positions of the columns that start with “my_” and the grouping variable is “my_groups” you can use the following R code.

selected_columns <- grep("my_", names(my_df[, names(my_df) != "my_groups"] ))
selected_columns 

The second step is to create groups in your data frame. You can do this with the group_by() function. For example, group_by(my_groups).

The third step uses the mutate() function to modify the columns with missing values.

The last step is to actually replace the missing values. You can do this with the across() function which performs the same operation on multiple columns.

The across() function needs two arguments:

  1. The .cols argument to select the columns you want to operate on.
  2. The fns argument. This argument is normally a function that will be apply to the selected columns. For example, you can use ~replace_na(.x, min(.x, na.rm = TRUE)) to replace missing values with the column’s minimum (per group).

The following R code shows an example of the steps above.

my_groups <- c(rep("A",5), rep("B",5))
my_values_1 <- c(4, 9, 10, NA, 5, 12, NA, 7, 11, 8)
my_values_2 <- c(6, NA, 13, 8, 2, 11, 15, NA, 9, 10)
your_values_1 <- c(NA, 7, NA, 1, 2, 6, NA, NA, 9, 13)
your_values_2 <- c(8, 3, 2, NA, NA, NA, 6, 7, 9, NA)
my_df <- data.frame(my_groups, my_values_1, my_values_2, your_values_1, your_values_2)
my_df

selected_columns <- grep("my_", names(my_df[, names(my_df) != "my_groups"] ))

my_df <- my_df %>% 
  group_by(my_groups) %>% 
  mutate(
    across(selected_columns, ~replace_na(.x, min(.x, na.rm = TRUE)))
  )
my_df

When you run the code above, R replaces the NA’s in the columns that start with “my_” with the lowest value of the column calculated per group.

Replace Missing Values in R with the Minimum by Group for Columns that have a common prefix/suffix

Related Articles