In this article, we discuss how to find columns with NA’s in an R data frame.
Missing values (i.e., NA’s) can occur for many reasons, for example, missing responses in a survey or unknown values in an imported CSV file. As a consequence of missing values, R might throw an error or, even worse, calculate incorrect outcomes. Therefore, it is key to identify NA’s as early as possible.
In R, the easiest way to find columns that contain missing values is to combine the functions is.na() and colSums(). First, you count the number of NA’s per column. Then, you use a function such as names() or colnames() to return the names of the columns with at least one missing value.
Once you have identified the columns (or variables) with NA’s, you can handle them in several ways: remove those columns, remove the rows with missing values, or replace the NA’s.
In this article, we focus on how to find columns with missing values. We present 3 methods, each of which works on the common column types: character, numeric, and factor.
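As a rough sketch of these three options, the snippet below uses a small made-up data frame. The name toy_df and its columns are hypothetical and not part of this article's example, and replacing NA’s with the column mean is only one possible strategy, chosen here for illustration.

```r
# Hypothetical example data (not the data frame used later in this article)
toy_df <- data.frame(a = c(1, NA, 3), b = c("x", "y", "z"), c = c(NA, 5, 6))

# Option 1: drop every column that contains at least one NA
no_na_cols <- toy_df[, colSums(is.na(toy_df)) == 0, drop = FALSE]

# Option 2: drop every row that contains at least one NA
complete_rows <- toy_df[complete.cases(toy_df), ]

# Option 3: replace the NA's in a numeric column with the column mean
filled <- toy_df
filled$a[is.na(filled$a)] <- mean(filled$a, na.rm = TRUE)
```

The drop = FALSE argument keeps the result a data frame even when only one column survives.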
Columns with Missing Values
In our examples, we use an R data frame with 5 columns called X1, X2, X3, X4, and X5. Three of these columns contain at least one missing value, namely X1, X3, and X4. Therefore, we expect the methods we discuss to return the names of these columns.
my_df <- data.frame(
  X1 = c(1, 2, NA, 4),
  X2 = c(5:8),
  X3 = c("A", NA, NA, "D"),
  X4 = as.factor(c("M", "F", "F", NA)),
  X5 = c("W", "X", "Y", "Z")
)
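Before looking at individual columns, a quick sanity check can confirm that the data frame contains the missing values we expect. The sketch below rebuilds my_df and counts all NA’s at once; the variable name total_na is our own choice.

```r
my_df <- data.frame(X1 = c(1, 2, NA, 4), X2 = c(5:8),
                    X3 = c("A", NA, NA, "D"),
                    X4 = as.factor(c("M", "F", "F", NA)),
                    X5 = c("W", "X", "Y", "Z"))

# Total number of missing values in the whole data frame
total_na <- sum(is.na(my_df))
total_na  # 4

# Compact overview of the column types and values
str(my_df)
```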
3 Ways to Find Columns with Missing Values
The 3 methods below use only base R, so you do not need to install or load any package.
1. Find Columns with NA’s using the colSums() Function
The easiest method to find columns with missing values in R has 4 steps:
- Check if a value is missing
The is.na() function takes a data frame as input and returns a logical matrix that indicates for each value whether it is missing (TRUE) or not (FALSE). This matrix has the same dimensions as the input data frame.
- Count the number of missing values per column
Once you have converted the original data frame into a logical TRUE / FALSE matrix, you can count the number of missing values per column (i.e. the number of TRUEs). For this purpose, you can use the colSums() function.
- Identify the position of the columns with at least one missing value
The next step is to identify the columns with at least one missing value. To do so, you pass the condition colSums(is.na(my_df)) > 0 to the which() function. The which() function returns the positions of the columns with at least one NA, labeled with the column names.
- Return the column names with missing values
The last step is to extract the column names. Because which() returns a named vector, the names() function returns a character vector with the names of the columns that contain NA’s.
The R code below shows all the steps we discussed above. We print the output of each step separately to make the process easier to follow.
# Step 1: flag each value as missing (TRUE) or not (FALSE)
is.na(my_df)
# Step 2: count the missing values per column
colSums(is.na(my_df))
# Step 3: positions of the columns with at least one NA
which(colSums(is.na(my_df)) > 0)
# Step 4: names of the columns with at least one NA
names(which(colSums(is.na(my_df)) > 0))
The final line returns "X1" "X3" "X4", so this method has correctly identified the columns with missing values.
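If you need this check more than once, one option is to wrap the four steps in a small function. The sketch below is only a suggestion; the function name cols_with_na() is hypothetical and not part of base R.

```r
my_df <- data.frame(X1 = c(1, 2, NA, 4), X2 = c(5:8),
                    X3 = c("A", NA, NA, "D"),
                    X4 = as.factor(c("M", "F", "F", NA)),
                    X5 = c("W", "X", "Y", "Z"))

# Hypothetical helper: returns the names of all columns with at least one NA
cols_with_na <- function(df) {
  na_counts <- colSums(is.na(df))  # number of NA's per column
  names(which(na_counts > 0))      # keep only columns with a positive count
}

cols_with_na(my_df)  # "X1" "X3" "X4"
```

For a data frame without any missing values, the function returns an empty character vector.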
2. Find Columns with NA’s using the colnames() Function
The second method to identify columns with missing values is very similar to the first method.
The first two steps are the same. First, we convert the original data frame into a TRUE/FALSE matrix to identify the missing values. Then, we use the colSums() function to count the number of missing values per column.
However, instead of using the which() and names() functions, this method uses the colnames() function and bracket notation with a logical vector to return the names of the columns.
The R code below shows the four steps.
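# Step 1: flag each value as missing (TRUE) or not (FALSE)
is.na(my_df)
# Step 2: count the missing values per column
colSums(is.na(my_df))
# Step 3: TRUE for each column with at least one NA
colSums(is.na(my_df)) > 0
# Step 4: select the column names where the condition is TRUE
colnames(my_df)[colSums(is.na(my_df)) > 0]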
Again, this method returns the column names with missing values.
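To check that the first two methods agree, the sketch below stores the intermediate results once and compares both outcomes. The variable names (na_counts, has_na, via_which, via_logical) are our own and only serve the illustration.

```r
my_df <- data.frame(X1 = c(1, 2, NA, 4), X2 = c(5:8),
                    X3 = c("A", NA, NA, "D"),
                    X4 = as.factor(c("M", "F", "F", NA)),
                    X5 = c("W", "X", "Y", "Z"))

na_counts <- colSums(is.na(my_df))  # NA's per column, computed once
has_na <- na_counts > 0             # logical flag per column

via_which <- names(which(has_na))         # method 1
via_logical <- colnames(my_df)[has_na]    # method 2

identical(via_which, via_logical)  # TRUE

# The same flag also shows how many NA's each affected column has
na_counts[has_na]
```

Storing the result of colSums(is.na(my_df)) avoids scanning the data frame several times, which may matter for large data sets.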
3. Find Columns with NA’s using the anyNA() Function
The third method to find columns with NA’s in R requires less code than the previous methods but might be harder to follow.
Instead of using the is.na() and colSums() functions, we use the apply() function and the anyNA() function.
The apply() function goes through all columns and carries out a specific operation on each of them. In our case, the operation is to check for missing values, so we use the anyNA() function. When you combine the apply() and anyNA() functions, R returns for each column whether it contains at least one missing value.
Finally, you can use the colnames() function to filter the column names with NA’s. See the example below.
# TRUE for each column with at least one NA
apply(my_df, 2, anyNA)
# Select the column names where the result is TRUE
colnames(my_df)[apply(my_df, 2, anyNA)]
Note that the second argument of the apply() function indicates that we want to find missing values in columns (2) instead of rows (1).
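As an alternative sketch, vapply() can apply anyNA() to each column directly. This is our suggestion rather than one of the three methods above; it relies on a data frame being a list of columns and avoids the conversion to a matrix that apply() performs internally. The last lines illustrate the effect of switching the second argument of apply() to 1.

```r
my_df <- data.frame(X1 = c(1, 2, NA, 4), X2 = c(5:8),
                    X3 = c("A", NA, NA, "D"),
                    X4 = as.factor(c("M", "F", "F", NA)),
                    X5 = c("W", "X", "Y", "Z"))

# vapply() visits each column and guarantees one logical value per column
has_na <- vapply(my_df, anyNA, logical(1))
cols_vapply <- names(my_df)[has_na]
cols_vapply  # "X1" "X3" "X4"

# Same answer as the apply() approach
cols_apply <- colnames(my_df)[apply(my_df, 2, anyNA)]
identical(cols_vapply, cols_apply)  # TRUE

# Switching the margin to 1 checks rows instead of columns
rows_with_na <- which(unname(apply(my_df, 1, anyNA)))
rows_with_na  # 2 3 4
```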