3 Ways to Find & Remove Duplicated Columns in R [Examples]

In this article, we demonstrate 3 easy ways to find and remove duplicated columns from an R data frame. We focus solely on removing columns with identical content, not columns with similar names.

Although less frequent than duplicated rows, you might face data frames with duplicated columns. Though sometimes you want to repeat columns in a data frame, in general, you want to get rid of them. So, how do you find and eliminate duplicated columns?

The easiest way to remove duplicated columns from a data frame is by using the duplicated() function. Applied to the columns of a data frame (for example, after converting the data frame into a list), this function returns a logical vector that marks each column whose content repeats an earlier column. By using this logical vector and square brackets [], one can easily remove the duplicated columns.

In the remainder of this article, we discuss 3 ways to remove identical columns. Although the duplicated() function is the backbone of each method, there are some differences, especially in terms of performance.

If you want to know how to remove columns in general, check out this article.

3 Ways to Find & Remove Duplicated Columns from an R Data Frame

Before we discuss the 3 methods, we first create a data frame that will serve in all the examples.

The example data frame contains 5 columns (x1-x5) and 5 rows. Columns x1 and x4 are identical, and therefore we want to remove column x4.

my_df <- data.frame(x1 = c(1:5),
                    x2 = letters[1:5],
                    x3 = seq(10,50,by=10),
                    x4 = c(1:5),
                    x5 = LETTERS[1:5])
my_df
An R data frame with duplicated columns

Although columns x2 and x5 seem similar, they are actually different. R is case-sensitive, and therefore the methods discussed below will remove neither of them. However, if necessary, you can turn all character columns into uppercase before finding and removing duplicated columns.
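As a minimal sketch of that case-insensitive variant, the code below uppercases the character columns in a copy of the data frame and uses the list-based approach from method 1 to flag duplicates. The name normalized_df is only illustrative, and stringsAsFactors = FALSE is set explicitly as an assumption so that the text columns are character vectors in any R version.

```r
# Example data frame from this article (text columns kept as character)
my_df <- data.frame(x1 = 1:5,
                    x2 = letters[1:5],
                    x3 = seq(10, 50, by = 10),
                    x4 = 1:5,
                    x5 = LETTERS[1:5],
                    stringsAsFactors = FALSE)

# Work on a copy so the original values keep their case
normalized_df <- my_df
is_char <- vapply(normalized_df, is.character, logical(1))
normalized_df[is_char] <- lapply(normalized_df[is_char], toupper)

# Flag duplicates on the normalized copy, then subset the original
duplicated_columns <- duplicated(as.list(normalized_df))
my_df[!duplicated_columns]
```

With this normalization, x5 counts as a duplicate of x2 and is dropped together with x4, while the remaining columns keep their original lowercase values.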

1. Find & Remove Duplicated Columns by Converting a Data Frame into a List

The first method to eliminate duplicated columns in R is by using the duplicated() function and the as.list() function.

The duplicated() function determines which elements of a vector, list, or data frame are duplicates. However (and unfortunately), we can’t use a data frame as the argument of this function and expect it to return the duplicated columns. Instead, R would look for duplicated rows.

In other words, this won’t work if you want to find duplicated columns:

duplicated(dataframe)
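To see why, here is a quick check on the example data frame. It is only a sketch (the name row_check is illustrative), but it shows that the result has one element per row rather than one per column.

```r
# Example data frame from this article
my_df <- data.frame(x1 = 1:5,
                    x2 = letters[1:5],
                    x3 = seq(10, 50, by = 10),
                    x4 = 1:5,
                    x5 = LETTERS[1:5])

# duplicated() on a data frame checks ROWS
row_check <- duplicated(my_df)

length(row_check) == nrow(my_df)  # one element per row
any(row_check)                    # FALSE: no row is repeated in this example
```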

Therefore, you first need to convert your data frame into a list and use it as the argument of the duplicated() function. As a result, the duplicated() function returns a logical vector (i.e., a vector with TRUEs and FALSEs) that marks each column that is identical to an earlier column. The first occurrence is not marked, so it is kept. Next, you can use the square brackets [] and the !-symbol to select the unique columns.

This R code combines all the mentioned steps.

# Find the Duplicated Columns
duplicated_columns <- duplicated(as.list(my_df))

# Show the Names of the Duplicated Columns
colnames(my_df[duplicated_columns])

# Remove the Duplicated Columns
my_df[!duplicated_columns]
Find and remove duplicated columns in R by converting a data frame into a list
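If you need this more than once, the steps can be wrapped in a small function. The sketch below uses a hypothetical helper name, remove_duplicated_columns(), which is not part of base R or any package. It also shows the fromLast argument of duplicated(), which keeps the last occurrence of a repeated column instead of the first.

```r
# Example data frame from this article
my_df <- data.frame(x1 = 1:5,
                    x2 = letters[1:5],
                    x3 = seq(10, 50, by = 10),
                    x4 = 1:5,
                    x5 = LETTERS[1:5])

# Hypothetical helper: drop every column whose content repeats an
# earlier column, keeping the first occurrence
remove_duplicated_columns <- function(df) {
  stopifnot(is.data.frame(df))
  df[!duplicated(as.list(df))]
}

deduplicated_df <- remove_duplicated_columns(my_df)
names(deduplicated_df)  # x4 is gone, x1 is kept

# Variant: keep the LAST occurrence instead (drops x1, keeps x4)
keep_last_df <- my_df[!duplicated(as.list(my_df), fromLast = TRUE)]
names(keep_last_df)
```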

2. Find & Remove Duplicated Columns by Transposing a Data Frame

The second method to find and remove duplicated columns in R is by using the duplicated() function and the t() function.

This method is similar to the previous method. However, instead of creating a list, it transposes the data frame before applying the duplicated() function.

Remember that, if its argument is a data frame, the duplicated() function returns the duplicated rows. The same holds for a matrix, which is what the t() function returns. Therefore, by transposing a data frame first, its columns become rows. Next, the duplicated() function indicates which rows (i.e., the original columns) appear more than once.

Summarizing, these are the steps to remove duplicated columns with the duplicated() function and the t() function:

  1. Transpose your data frame with the t() function. In other words, convert the columns into rows (and vice versa).
  2. Use the duplicated() function to create a logical vector that indicates which columns repeat an earlier column.
  3. Optionally, show the names of the duplicated columns using the colnames() function.
  4. Remove the duplicated columns with the square brackets [] and the !-symbol.

# Find the Duplicated Columns
duplicated_columns <- duplicated(t(my_df))

# Show the Names of the Duplicated Columns
colnames(my_df[duplicated_columns])

# Remove the Duplicated Columns
my_df[!duplicated_columns]
Find and remove duplicated columns in R by transposing a data frame
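One difference from the list-based method is worth knowing. The t() function converts the data frame into a matrix, and a matrix can hold only one data type, so all values are coerced to a common type before they are compared. The sketch below uses a hypothetical data frame, typed_df, in which the same numbers are stored once as integers and once as doubles, to illustrate how the two methods can disagree.

```r
# Example data frame from this article: mixed column types
my_df <- data.frame(x1 = 1:5,
                    x2 = letters[1:5],
                    x3 = seq(10, 50, by = 10),
                    x4 = 1:5,
                    x5 = LETTERS[1:5])

# After transposing, every value is stored as text
is.character(t(my_df))

# Hypothetical example: same values, different storage types
typed_df <- data.frame(a = 1:3, b = c(1, 2, 3))

# List-based method: compares type as well as values, so b is not flagged
list_result <- duplicated(as.list(typed_df))

# Transpose method: both columns become doubles first, so b is flagged
transpose_result <- as.vector(duplicated(t(typed_df)))
```

Which behavior you want depends on your data. If columns that differ only in storage type should count as duplicates, the transpose method does that, whereas the list-based method is stricter.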

3. Find & Remove Duplicated Columns by Using a Hash Object

The third method to identify and remove duplicated columns in R uses a hash object.

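In general terms, a hash object (or simply a hash) is a short, fixed-length string computed from the content of an object, and identical content always produces the same hash. As a result, duplicated columns can be found by comparing one short string per column instead of the columns themselves. Without going into too much detail, this makes it the best of the three methods when it comes to removing duplicated columns and performance.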

These are the steps to find and remove duplicated columns efficiently:

  1. Load the digest package

    This package provides the digest() function for the creation of hash objects in R.

  2. Create hash objects

    Create a hash object of each column in the data frame with the sapply() function and the digest() function.

  3. Compare the hash objects

    Compare the hash objects to find any duplicates using the duplicated() function. Two columns with the same hash have identical content, apart from extremely unlikely hash collisions.

  4. Remove the duplicated columns

    Remove the duplicated columns found in the previous steps with the square brackets [] and the !-symbol.

The next R code combines all the steps.

# Load the "digest" Package
library(digest)

# Find the Duplicated Columns
duplicated_columns <- duplicated(sapply(my_df, digest))

# Show the Names of the Duplicated Columns
colnames(my_df[duplicated_columns])

# Remove the Duplicated Columns
my_df[!duplicated_columns]
Find and remove duplicated columns in R by using a Hash Object
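A slightly more defensive variant of the same idea is sketched below. It replaces sapply() with vapply(), which guarantees exactly one character hash per column, and adds an optional check with identical() to confirm that every column flagged by its hash really equals the first column with that hash. The names column_hashes, first_match, and confirmed are illustrative, and the requireNamespace() guard is an assumption so that the code is skipped when the digest package is not installed.

```r
# Example data frame from this article
my_df <- data.frame(x1 = 1:5,
                    x2 = letters[1:5],
                    x3 = seq(10, 50, by = 10),
                    x4 = 1:5,
                    x5 = LETTERS[1:5])

if (requireNamespace("digest", quietly = TRUE)) {
  # One hash (a single string) per column
  column_hashes <- vapply(my_df, digest::digest, character(1))
  duplicated_columns <- duplicated(column_hashes)

  # Optional safety check against hash collisions: compare each column
  # with the first column that has the same hash
  first_match <- match(column_hashes, column_hashes)
  confirmed <- vapply(seq_along(my_df), function(i) {
    identical(my_df[[i]], my_df[[first_match[i]]])
  }, logical(1))
  stopifnot(all(confirmed))

  # Remove the duplicated columns
  my_df[!duplicated_columns]
}
```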

To better understand why this method is the most efficient, take a look at the output of sapply(my_df, digest).

The sapply() function and the digest() function convert each column into a hash object. These hashes are merely simple strings. Therefore, finding duplicated columns is just finding duplicated strings. This is much easier (i.e., faster) than comparing all elements of each column one by one to make sure that columns are identical.
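If you want to check the performance difference on your own machine, a rough timing sketch like the one below can help. The data frame big_df is hypothetical test data (50 random numeric columns, of which the first 10 are repeated), and system.time() only gives a coarse impression. Timings depend on the size and types of your data, so no results are shown here.

```r
set.seed(1)

# Hypothetical test data: 50 random columns plus 10 repeated ones
base_df <- as.data.frame(matrix(rnorm(1e4 * 50), ncol = 50))
big_df <- cbind(base_df, base_df[1:10])
names(big_df) <- paste0("v", seq_along(big_df))

# Method 1: convert to a list
system.time(by_list <- duplicated(as.list(big_df)))

# Method 2: transpose
system.time(by_transpose <- as.vector(duplicated(t(big_df))))

# Method 3: hashes (skipped if the digest package is not installed)
if (requireNamespace("digest", quietly = TRUE)) {
  system.time(
    by_hash <- duplicated(vapply(big_df, digest::digest, character(1)))
  )
}

# The methods should agree on which columns are duplicates
stopifnot(identical(by_list, by_transpose))
sum(by_list)  # number of duplicated columns found
```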