How to Extract Words from a String in R [Examples]

This article discusses how to extract words from a string in R, for example, the first word, the last word, or the first few words.

The word() function from the stringr package provides the most convenient way to extract words from a string in R. Using the start and end arguments, you can read the first word, the last word, the first two words, etc.

However, if your string contains punctuation (e.g., commas or periods), you need the removePunctuation() function from the tm package or a function from the stringi package to extract a specific word.

Next, we’ll give some examples of extracting words from a string, either with or without punctuation. To follow these examples, you need to install the following packages.

install.packages("stringr")
library("stringr")

install.packages("stringi")
library("stringi")

install.packages("tm")
library("tm")

How to Extract Words from a String in R without Punctuation

If you want to extract one or more words from a string that does not contain punctuation, your best option is the word() function from the stringr package.

This is the syntax of the word() function:

word(string, start, end, sep)

where:

  • string is a single string, a character vector, or a character column in a data frame.
  • start defines the first word to extract. The default is 1.
  • end defines the last word to extract. The default is the value of start.
  • sep specifies the separator between two words. The default is a single space.

So, for example, if you want to extract the first word and the second word from a string, you can use start=1 and end=2.
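As a minimal sketch of the call described above, the code below extracts the first two words and shows that word() also works on a character vector. The example strings and the variable names are made up for illustration.

```r
library(stringr)

# Hypothetical example strings, chosen only for illustration
my_strings <- c("Today I eat a sandwich for lunch",
                "Yesterday I had a hamburger for dinner")

# Extract the first and the second word of every element
first_two <- word(my_strings, start = 1, end = 2)
first_two
```

Because word() is vectorized, the result contains one extracted phrase per input string.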

The word() function is very flexible as it can also count words from right to left (i.e., from the end of the string). You just need to provide negative values for the start and end arguments: -1 refers to the last word, -2 to the second-to-last word, and so on.

For instance, you can read the last two words from a string using the arguments: start=-2 and end=-1.

Normally, the words in a string are separated by a single space. However, if the words in your string are divided by another character (e.g., a forward slash or a dash), you can use the sep argument to specify the separator. For example, sep = "/".
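To illustrate the sep argument, here is a small sketch with two made-up strings, one divided by forward slashes and one by dashes. The strings and variable names are assumptions for the example only.

```r
library(stringr)

# Hypothetical strings whose "words" are separated by "/" and "-"
my_path <- "home/user/documents"
my_date <- "2024-05-17"

# Second element of the slash-separated string
second_part <- word(my_path, start = 2, end = 2, sep = "/")

# Last element of the dash-separated string
last_part <- word(my_date, start = -1, end = -1, sep = "-")

second_part
last_part
```

Note that a plain string passed to sep is most likely interpreted as a regular expression, so for a separator with a special meaning (such as a dot) it is probably safer to wrap it in fixed(), e.g., sep = fixed(".").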

Next, we provide some examples. Make sure you have installed and loaded the stringr package first.

Read the First Word of a String

# Read the First Word of a String
my_string <- "Today I eat a sandwich for lunch"
word(my_string, start = 1, end = 1)

Read the Last Word of a String

# Read the Last Word of a String
my_string <- "Today I eat a sandwich for lunch"
word(my_string, start = -1, end = -1)

Read the First N Words of a String

# Read the First N Words of a String
my_string <- "Today I eat a sandwich for lunch"

first_n_words <- 3
word(my_string, start = 1, end = first_n_words)

Read the Last N Words of a String

# Read the Last N Words of a String
my_string <- "Today I eat a sandwich for lunch"

last_n_words <- 3
word(my_string, start = -last_n_words, end = -1)

Extract Words from a String in R with dplyr

Because the stringr package is part of the tidyverse, you can use the word() function in combination with dplyr functions.

For example, below we use the word() function in combination with the mutate() function to create new variables.

# Extract Words in dplyr
# dplyr provides mutate() and the %>% pipe; run install.packages("dplyr") first if needed
library("dplyr")

my_data <- data.frame(my_text = c("Today I eat a sandwich for lunch",
                                  "Yesterday I had a hamburger for dinner",
                                  "Tomorrow I'll have a smoothie for breakfast"))

# Add three new columns, each holding the words extracted from my_text
my_data %>%
  mutate(first_word = word(my_text, start = 1, end = 1),
         last_word = word(my_text, start = -1, end = -1),
         third_and_fourth_word = word(my_text, start = 3, end = 4))

How to Extract Words from a String in R with Punctuation

So far, the text strings in the examples above didn’t contain punctuation.

However, if your string contains special characters, the word() function might not return the expected result.

For example:

my_string <- "Today, I eat a sandwich for lunch."
word(my_string, start = 1, end = 1)
my_string <- "Today, I eat a sandwich for lunch."
word(my_string, start = -1, end = -1)

As you can see, the word() function splits the string only at the separator. Therefore, it returns "Today," and "lunch." with the punctuation still attached instead of only the word.

To extract words from a string and ignore the punctuation, you have two options:

  1. Remove the punctuation first with the removePunctuation() function from the tm package.
  2. Use a function from the stringi package to extract the first or last word.

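The most flexible option to ignore special characters while reading words from a string is removing punctuation first with the removePunctuation() function from the tm package. This function removes special characters from a string, such as commas, dots, question marks, etc. Afterward, you can use the word() function as before. Note that, by default, it also removes apostrophes, so "I'll" becomes "Ill".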

For example:

my_string <- "Today, I eat a sandwich for lunch."
word(removePunctuation(my_string), start = 1, end = 1)

Or

my_string <- "Today, I eat a sandwich for lunch."
word(removePunctuation(my_string), start = -1, end = -1)
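The same approach works for more than one word. As a rough sketch (the variable names are only illustrative), you can clean the string once and then reuse it to read the first two and the last two words.

```r
library(stringr)
library(tm)

my_string <- "Today, I eat a sandwich for lunch."

# Strip the punctuation once and reuse the cleaned string
clean_string <- removePunctuation(my_string)

# First two and last two words of the cleaned string
first_two_words <- word(clean_string, start = 1, end = 2)
last_two_words <- word(clean_string, start = -2, end = -1)

first_two_words
last_two_words
```

Storing the cleaned string in its own variable avoids calling removePunctuation() again for every extraction.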

Instead of removing punctuation with a special function, you can also use the functions stri_extract_first_words() or stri_extract_last_words() from the stringi package to extract the first or last words from a string in R.

As the name suggests, these functions read the first (or last) word from a string. Moreover, they ignore punctuation.

Unfortunately, the stringi package has dedicated helpers only for the first and the last word, namely stri_extract_first_words() and stri_extract_last_words(). There is no equivalent helper for the second word, the last two words, etc. For those cases, you would have to extract all words with stri_extract_all_words() and select the ones you need yourself.

For example:

my_string <- "Today, I eat a sandwich for lunch."
stri_extract_first_words(my_string)
my_string <- "Today, I eat a sandwich for lunch."
stri_extract_last_words(my_string)
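If you need a word other than the first or the last one, a possible workaround within stringi is the stri_extract_all_words() function. The sketch below is one way to do this, not the only one; the variable names are made up for illustration.

```r
library(stringi)

my_string <- "Today, I eat a sandwich for lunch."

# stri_extract_all_words() returns a list with one character vector per
# input string, so [[1]] selects the words of the first (and only) string
all_words <- stri_extract_all_words(my_string)[[1]]

# Select words by position with ordinary vector indexing
second_word <- all_words[2]
last_two_words <- tail(all_words, 2)

second_word
last_two_words
```

Unlike word(), this returns the selected words as separate vector elements, so you would need paste(last_two_words, collapse = " ") to join them into a single string again.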
