In the last chapter, we focused on importing data into tibbles and then reshaping them to fit the tidy data criteria. In most cases, we had data with some structure, which we transformed into a different structure. This week, we look at working with strings for three reasons: cleaning messy data, filtering rows based on partial string matches, and extracting data from text.
4.2.1 Cleaning messy data
Sometimes, you may have data with a correct tidy structure, but the data itself is not clean and contains errors, unnecessary characters, or unwanted spelling or formatting variants. We need to clean that data before we can produce our analysis or report. Here is an example:
| full_names (messy) | full_names (tidy) |
|---|---|
| Colin Conrad, PhD | Colin Conrad |
| MACDONALD, Betrum | Bertrum MacDonald |
| Dr. Louise Spiteri | Louise Spiteri |
| Mongeon, Philippe | Philippe Mongeon |
| jennifer grek-martin | Jennifer Grek-Martin |
4.2.2 Filtering rows based on string matches
In the last chapter, we learned how to filter rows of a tibble based on the value contained in a cell or based on the row number. This week, we will add to our toolbox some string matching functions that check if a string of characters is found within a larger string of characters. One example could be retrieving a set of course codes starting with INFO or MGMT in a vector containing the course codes of all offerings of Dalhousie University.
4.2.3 Extracting data from text
Sometimes, you may have to deal with unstructured data such as a long character string containing data elements we wish to extract. This string, for example:
I am taking several courses offered at SIM this Winter. There is INFO6270 (Introduction to Data Science) and also INFO6540 and the information policy one, which I think has the course code INFO6610.
Maybe you had the brilliant idea to use a free text field in a survey to collect information about the courses that students are taking this Winter, and you now have three thousand responses that look like this one. This unstructured data needs to be structured before it can be analyzed, and R can help! This kind of task can be relatively simple but can also get quite complex. In this chapter, we will not do very complex data extractions from strings.
4.3 The stringr package
The stringr package (https://stringr.tidyverse.org) is part of the tidyverse and contains a collection of functions that perform all kinds of operations on strings. Let’s go through some of those tasks and some code examples.
4.3.1 Transforming strings
4.3.1.1 Changing string character case
One simple transformation you may want to perform on a string is changing its case. This is very easily done with the str_to_lower(), str_to_upper(), str_to_sentence(), and str_to_title() functions.
| Statement | Output |
|---|---|
| str_to_lower("HeLlO WoRlD!") | hello world! |
| str_to_upper("HeLlO WoRlD!") | HELLO WORLD! |
| str_to_sentence("HeLlO WoRlD!") | Hello world! |
| str_to_title("HeLlO WoRlD!") | Hello World! |
4.3.1.1.1 Vector example
# I load the tidyverse, which includes stringr, dplyr and tibble
library(tidyverse)

# I create a vector with character strings
vector <- c("I like coding with R", "i like coding in R", "R IS AMAZING!", "I LoVe R")

# I convert them all to lowercase.
str_to_lower(vector)
[1] "i like coding with r" "i like coding in r" "r is amazing!"
[4] "i love r"
4.3.1.1.2 Tibble example
# I create a tibble with inconsistent strings
t <- tibble(comments = c("I like coding with R", "i like coding in R", "R IS AMAZING!", "I LoVe R"))

# I use the mutate() and str_to_lower() functions to modify the messy column and make the strings consistent.
t %>%
  mutate(comments = str_to_lower(comments))
# A tibble: 4 × 1
comments
<chr>
1 i like coding with r
2 i like coding in r
3 r is amazing!
4 i love r
4.3.1.2 Replacing parts of strings
The functions str_replace() and str_replace_all() modify strings by replacing a pattern with another. The difference between the two is that str_replace() will only replace the first instance of the pattern in the string, while str_replace_all() will replace all the instances.
4.3.1.2.1 Vector example
# I create a vector with two strings.
names <- c("dr Mike Smit", "dr Sandra Toze")

# I replace the first instance of the pattern "dr" with "doctor".
names %>% str_replace("dr", "doctor")
[1] "doctor Mike Smit" "doctor Sandra Toze"
Let’s see what happens if I use the same example but use str_replace_all() instead of str_replace().
# I create a vector with two strings.
names <- c("dr Mike Smit", "dr Sandra Toze")

# I replace ALL instances of the pattern "dr" with "doctor".
names %>% str_replace_all("dr", "doctor")
[1] "doctor Mike Smit" "doctor Sandoctora Toze"
The second string got messed up because the second “dr” pattern in Sandra also got replaced with the pattern “doctor”.
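One way to avoid this problem, sketched here under the assumption that the title always sits at the very start of the string, is to anchor the pattern with "^" (a regular expression feature; regular expressions are introduced later in this chapter). The variable name fixed_names is only illustrative.

```r
library(stringr)

names <- c("dr Mike Smit", "dr Sandra Toze")

# "^dr" only matches "dr" at the start of each string,
# so the "dr" inside "Sandra" is left alone.
fixed_names <- str_replace_all(names, "^dr", "doctor")
fixed_names
```

With the anchor in place, str_replace() and str_replace_all() give the same result here, since a string has only one start.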
4.3.1.3 Removing parts of strings
The str_remove() and str_remove_all() functions are equivalent to str_replace("some pattern", "") and str_replace_all("some pattern", ""). They can make our code a little cleaner because we do not have to specify that we want to replace a given pattern with nothing.
# I create a vector with names
names <- c("dr Mike Smit", "dr Sandra Toze")

# I remove the first instance of the pattern "dr" from the names.
names %>% str_remove("dr")
[1] " Mike Smit" " Sandra Toze"
4.3.1.3.1 Tibble example
# I create a tibble with professor names.
t <- tibble(names = c("dr Mike Smit", "dr Sandra Toze"))

# I remove all instances of the pattern "dr" in the names.
t %>%
  mutate(names = str_remove_all(names, "dr"))
# A tibble: 2 × 1
names
<chr>
1 " Mike Smit"
2 " Sana Toze"
We can see that again removing all the “dr” patterns from the strings caused a problem because the pattern is also found in the name “Sandra”.
4.3.2 Removing extra spaces
The str_squish() function is a quick and easy way to remove unwanted spaces before or after a string, as well as consecutive spaces within a string.
messy_string <- "   My cat   just stepped on    the spacebar as I  was writing this   "

# Let's print the string to see what it looks like
messy_string
[1] "   My cat   just stepped on    the spacebar as I  was writing this   "
# Let's squish it!
str_squish(messy_string)
[1] "My cat just stepped on the spacebar as I was writing this"
The str_trim() function is similar to str_squish() but allows you to specify on which side of the string you wish to remove extra spaces. However, it only handles spaces at the beginning or end of strings and cannot remove extra spaces in the middle of a string.
string <- " hello world "

# remove spaces at the beginning
string %>% str_trim("left")
[1] "hello world "
# remove spaces at the end
string %>% str_trim("right")
[1] " hello world"
# remove spaces at the beginning and at the end
string %>% str_trim("both")
[1] "hello world"
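To make the difference between the two functions concrete, here is a small sketch that applies both to the same string. The string and the variable names (spaced, trimmed, squished) are hypothetical and chosen only for illustration.

```r
library(stringr)

# Hypothetical string with extra spaces at both ends and in the middle
spaced <- "  hello    world  "

trimmed  <- str_trim(spaced)    # side = "both" is the default
squished <- str_squish(spaced)

trimmed   # the spaces between the two words are kept
squished  # the run of spaces between the two words is reduced to one
```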
4.3.3 Combine strings
We already learned how to use the unite() function of the tidyr package to concatenate multiple data frame columns into one. However, the unite() function works only with data frames as input, which can be limiting. The stringr package offers a str_c() function that works with vectors, so it’s good to know how to use both functions.
4.3.3.0.1 Vector example
# I create a vector with first names
first_names <- c("Bertrum", "Colin", "Louise")

# I create a vector with last names
last_names <- c("MacDonald", "Conrad", "Spiteri")

# I combine my vectors into a new vector with full names
full_names <- str_c(first_names, last_names, sep = " ")

# I print the vector
print(full_names)
Another advantage of str_c() over unite() is that it is more flexible in terms of the strings that get concatenated. You can combine the content of two vectors and add any pattern you want to any string.
# I create a tibble with two columns containing first and last names.
my_tibble <- tibble(first_name = c("Bertrum", "Colin", "Louise"),
                    last_name = c("MacDonald", "Conrad", "Spiteri"))

# I add a column to my tibble with full names
my_tibble %>%
  mutate(full_name = str_c(first_name, last_name, sep = " "))
# A tibble: 3 × 3
first_name last_name full_name
<chr> <chr> <chr>
1 Bertrum MacDonald Bertrum MacDonald
2 Colin Conrad Colin Conrad
3 Louise Spiteri Louise Spiteri
# I add a column to my tibble with full names and include the Dr. pattern at the beginning of the name.
my_tibble %>%
  mutate(full_name = str_c("Dr.", first_name, last_name, sep = " "))
# A tibble: 3 × 3
first_name last_name full_name
<chr> <chr> <chr>
1 Bertrum MacDonald Dr. Bertrum MacDonald
2 Colin Conrad Dr. Colin Conrad
3 Louise Spiteri Dr. Louise Spiteri
4.3.4 Splitting strings
The str_split() function does the same thing as the separate() function that we learned about in chapter 3. They have slightly different syntax and arguments, but the main difference between the two functions is that str_split() works with vectors and returns a list, while separate() works with data frames and returns a data frame. In other words, if you want to split a string contained in a data frame column, you need to use separate(), and if you want to split a character vector into a list of character vectors, you need to use str_split(). The n argument of str_split() allows us to specify the maximum number of pieces that each string is split into. The basic syntax is str_split(character_vector, separator).
courses <- c("INFO5500, INFO6540, INFO6270", "INFO5500", "INFO5530, INFO5520")

# str_split separates the strings based on a specified delimiter.
# the outcome is a list of three vectors with 3, 1 and 2 elements.
courses %>% str_split(", ")
We can also specify the exact number of pieces we want to split the string into with str_split_fixed(). This function does not return a list but a character matrix.
# I split the courses vector into a matrix with 4 columns.
courses %>% str_split_fixed(", ", n = 4)
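Since the printed results are not shown above, the following sketch inspects the shape of what each function returns. It reuses the courses vector from the text; the names as_list and as_matrix are illustrative only.

```r
library(stringr)

courses <- c("INFO5500, INFO6540, INFO6270", "INFO5500", "INFO5530, INFO5520")

# str_split() returns a list with one character vector per input string
as_list <- str_split(courses, ", ")
lengths(as_list)

# str_split_fixed() returns a character matrix with one row per input string;
# rows with fewer than n pieces are padded with empty strings
as_matrix <- str_split_fixed(courses, ", ", n = 4)
dim(as_matrix)
as_matrix[2, ]
```

The matrix form can be convenient when every string is expected to have the same number of pieces, while the list form copes better with strings of varying length.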
The str_flatten() function takes a character vector of length x and concatenates all the elements into a character vector of length 1 (a single string) with a specified separator between the elements. In a sense, it is the opposite of str_split(). Its basic syntax is str_flatten(vector, separator).
4.3.4.1.1 Vector example
x <- c("a", "b", "c")
str_flatten(x, "|")
[1] "a|b|c"
4.3.4.1.2 Tibble example
Using str_flatten() in a tibble is tricky (we need to use the group_by() function that we briefly mentioned in the previous chapter but haven’t thoroughly explored yet) but also counterintuitive since it likely means that we are taking a tibble in a tidy format and making it untidy.
# Here is a tibble
my_tibble <- tibble(
  instructor = c("Mongeon, Philippe", "Mongeon, Philippe", "Mongeon, Philippe", "Spiteri, Louise", "Spiteri, Louise"),
  course = c("INFO5500", "INO6540", "INFO6270", "INFO6350", "INFO6480")
)
print(my_tibble)
# A tibble: 5 × 2
instructor course
<chr> <chr>
1 Mongeon, Philippe INFO5500
2 Mongeon, Philippe INO6540
3 Mongeon, Philippe INFO6270
4 Spiteri, Louise INFO6350
5 Spiteri, Louise INFO6480
Now I want to flatten my course column so that I have all the courses taught by the same instructor in a single row and separated with a “|”.
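A minimal sketch of one way to do this, assuming a grouped mutate() followed by unique(), which is consistent with the grouped output printed below. The tibble is recreated (with the course codes as printed above) so that the block runs on its own; the name flattened is illustrative.

```r
library(dplyr)
library(stringr)
library(tibble)

my_tibble <- tibble(
  instructor = c("Mongeon, Philippe", "Mongeon, Philippe", "Mongeon, Philippe",
                 "Spiteri, Louise", "Spiteri, Louise"),
  course = c("INFO5500", "INO6540", "INFO6270", "INFO6350", "INFO6480")
)

flattened <- my_tibble %>%
  group_by(instructor) %>%                            # one group per instructor
  mutate(course = str_flatten(course, " | ")) %>%     # flatten the courses within each group
  unique()                                            # drop the repeated rows

flattened
```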
# A tibble: 2 × 2
# Groups: instructor [2]
instructor course
<chr> <chr>
1 Mongeon, Philippe INFO5500 | INO6540 | INFO6270
2 Spiteri, Louise INFO6350 | INFO6480
Important
The unique() function at the end of the previous code removes the duplicates that are typically created with the str_flatten() function. You can try it yourself and see what happens when you don’t include the unique() step at the end.
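The sketch below shows the duplicates on a smaller, hypothetical tibble, along with one possible alternative. Note that summarise() is a dplyr function that this chapter does not cover; it is included only as an assumption about how the unique() step could be avoided.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical two-row tibble for a single instructor
small_tibble <- tibble(
  instructor = c("Spiteri, Louise", "Spiteri, Louise"),
  course = c("INFO6350", "INFO6480")
)

# Without unique(): mutate() keeps one row per original row,
# so the flattened string is repeated on every row of the group.
with_duplicates <- small_tibble %>%
  group_by(instructor) %>%
  mutate(course = str_flatten(course, " | "))
nrow(with_duplicates)

# summarise() returns one row per group, so no duplicates are created.
one_per_group <- small_tibble %>%
  group_by(instructor) %>%
  summarise(course = str_flatten(course, " | "))
nrow(one_per_group)
```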
4.3.5 Subsetting strings
4.3.5.1 str_sub
We can retrieve, for example, the first three characters of a string (e.g., a postal code) with the str_sub() function. Its basic syntax is str_sub(string, start, end).
4.3.5.1.1 Vector example
postal_code <- "B3H 4R2"

# get the first three characters of the postal code
postal_code %>% str_sub(1, 3)
[1] "B3H"
You can also retrieve the last characters of the string using negative numbers. Let’s get the last three characters of the postal code.
postal_code %>%str_sub(-3,-1)
[1] "4R2"
4.3.5.1.2 Tibble example
# I create my tibble
t <- tibble(postal_code = c("B3H 4R2", "B3H 7K7"))

# I print my tibble
t
# I add two new columns with the first three and the last three characters of the postal code.
t <- t %>%
  mutate(first_three_digits = str_sub(postal_code, 1, 3),
         last_three_digits = str_sub(postal_code, -3, -1))

# I print my new tibble
t
Notice how I created two new columns with the same mutate()? You can mutate as many things as you want in a single mutate() function. You simply need to add a comma to separate each mutation.
4.3.5.2 str_subset
The str_sub() function should not be confused with the str_subset() function, which returns the elements of a vector that contain a given pattern. Its basic syntax is str_subset(character_vector, string_to_find).
# I create a vector with course codes
course_codes <- c("INFO5500", "BUSI6500", "MGMT5000", "INFO6270")

# I print a vector of course codes that contain the pattern "INFO"
str_subset(course_codes, "INFO")
Caution
Note that you should not try to use the str_subset() function with a tibble. It is possible, but it requires combining multiple functions, and it’s not something that you are likely to need to do anyway.
4.3.6 Locating a pattern in a string
The str_locate() function allows you to find the position of a pattern in a string. This can be useful, for instance, in combination with str_sub() if you want to extract the part of a string that comes before or after the pattern. Let’s explore the str_locate() function with a few examples.
4.3.6.0.1 Vector examples
# I create a string with an email
email <- "info@somewebsite.ca"

# I locate the @ character
email %>% str_locate("@")
start end
[1,] 5 5
You can see that str_locate() returns a matrix with the start and end positions of the “@” pattern in the email. If we want to get the part of the string that comes before the “@”, then we can do this:
# I get the first part of the email
str_sub(email, 1, str_locate(email, "@")[, 1] - 1)
[1] "info"
We did three things there:
We used 1 as the first argument of str_sub() to specify that we want to extract a subset of the email starting with the 1st character.
We used [,1] to obtain the first column in the matrix, which is where our pattern starts (the 5th position).
We subtracted 1 because we don’t want to print characters 1 to 5, which would be “info@”, but characters 1 to 4.
So our statement, in English, would read like this: “extract the subset of the email string that starts at the first position and ends one position before where the “@” pattern is located”.
We can get the part that comes after the pattern “@” like this:
email %>% str_sub(str_locate(email, "@")[, 2] + 1, -1)
[1] "somewebsite.ca"
This reads as “give me the subset of the email string that starts one position after the location of the “@” pattern (str_locate(email,"@")[,2]+1), and ends with the last character of the string (-1)”. Note that the “,-1” part is optional since, by default, the str_sub() function will output the rest of the string when no end position is provided.
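As a quick check of that last point, the sketch below compares the call with and without the explicit end position. The names at_end, domain, and domain_explicit are illustrative only.

```r
library(stringr)

email <- "info@somewebsite.ca"

# Position where the "@" pattern ends (second column of the matrix)
at_end <- str_locate(email, "@")[, 2]

# No end position: str_sub() goes to the end of the string
domain <- str_sub(email, at_end + 1)

# Explicit end position (-1 means the last character)
domain_explicit <- str_sub(email, at_end + 1, -1)

domain
```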
4.3.6.0.2 Tibble example
Let’s just repeat the same example but working with a tibble.
# We create a tibble that contains some emails
my_tibble <- tibble(emails = c("info@somewebsite.ca", "support@datascienceisfun.com"))

# We print the tibble
print(my_tibble)
# Let's make this a bit more complex, and print only the part between the "@" and the "."
my_tibble %>%
  mutate(emails = str_sub(emails,                               # string to subset
                          str_locate(emails, "@")[, 2] + 1,     # starting position
                          str_locate(emails, "\\.")[, 1] - 1))  # ending position
Rather than extracting parts of strings or modifying strings, you may just want to test whether a string contains a specific pattern and get a logical value (TRUE or FALSE) in return.
4.3.7.1 str_detect
The str_detect() function allows us to identify strings that contain a specific pattern. Its syntax is str_detect(character_vector, string_to_detect).
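Here is a minimal vector sketch of that syntax, using two postal codes of the kind that appear in the tibble example that follows. The names postal_codes and in_b3h are illustrative only.

```r
library(stringr)

postal_codes <- c("B3H 1H5", "H2T 1H2")

# One logical value per string: does it contain the pattern "B3H"?
in_b3h <- str_detect(postal_codes, "B3H")
in_b3h
```

Because the result is a logical vector with one value per string, it can be passed directly to filter().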
This can be useful if we want to filter a tibble based on pattern matches. Here’s an example where we have a list of postal codes and would like to keep only those that are in Halifax.
4.3.7.1.2 Tibble example
# I create a tibble with postal codes
my_tibble <- tibble(postal_code = c("B3H 1H5", "B3H 382", "H2T 1H2", "J8P 9R2"))

# I print the rows for which the postal code contains the pattern "B3H"
my_tibble %>%
  filter(str_detect(postal_code, "B3H"))
The str_starts() and str_ends() functions do the same thing as str_detect(), but look for the pattern specifically at the beginning or the end of the strings.
# I create a tibble with postal codes
t <- tibble(postal_code = c("B3H 1H5", "B3H 382", "H2T 1H2", "J8P 9R2"))

# I print the postal codes that begin with "B3H"
t %>%
  filter(str_starts(postal_code, "B3H"))
# I print the postal codes that end with "1H2"
t %>%
  filter(str_ends(postal_code, "1H2"))
# A tibble: 1 × 1
postal_code
<chr>
1 H2T 1H2
4.3.8 Regular expressions (regex)
Regular expressions are a powerful way to search for patterns in text. A full understanding of regex is far beyond the scope of this course, but you should at least be aware of them. Below is a very superficial introduction to regular expressions. The stringr cheat sheet (https://github.com/rstudio/cheatsheets/blob/main/strings.pdf) is a great place to look for guidance on using regular expressions, as well as on the other functions in the stringr package, several of which I didn’t mention in this chapter but might still be useful. It lists the basic character classes and all the operators that you can use to search for patterns in strings, so remember that it’s there to help you.
4.3.8.1 Literal expressions
In the code examples above, we used several functions of the stringr package to search for patterns in strings (e.g., searching for the pattern “INFO” in a vector of strings). “INFO” is a literal expression. We can also search for more than one pattern at a time by combining them with the Boolean operator OR (represented by “|” in a search pattern).
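For instance, the course code scenario from the beginning of the chapter could be sketched as follows, reusing the course_codes vector from the str_subset() example. The name info_or_mgmt is illustrative only.

```r
library(stringr)

course_codes <- c("INFO5500", "BUSI6500", "MGMT5000", "INFO6270")

# Keep the course codes that contain "INFO" OR "MGMT"
info_or_mgmt <- str_subset(course_codes, "INFO|MGMT")
info_or_mgmt
```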
4.3.8.2 Character classes
Character classes allow you to search for a range of characters or a type of character (e.g., numbers, punctuation, symbols, letters, or a user-specified set or range of characters). These classes are written inside square brackets “[ ]”.
4.3.8.2.1 Example: remove unwanted characters from strings
You can use regular expressions to filter out of a string all the non-alphanumeric characters like this:
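A minimal sketch, assuming a hypothetical messy string (the string and the names messy and cleaned are not from the original text):

```r
library(stringr)

# Hypothetical messy string
messy <- "Hello, world! 123"

# Replace every character that is NOT a letter or a digit with a space
cleaned <- str_replace_all(messy, "[^[:alnum:]]", " ")
cleaned

# Replaced punctuation can leave consecutive spaces behind; str_squish() tidies them
str_squish(cleaned)
```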
[:alnum:] is a character class containing all characters that are alphabetical or numerical (letters and numbers).
[^ ] means “everything but” the characters listed inside the brackets.
So the statement reads: replace everything but alphanumeric characters with a space.
4.3.8.2.2 Example: find sequences of characters belonging to specific classes
We can search for specific sequences of character classes, which can be useful to retrieve things like postal codes from a string.
# We create a vector with two addresses
address <- c("5058 King St, Halifax, NS H2T 1J2", "427 Queen Avenue, Halifax, NS, B3H1H4")

# We extract the postal code from each address
address %>% str_extract("[:alpha:][:digit:][:alpha:] ?[:digit:][:alpha:][:digit:]")
[1] "H2T 1J2" "B3H1H4"
The pattern [:alpha:][:digit:][:alpha:] reads as “any letter, followed by any digit, followed by any letter”. The [:digit:][:alpha:][:digit:] pattern reads as “any digit, followed by any letter, followed by any digit”.
You might have noticed that there is a space and a question mark between my two sets of three character classes. This reads as “0 or 1 space” (see the quantifiers section in the stringr cheatsheet). This allows the pattern to also extract postal codes that are written with no space between the two sets of three characters.
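The same approach can be applied to the free-text survey response from the beginning of the chapter. The sketch below assumes that every course code of interest is “INFO” followed by exactly four digits, and uses the {4} quantifier and the str_extract_all() function, neither of which is covered in detail in this chapter. The names response and course_codes are illustrative.

```r
library(stringr)

response <- "I am taking several courses offered at SIM this Winter. There is INFO6270 (Introduction to Data Science) and also INFO6540 and the information policy one, which I think has the course code INFO6610."

# "INFO" followed by exactly four digits.
# str_extract_all() returns a list, so [[1]] gets the matches for our single string.
course_codes <- str_extract_all(response, "INFO[:digit:]{4}")[[1]]
course_codes
```

The lowercase word “information” in the response is not matched because the pattern is case sensitive and requires four digits after “INFO”.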
4.3.8.2.3 Example: search for spelling variants
Another convenient way of using character classes is when you want to match a word in a text that is or isn’t capitalized. Here’s an example.
# We create a tibble with 3 strings
my_tibble <- tibble(text = c("Information management is great",
                             "I love information management",
                             "Wayne Gretzy was the best hockey player of all times"))

# We print the tibble
print(my_tibble)
# A tibble: 3 × 1
text
<chr>
1 Information management is great
2 I love information management
3 Wayne Gretzy was the best hockey player of all times
# We select the texts that contain "information management" or "Information management".
my_tibble %>%
  filter(str_detect(text, "[Ii]nformation management"))
# A tibble: 2 × 1
text
<chr>
1 Information management is great
2 I love information management
4.3.8.2.4 Example: combining multiple search terms with “|” (boolean OR)
Instead of using character classes, we could combine multiple search terms with the “|” that represents the Boolean operator OR.
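A minimal sketch of such a statement, assuming the same three strings as in the previous example (the tibble is recreated so that the block runs on its own, and the name matches is illustrative):

```r
library(dplyr)
library(stringr)
library(tibble)

my_tibble <- tibble(text = c("Information management is great",
                             "I love information management",
                             "Wayne Gretzy was the best hockey player of all times"))

# Each spelling variant is written out in full and separated by "|"
matches <- my_tibble %>%
  filter(str_detect(text, "Information management|information management"))

matches
```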
# A tibble: 2 × 1
text
<chr>
1 Information management is great
2 I love information management
This works, but even with just two variants, you can already tell that it makes for longer statements.
4.3.8.2.5 Example: searching for a range of characters
# I create a tibble containing letters from a to g
my_tibble <- tibble(letters = c("a", "b", "c", "d", "e", "f", "g"))

# I retrieve rows that contain letters from a to f
my_tibble %>%
  filter(str_detect(letters, "[a-f]"))
# A tibble: 6 × 1
letters
<chr>
1 a
2 b
3 c
4 d
5 e
6 f
Again, we could have used “a|b|c|d|e|f” but this is less efficient. Here’s a similar example where we have lowercase and uppercase letters.
# I create a tibble containing letters from a to g in lowercase and uppercase.
my_tibble <- tibble(letters = c("a", "b", "c", "d", "e", "f", "g", "A", "B", "C", "D", "E", "F", "G"))

# I retrieve rows that contain the letters a to d in lowercase or uppercase
my_tibble %>%
  filter(str_detect(letters, "[a-dA-D]"))
# A tibble: 8 × 1
letters
<chr>
1 a
2 b
3 c
4 d
5 A
6 B
7 C
8 D
4.3.8.3 Beware of the dot, it’s a wild card
When matching character patterns, the “.” means any character.
string <- "This is a string"

# I extract every character
str_extract_all(string, ".")[[1]]
[1] "T" "h" "i" "s" " " "i" "s" " " "a" " " "s" "t" "r" "i" "n" "g"
# I replace every character with a space
string %>% str_replace_all(".", " ")
[1] "                "
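To match an actual dot rather than any character, the dot has to be escaped. The sketch below counts matches in a hypothetical sentence using three patterns: the wild card, the escaped dot, and the fixed() helper from stringr, which treats the pattern as literal text. The name sentence is illustrative.

```r
library(stringr)

# Hypothetical string containing two dots
sentence <- "This is a string. It has two sentences."

# The unescaped dot is a wild card: it matches every character
n_any <- str_count(sentence, ".")

# The escaped dot only matches literal dots
n_escaped <- str_count(sentence, "\\.")

# fixed() tells stringr to read the pattern literally, not as a regular expression
n_fixed <- str_count(sentence, fixed("."))

c(n_any, n_escaped, n_fixed)
```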
4.3.9 Dealing with special characters in strings
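Here are some of the special characters that you might come across when working with strings in R. To use one of these characters literally, you need to precede it with the escape character “\”. Because the backslash is itself a special character in R strings, it has to be doubled when you write a search pattern, as the last column of the table shows. Here is a table adapted from the stringr cheatsheet.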
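| String | Represents | How to search in a pattern |
|---|---|---|
| \. | . | \\. |
| \! | ! | \\! |
| \? | ? | \\? |
| \( | ( | \\( |
| \) | ) | \\) |
| \{ | { | \\{ |
| \} | } | \\} |
| \n | newline | \\n |
| \t | tab | \\t |
| \\ | backslash \ | \\\\ |
| \' | apostrophe ' | \\' |
| \" | quotation mark " | \\" |
| \` | backtick ` | \\` |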
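As a short illustration of the last column, the sketch below searches for literal parentheses and a literal question mark. The strings and the names course, title_in_parens, and has_question are hypothetical.

```r
library(stringr)

# Hypothetical string with a course title in parentheses
course <- "INFO6270 (Introduction to Data Science)"

# "\\(" and "\\)" match literal parentheses; ".*" matches whatever is in between
title_in_parens <- str_extract(course, "\\(.*\\)")
title_in_parens

# "\\?" matches a literal question mark
has_question <- str_detect(c("Is this R?", "This is R."), "\\?")
has_question
```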
Here are just a few examples so you can see how R deals with these special characters.
string <- "Dear diary\nWhat is wrong with me\nMy code never works as I entend"

# If we just print the string, we see it exactly as written.
print(string)
[1] "Dear diary\nWhat is wrong with me\nMy code never works as I entend"
The writeLines() function can be used to print the string where escaped characters are interpreted.
writeLines(string)
Dear diary
What is wrong with me
My code never works as I entend
Let’s read a text file (.txt) in R and see what happens.
url <- "https://pmongeon.github.io/info6270/files/boring_story.txt"

# reads the file and produces a vector with one element for each line
read_lines(url)
[1] "This is a \"story\" that I wrote just for the INFO6270 course."
[2] "It's a bit of a boring story, but it's just an example. So please forgive me."
[3] "...and they were happy ever after.\tThe end."
# reads the file and produces a vector with a single element containing the entire content
read_file(url)
[1] "This is a \"story\" that I wrote just for the INFO6270 course.\nIt's a bit of a boring story, but it's just an example. So please forgive me.\n...and they were happy ever after.\tThe end."
# Let's read the whole file and print it with writeLines()
read_file(url) %>% writeLines()
This is a "story" that I wrote just for the INFO6270 course.
It's a bit of a boring story, but it's just an example. So please forgive me.
...and they were happy ever after. The end.
4.3.10 Summary
This chapter introduced you to the stringr package and the general principles of manipulating and matching character patterns in R. The goal was to give you enough of the basics so that you can fix small issues with strings in the data that you might encounter in this course, and in your professional or personal lives.