3 Reading and tidying data
3.1 Learning objectives
- Import data in R
- Make data tidy with the tidyverse
- Export data
- Use the pipe to write clearer code
3.2 Tidy data
Data is stored in all kinds of places, can be accessed in many different ways, and comes in all kinds of shapes and forms. Therefore, much of the data scientist’s work involves collecting, processing, and cleaning data to get it ready for analysis.
Tidy data is a set of principles adapted from the relational model (those of you who took my data management course will be familiar with the relational model). According to Wickham (2014), those principles are:
Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.
While it may be implicit in principles 1 and 2, it is perhaps worth adding as a fourth principle that each cell should contain a single value. For example, comma-separated strings are not tidy.
Here’s an example of what tidy data looks like, taken from the Titanic dataset.
PassengerId | Survived | Pclass | Sex | Age | Ticket | Fare |
---|---|---|---|---|---|---|
1 | 0 | 3 | male | 22 | A/5 21171 | 7.2500 |
2 | 1 | 1 | female | 38 | PC 17599 | 71.2833 |
3 | 1 | 3 | female | 26 | STON/O2. 3101282 | 7.9250 |
4 | 1 | 1 | female | 35 | 113803 | 53.1000 |
5 | 0 | 3 | male | 35 | 373450 | 8.0500 |
6 | 0 | 3 | male | NA | 330877 | 8.4583 |
We can see that each row is a single observation (a passenger), that each column is a single variable, and that no cell contains multiple values.
Now let’s take a look at a dataset that is not tidy.
course | students | grades |
---|---|---|
INFO6270 | Francis, Sam, Amy | 92 (A+), 86 (A), 84 (A-) |
INFO6540 | Francis, Sam, Amy | 77 (B+), 100 (A+), 74 (B) |
INFO5500 | Francis, Sam, Amy | 86 (A), 70 (B-), 99 (A+) |
We can see that for each course, the students are listed in a single cell, and so are their grades. To make this data tidy, the first thing we might want to do is separate the students and their grades so that each occupies its own row, and each cell contains only one student or one grade.
course | students | grades |
---|---|---|
INFO6270 | Francis | 92 (A+) |
INFO6270 | Francis | 86 (A) |
INFO6270 | Francis | 84 (A-) |
INFO6270 | Sam | 92 (A+) |
INFO6270 | Sam | 86 (A) |
INFO6270 | Sam | 84 (A-) |
INFO6270 | Amy | 92 (A+) |
INFO6270 | Amy | 86 (A) |
INFO6270 | Amy | 84 (A-) |
INFO6540 | Francis | 77 (B+) |
INFO6540 | Francis | 100 (A+) |
INFO6540 | Francis | 74 (B) |
INFO6540 | Sam | 77 (B+) |
INFO6540 | Sam | 100 (A+) |
INFO6540 | Sam | 74 (B) |
INFO6540 | Amy | 77 (B+) |
INFO6540 | Amy | 100 (A+) |
INFO6540 | Amy | 74 (B) |
INFO5500 | Francis | 86 (A) |
INFO5500 | Francis | 70 (B-) |
INFO5500 | Francis | 99 (A+) |
INFO5500 | Sam | 86 (A) |
INFO5500 | Sam | 70 (B-) |
INFO5500 | Sam | 99 (A+) |
INFO5500 | Amy | 86 (A) |
INFO5500 | Amy | 70 (B-) |
INFO5500 | Amy | 99 (A+) |
This is much better, although we still have the issue of the numeric and letter grades being lumped together in a cell. To make this data truly tidy, we want to separate the numeric and the letter grades, like this:
course | students | grade_numeric | grade_letter |
---|---|---|---|
INFO6270 | Francis | 92 | A+ |
INFO6270 | Francis | 86 | A |
INFO6270 | Francis | 84 | A- |
INFO6270 | Sam | 92 | A+ |
INFO6270 | Sam | 86 | A |
INFO6270 | Sam | 84 | A- |
INFO6270 | Amy | 92 | A+ |
INFO6270 | Amy | 86 | A |
INFO6270 | Amy | 84 | A- |
INFO6540 | Francis | 77 | B+ |
INFO6540 | Francis | 100 | A+ |
INFO6540 | Francis | 74 | B |
INFO6540 | Sam | 77 | B+ |
INFO6540 | Sam | 100 | A+ |
INFO6540 | Sam | 74 | B |
INFO6540 | Amy | 77 | B+ |
INFO6540 | Amy | 100 | A+ |
INFO6540 | Amy | 74 | B |
INFO5500 | Francis | 86 | A |
INFO5500 | Francis | 70 | B- |
INFO5500 | Francis | 99 | A+ |
INFO5500 | Sam | 86 | A |
INFO5500 | Sam | 70 | B- |
INFO5500 | Sam | 99 | A+ |
INFO5500 | Amy | 86 | A |
INFO5500 | Amy | 70 | B- |
INFO5500 | Amy | 99 | A+ |
That’s it, now we have a tidy dataset of grades! You can read more about tidy data in the R for Data Science book (Wickham and Grolemund 2016), or in the journal article by Wickham (2014).
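As a preview, the manual tidying we just walked through can be automated with tidyr functions covered later in this chapter (separate_rows() and separate()). The sketch below assumes the tidyverse is loaded; the grades tibble simply recreates the untidy table from this section.

```r
library(tidyverse)

# Recreate the untidy grades table from this section
grades <- tibble(course = c("INFO6270", "INFO6540", "INFO5500"),
                 students = rep("Francis, Sam, Amy", 3),
                 grades = c("92 (A+), 86 (A), 84 (A-)",
                            "77 (B+), 100 (A+), 74 (B)",
                            "86 (A), 70 (B-), 99 (A+)"))

tidy_grades <- grades %>%
  separate_rows(students, sep = ", ") %>%                 # one row per student
  separate_rows(grades, sep = ", ") %>%                   # one row per grade
  separate(grades, sep = " ",                             # split "92 (A+)" ...
           into = c("grade_numeric", "grade_letter")) %>% # ... into two columns
  mutate(grade_numeric = as.numeric(grade_numeric),
         grade_letter = str_remove_all(grade_letter, "[()]")) # drop the parentheses

tidy_grades # 27 rows: course, students, grade_numeric, grade_letter
```

This reproduces the 27-row tidy table above in a few lines, which is the point of learning these functions.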
3.3 The tidyverse
The tidyverse is a collection of R packages and functions designed to help you make data tidy and work with tidy data. You can read more about the tidyverse and its packages here: https://www.tidyverse.org. The code below loads the tidyverse and returns the list of packages it includes.
library(tidyverse)
tidyverse_packages()
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
In the next sections, we will use a few of the tidyverse packages (but not exclusively) to import data into R from multiple types of sources. Because the tidy format is fairly standard and not R-specific, you will often find that the datasets you work with already respect the tidy principles. But you’ll also come across different data structures and formats, and so we’ll learn how to tidy up data.
3.4 Import data
The following sections show how to load data from different sources and formats into R.
Some of the processes and code in this section are beyond the level of R proficiency that you are expected to have at this point, or even at the end of the course. They were included here so that:
- You can gain awareness of the different ways in which data can be accessed through R.
- You can gain awareness of the different file formats and data structures that R can handle.
- You can have access to working pieces of code that you can use to get whatever data you need into R.
So if you do not feel like you thoroughly understand all the code and processes shown in section 3.4 and its subsections, do not worry, and focus on the general tasks that the code accomplishes.
3.4.1 Delimited file formats
The readr package (https://readr.tidyverse.org/) has a few useful functions for reading delimited file formats like comma-separated values (.csv), tab-separated values (.tsv), or any other type of delimiter. Here are a few examples (if you want to run the examples on your computer, you can download the Titanic dataset in different formats here).
# Imports a comma-separated file and saves it into a data frame called titanic
titanic <- read_csv("titanic.csv")

# Same, but with a tab-separated file (works with tab-separated .txt files also)
titanic <- read_tsv("titanic.tsv")

# Same, but with a txt file in which the columns are separated with a vertical bar.
titanic <- read_delim("titanic.txt", delim = "|")

# Same, but reading the file directly from a URL.
titanic <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")
As you can see from the third example above, you can specify any delimiter using the delim argument of the read_delim() function. You should also note that tab-delimited text files (.txt) are extremely common. You can read these files with the read_tsv() function even if the file has the .txt extension. Alternatively, you can use the read_delim() function with delim = "\t".
Often, you will see examples where the path to a file is first stored in an object, so the object name, rather than the whole path, can be used in a function. Like this:
path <- "https://pmongeon.github.io/info6270/files/data/titanic.csv"
titanic <- read_csv(path)
Another useful package is readxl (https://readxl.tidyverse.org/), which has functions to help you read data from Excel files, such as read_xlsx().
library(readxl)

path <- "https://pmongeon.github.io/info6270/files/data/halifax_weather.xlsx"

# This will read the first sheet of the Excel file
halifax_weather <- read_xlsx(path)

# This will read the second sheet of the Excel file
halifax_weather <- read_xlsx(path, sheet = 2)

# This will also read the second sheet of the Excel file
halifax_weather <- read_xlsx(path, sheet = "out")

# If you don't know what the names of the sheets are, you can read them like this
excel_sheets(path)
3.4.2 JSON files
The JSON format is very popular for exchanging information on the web, and it is the typical format of the data we retrieve from APIs (covered in the next section). However, the process for reading a JSON file and the process for reading JSON data from an application programming interface (API) are slightly different. This is how to convert a JSON file into a data frame using the fromJSON() function from the jsonlite package (included in the tidyverse).
library(jsonlite)

data <- fromJSON("https://pmongeon.github.io/info6270/files/data/public_housing_ns.json")
Important note: importing and working with JSON files with simple structures is relatively easy. However, reading more complex JSON files might require a little more work. Reading complex JSON structures is beyond the scope of this chapter.
3.4.3 Application programming interface (API)
APIs allow you to interact with computers, servers, or software by making different requests, such as sending or retrieving data through the web. Some APIs can be used freely and anonymously; others might require that you identify yourself with your email, for instance. Finally, some APIs require that you create an account to obtain an API key to use in your code. Fully understanding how APIs work and being proficient with them is beyond the scope of this course, but you should at least know that they exist and that they can be used to collect data for your data science projects. The R package that helps you work with APIs is httr. Below is an example of a request to retrieve data from a free, anonymous API.
Important note: You do not need to understand how all of the code works. I mainly want you to have a working piece of code that you can use as a template if you ever need to retrieve data from an API.
# Load the httr package
library(httr)

# Make a request to an API to GET data
data = GET("https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json")

# Isolate the content part of the information received from the API. This step is
# necessary because the GET request returns additional information about the request
# along with the requested data.
data = content(data)

# Load the data.table package for the rbindlist() function used in the code below.
# You may need to install the package first if you don't have it installed already.
library(data.table)

# Converts the data retrieved from the API call to a tibble
data = as_tibble(rbindlist(data))
3.4.4 XML files
The xml2 package (also included in the tidyverse) provides a set of functions to work with data in the XML format. The conversion of XML data into a data frame is not so straightforward because of the nested structure of XML. If you need to import XML data into R, the following code performs a series of steps that convert an XML file from the data.novascotia.ca website into a tibble. We will explore this code in more detail later in this chapter, when we look more closely at the unnest() function.
Important note: the goal here is mainly to provide you with a piece of code that works, so that you can use it as a template to read XML files if you need to and convert them into data frames.
library(xml2)

path <- "https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml"

public_housing_xml = as_list(read_xml(path))
public_housing = as_tibble(public_housing_xml)
public_housing = unnest_longer(public_housing, colnames(public_housing)[1])
public_housing = unnest_wider(public_housing, colnames(public_housing)[1])
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = type_convert(public_housing)
3.4.5 Connections
Important note: This section is a little bit advanced, so please do not feel like you need to master database and file connections at this point in your R journey. Again, I include this here to raise your awareness of connections and give you some working pieces of code you can use if you ever want or need to access data stored in a large file or a database.
3.4.5.1 Connect to a file
If you are trying to read a very large file, you may run into issues because the objects in your environment are stored in your computer’s memory (RAM). So if your computer has 4GB of RAM, then files larger than 4GB can’t be imported into a data frame. One solution is to create a connection to the file using the file() function, and then process the data a certain number of lines at a time with the readLines() function. Here is an example.
# path to a text file containing data
path <- "https://pmongeon.github.io/info6270/files/data/titanic.txt"

# open a connection to the file (note that connections are stored as objects)
con <- file(path, open = "rb")
# Read the first 5 lines and print them
readLines(con, n = 5)
[1] "PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked"
[2] "1|0|3|Braund, Mr. Owen Harris|male|22|1|0|A/5 21171|7.25|NA|S"
[3] "2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38|1|0|PC 17599|71.2833|C85|C"
[4] "3|1|3|Heikkinen, Miss. Laina|female|26|0|0|STON/O2. 3101282|7.925|NA|S"
[5] "4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35|1|0|113803|53.1|C123|S"
# Read the NEXT 5 lines and print them.
readLines(con, n=5)
[1] "5|0|3|Allen, Mr. William Henry|male|35|0|0|373450|8.05|NA|S"
[2] "6|0|3|Moran, Mr. James|male|NA|0|0|330877|8.4583|NA|Q"
[3] "7|0|1|McCarthy, Mr. Timothy J|male|54|0|0|17463|51.8625|E46|S"
[4] "8|0|3|Palsson, Master. Gosta Leonard|male|2|3|1|349909|21.075|NA|S"
[5] "9|1|3|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)|female|27|0|2|347742|11.1333|NA|S"
# Close the connection to the file.
close(con)
If you wanted to go through the entire file, 50 lines at a time, and do something with each chunk, you could write:
# open the connection to a text file containing data.
path <- "https://pmongeon.github.io/info6270/files/data/titanic.txt"
con <- file(path, open = "rb")

# this will read 50 lines and print them until the end of the file has been reached.
repeat {
  x <- readLines(con, n = 50)
  if (is_empty(x)) {
    break
  } else {
    print(x)
  }
}
# Close the connection to the file.
close(con)
3.4.5.2 Connect to a database
You can also connect to practically all types of local or remote databases and servers, as long as you have the required credentials. For example, the RMySQL package is great for working with MySQL databases: you can connect to a MySQL database provided that you have a user name, a password, and the other information required by the dbConnect() function. With the RSQLite package, you can also work with local SQLite databases stored in a single file on your computer. The code below shows you how to connect to MySQL or SQLite databases in R.
## This creates a connection to a MySQL database for which you would need
## access credentials.
library(RMySQL)

con = dbConnect(MySQL(),
                user = "username",
                password = "password",
                host = "host",
                port = port_number,
                dbname = "database name")

# This opens a connection called con to the info6270 SQLite database.
# If the database doesn't exist, this code will also create it.
library(RSQLite)

con <- dbConnect(drv = SQLite(), dbname = "C:/info6270.db")
Once the connection is established, you can interact with the database with a set of functions, some of which are presented in the following code:
# list the tables in the database accessed through the connection.
dbListTables(con)

# list the fields of a table
dbListFields(con, "table_name")

# import the results of a SQL query
data <- dbGetQuery(con, "SQL query")

# import all the data from a table
data <- dbReadTable(con, "table name")

# upload a data frame to a table (change options as needed)
dbWriteTable(con, "table name", data, row.names = FALSE, overwrite = FALSE, append = TRUE)
3.5 Tidy your data
Now we know how to import data into R. However, not all data comes in a nice and tidy shape, even if the data is already shaped like a table. In this section, we’ll explore how to change the structure of your data frames with tidyr (https://tidyr.tidyverse.org/).
3.5.1 Reshape data
The reshape functions include pivot_longer() and its opposite, pivot_wider(). Let’s look at how they work. First, we create a tibble with fictional amounts of funding received from the Canadian Tri-Council by some Nova Scotian universities.
# This creates a funding tibble with 4 columns and some data.
# Note: the sample(1:100, 3) function randomly chooses three values between 1 and 100.
funding <- tibble(university = as.character(c("DAL","SMU","SFX")),
                  SSHRC = as.numeric(sample(1:100, 3)),
                  NSERC = as.numeric(sample(1:100, 3)),
                  CIHR = as.numeric(sample(1:100, 3)))

funding
# A tibble: 3 × 4
university SSHRC NSERC CIHR
<chr> <dbl> <dbl> <dbl>
1 DAL 70 96 44
2 SMU 8 12 100
3 SFX 59 97 79
3.5.1.1 pivot_longer
The pivot_longer() function makes your data longer by stacking multiple columns into two: one containing the column names and one containing the values.

funding = pivot_longer(funding, cols = c("SSHRC","NSERC","CIHR"), names_to = "funder", values_to = "funding_amount")

funding
# A tibble: 9 × 3
university funder funding_amount
<chr> <chr> <dbl>
1 DAL SSHRC 70
2 DAL NSERC 96
3 DAL CIHR 44
4 SMU SSHRC 8
5 SMU NSERC 12
6 SMU CIHR 100
7 SFX SSHRC 59
8 SFX NSERC 97
9 SFX CIHR 79
3.5.1.2 pivot_wider
The pivot_wider() function does the opposite of pivot_longer(): it takes a column of names and a column of values, creates a new column for each name, and stores the corresponding values in it. The following example will perhaps make this clearer.

funding = pivot_wider(funding, names_from = funder, values_from = funding_amount)

funding
# A tibble: 3 × 4
university SSHRC NSERC CIHR
<chr> <dbl> <dbl> <dbl>
1 DAL 70 96 44
2 SMU 8 12 100
3 SFX 59 97 79
3.5.2 Expand tables
Let’s create a new funding tibble to explore some more tidyr functions.
funding <- tibble(university = as.character(c("DAL","DAL","DAL","SMU","SMU","SFX","SFX")),
                  funder = as.character(c("SSHRC","NSERC","CIHR","SSHRC","CIHR","NSERC","CIHR")),
                  n_grants = as.numeric(sample(1:100, 7)))

print(funding)
# A tibble: 7 × 3
university funder n_grants
<chr> <chr> <dbl>
1 DAL SSHRC 78
2 DAL NSERC 58
3 DAL CIHR 95
4 SMU SSHRC 82
5 SMU CIHR 30
6 SFX NSERC 71
7 SFX CIHR 43
3.5.2.1 expand
The expand() function creates all possible combinations of the values in two or more columns indicated in the function’s arguments and drops the other columns. The example below returns the possible combinations of universities and funders in the funding tibble we created. You can see that the SMU-NSERC and SFX-SSHRC combinations appear even though they are not present in the original funding tibble.
expand(funding, university, funder)
# A tibble: 9 × 2
university funder
<chr> <chr>
1 DAL CIHR
2 DAL NSERC
3 DAL SSHRC
4 SFX CIHR
5 SFX NSERC
6 SFX SSHRC
7 SMU CIHR
8 SMU NSERC
9 SMU SSHRC
3.5.2.2 complete
The complete() function does the same thing as the expand() function, but it keeps all the columns of the original tibble.
funding = complete(funding, university, funder)

print(funding)
# A tibble: 9 × 3
university funder n_grants
<chr> <chr> <dbl>
1 DAL CIHR 95
2 DAL NSERC 58
3 DAL SSHRC 78
4 SFX CIHR 43
5 SFX NSERC 71
6 SFX SSHRC NA
7 SMU CIHR 30
8 SMU NSERC NA
9 SMU SSHRC 82
3.5.3 Handle missing values
Sometimes your dataset is incomplete for one reason or another, such as a survey participant who did not complete the survey or answer all the questions. What to do with missing values or incomplete records depends on the nature of the data and the goal of your analysis.
3.5.3.1 drop_na
The drop_na() function is useful to remove incomplete observations from the data (i.e., rows in which at least one column contains no value). The code below includes a statement that removes all incomplete observations, and one where we specify the column names that should be scanned for NA values (rows won’t be deleted if there are NAs in the other columns).
# This removes all incomplete observations
drop_na(funding)
# A tibble: 7 × 3
university funder n_grants
<chr> <chr> <dbl>
1 DAL CIHR 95
2 DAL NSERC 58
3 DAL SSHRC 78
4 SFX CIHR 43
5 SFX NSERC 71
6 SMU CIHR 30
7 SMU SSHRC 82
# This would remove rows where the n_grants column is NA and leave NA values in other
# columns.
drop_na(funding, n_grants)
# A tibble: 7 × 3
university funder n_grants
<chr> <chr> <dbl>
1 DAL CIHR 95
2 DAL NSERC 58
3 DAL SSHRC 78
4 SFX CIHR 43
5 SFX NSERC 71
6 SMU CIHR 30
7 SMU SSHRC 82
3.5.3.2 replace_na
Instead of deleting rows with missing values, we may wish to replace NAs with some other value (e.g., 0). We can do that with replace_na(). You have to provide, in a list, the columns for which you want to replace NAs and the values to replace the NAs with.
replace_na(funding, list(n_grants = 0))
# A tibble: 9 × 3
university funder n_grants
<chr> <chr> <dbl>
1 DAL CIHR 95
2 DAL NSERC 58
3 DAL SSHRC 78
4 SFX CIHR 43
5 SFX NSERC 71
6 SFX SSHRC 0
7 SMU CIHR 30
8 SMU NSERC 0
9 SMU SSHRC 82
If we want to replace NAs in multiple columns, we simply need to add more column names and replacement values to the list. The code below replaces the NAs with 0 in two columns of some tibble.
replace_na(some_tibble, list(some_column = 0, some_other_column = 0))
Replacing null values with zeros for a numerical variable may seem logical, and it may be appropriate for your intended purpose. However, NA and 0 are not logically equivalent: NA implies the absence of data (something that we did not or could not observe), while 0 implies an observed or measured value of zero.
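A quick base R illustration of why this distinction matters: replacing NA with 0 changes summary statistics such as the mean.

```r
x <- c(10, 20, NA)

# Treating the NA as missing data excludes it from the calculation.
mean(x, na.rm = TRUE)          # 15

# Treating the NA as an observed zero pulls the mean down.
mean(replace(x, is.na(x), 0))  # 10
```

So before reaching for replace_na(), ask yourself which of these two interpretations is right for your data.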
3.5.4 Merge and split cells
Let’s create a simple tibble with some names.
my_tibble <- tibble(first_name = c("Jos","May"),
                    last_name = c("Louis","West"))

print(my_tibble)
# A tibble: 2 × 2
first_name last_name
<chr> <chr>
1 Jos Louis
2 May West
3.5.4.1 unite
In some scenarios, it may be more useful to have the first and last names combined into a full_name column. This can be achieved with the unite() function.
my_tibble = unite(my_tibble,
                  "first_name", "last_name", # columns to unite.
                  col = "full_name",         # name of the new column.
                  sep = " ")                 # separate the first and last name with a space.
print(my_tibble)
# A tibble: 2 × 1
full_name
<chr>
1 Jos Louis
2 May West
3.5.4.2 separate
In the opposite scenario, where you have a full_name column but would rather have a first_name and a last_name column, you can use the separate() function.
my_tibble = separate(my_tibble,
                     full_name,                           # column to separate.
                     sep = " ",                           # character string to use as separator.
                     into = c("first_name", "last_name")) # names of the new columns.
print(my_tibble)
# A tibble: 2 × 2
first_name last_name
<chr> <chr>
1 Jos Louis
2 May West
3.5.4.3 separate_rows
The separate_rows() function is similar to the separate() function, but instead of creating new columns it creates new rows. In the following example, we have a tibble containing two articles, with a title column and an authors column in which the authors are separated by a semi-colon.
# This creates a tibble containing the names of authors in a single cell with a separator.
my_tibble = tibble(title = c("awesome article","boring article"),
                   authors = c("Toze, S.; Brown, A.","Smith, J.; Roberts, J."))

print(my_tibble)
print(my_tibble)
# A tibble: 2 × 2
title authors
<chr> <chr>
1 awesome article Toze, S.; Brown, A.
2 boring article Smith, J.; Roberts, J.
This is not tidy data, so we want to separate the authors so that there is only one author per cell. This will create a tidy associative table between articles and authors. Let’s use separate_rows() to do that.
my_tibble = separate_rows(my_tibble, authors, sep = "; ")

print(my_tibble)
# A tibble: 4 × 2
title authors
<chr> <chr>
1 awesome article Toze, S.
2 awesome article Brown, A.
3 boring article Smith, J.
4 boring article Roberts, J.
Notice how our separator in the code above was not just a semi-colon, but a semi-colon followed by a space. That is because in the original tibble there were two characters between the end of one author’s name and the beginning of the next: a semi-colon and a space. If we had used just the semi-colon as our separator in separate_rows(), the space would have been kept as the first character of every author name except the first.
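You can see the difference with a quick base R check using strsplit() (shown here only to illustrate the separator, not as part of the tidying workflow):

```r
# With "; " as the separator, the names come out clean.
strsplit("Toze, S.; Brown, A.", "; ", fixed = TRUE)[[1]]
# "Toze, S."  "Brown, A."

# With ";" alone, the second name keeps a leading space.
strsplit("Toze, S.; Brown, A.", ";", fixed = TRUE)[[1]]
# "Toze, S."  " Brown, A."
```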
3.5.5 Nested data
The XML file import mentioned earlier is a good opportunity to explore the process of tidying nested data. Converting an XML file to the tidy format requires quite a few steps. Here is a fairly standard, step-by-step process that you can follow whenever you are dealing with XML files.
# load the xml2 package
library(xml2)
# read an xml file
xml = read_xml("https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml")
The xml object stores the data in its original XML format. You can verify that the file is indeed XML by writing the xml object to a file and looking at it.
write_xml(xml, file="c:/data/xml.xml")
The xml2 package provides a function, as_list(), to convert XML documents into an equivalent R list. Beware: as_list() should not be confused with as.list(), which is a completely different, base R, function.

list <- as_list(xml)
The next step is converting that list into a tibble. So let’s do that.
public_housing <- as_tibble(list)

print(public_housing)
# A tibble: 1 × 1
response
<named list>
1 <named list [342]>
At the moment it has only one column, called response, of the named list type, containing a single cell with 342 named lists. This is a nested structure, where a single cell contains a list that contains other lists that contain other lists.
3.5.5.1 unnest_longer
So, let’s unnest the data in the response column using the unnest_longer() function. This function makes the tibble longer by expanding it vertically (adding rows) with the unnested data.

public_housing <- unnest_longer(public_housing, response)
print(public_housing)
# A tibble: 342 × 2
response response_id
<named list> <chr>
1 <named list [22]> row
2 <named list [22]> row
3 <named list [21]> row
4 <named list [22]> row
5 <named list [22]> row
6 <named list [21]> row
7 <named list [22]> row
8 <named list [22]> row
9 <named list [22]> row
10 <named list [22]> row
# ℹ 332 more rows
Now we’re making progress. We have 342 observations, each occupying one line. This is one of the tidy criteria! However, the columns are not right: we can see that there are two columns, and the first one appears to contain 342 lists of 22 elements, which are probably the 22 variables that we want as our columns.
3.5.5.2 unnest_wider
Let’s unnest these 342 lists using the unnest_wider() function. This function makes the tibble wider by expanding it horizontally (adding columns) with the unnested data.

public_housing <- unnest_wider(public_housing, response)
print(public_housing)
# A tibble: 342 × 23
id property_project pid name address city postal_code
<list> <list> <list> <list> <list> <list> <list>
1 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
2 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
3 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <NULL>
4 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
5 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
6 <list [1]> <list [1]> <list [1]> <list [1]> <NULL> <list> <list [1]>
7 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
8 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
9 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
10 <list [1]> <list [1]> <list [1]> <list [1]> <list> <list> <list [1]>
# ℹ 332 more rows
# ℹ 16 more variables: number_of_floors <list>, residential_units <list>,
# housing_authority <list>, county <list>, elevator <list>, oil_heat <list>,
# electric_heat <list>, public_water <list>, well <list>, sewer <list>,
# onsite_septic <list>, municipality <list>, x_coordina <list>,
# y_coordina <list>, location <lgl>, response_id <chr>
We can now see that each cell is a list of one element.
3.5.5.3 unnest
Now let’s unnest the data contained in each cell with the unnest() function.

public_housing = unnest(public_housing, cols = c(id, property_project, pid, name, address, city, postal_code, number_of_floors, residential_units, housing_authority, county, elevator, oil_heat, electric_heat, public_water, well, sewer, onsite_septic, municipality, x_coordina, y_coordina))
This seems silly… there has to be a leaner way of writing this. Yes, there is. The names() function returns a vector of the column names of a data frame, so we can use this much leaner code:
# create a vector with the column names
colnames = names(public_housing)

# use that vector for the cols argument of the unnest() function.
public_housing = unnest(public_housing, cols = colnames)
print(public_housing)
# A tibble: 342 × 23
id property_project pid name address city postal_code number_of_floors
<lis> <list> <lis> <lis> <list> <lis> <list> <list>
1 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
2 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
3 <chr> <chr [1]> <chr> <chr> <chr> <chr> <NULL> <chr [1]>
4 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
5 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
6 <chr> <chr [1]> <chr> <chr> <NULL> <chr> <chr [1]> <chr [1]>
7 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
8 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
9 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
10 <chr> <chr [1]> <chr> <chr> <chr> <chr> <chr [1]> <chr [1]>
# ℹ 332 more rows
# ℹ 15 more variables: residential_units <list>, housing_authority <list>,
# county <list>, elevator <list>, oil_heat <list>, electric_heat <list>,
# public_water <list>, well <list>, sewer <list>, onsite_septic <list>,
# municipality <list>, x_coordina <list>, y_coordina <list>, location <lgl>,
# response_id <chr>
We now have a tibble with 342 observations of 22 variables. Looks good! But the values in the cells still don’t look right. There’s just one more round of unnesting to apply to all the cells, using the same code as in the last step.
public_housing = unnest(public_housing, cols = colnames)

print(public_housing)
# A tibble: 342 × 23
id property_project pid name address city postal_code number_of_floors
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 117 330101X 6503… Rive… 19/27/… Pict… B0K 1S0 2
2 324 200214X 5006… Bell… 10 Gra… Gran… B0E 1L0 2
3 247 100215X 1511… Sydn… 77 Ain… Sydn… <NA> 1
4 337 390201X 2522… Elgi… 22 Elg… Spri… B0M 1X0 1
5 224 100201X 1513… Will… 10 Wil… Sydn… B1N 1R4 2
6 289 150226X 1524… 198 … <NA> Flor… B0C 1J0 2
7 214 612301 9021… Trin… 3 Trin… Yarm… B5A 1P3 2
8 143 350801 2017… Youn… 130 Yo… Truro B2N 3X4 2
9 104 740101 6018… Ritc… 3809 H… Rive… B0J 2W0 2
10 174 430901 4017… Dr. … 3792 N… Hali… B3K 3G5 9
# ℹ 332 more rows
# ℹ 15 more variables: residential_units <chr>, housing_authority <chr>,
# county <chr>, elevator <chr>, oil_heat <chr>, electric_heat <chr>,
# public_water <chr>, well <chr>, sewer <chr>, onsite_septic <chr>,
# municipality <chr>, x_coordina <chr>, y_coordina <chr>, location <lgl>,
# response_id <chr>
Wow. Now it looks really good! But wait… there’s one more thing. Look at the data types of the columns: they are all characters, which doesn’t seem right because some of our columns appear to contain only numerical data. The readr package comes to the rescue with its easy-to-use type_convert() function.
public_housing = type_convert(public_housing)

print(public_housing)
# A tibble: 342 × 23
id property_project pid name address city postal_code
<dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 117 330101X 65032708 Riverton Heights I 19/27/… Pict… B0K 1S0
2 324 200214X 50067362 Bellevue 10 Gra… Gran… B0E 1L0
3 247 100215X 15113962 Sydney Senior Dupl… 77 Ain… Sydn… <NA>
4 337 390201X 25227687 Elgin Street Villa 22 Elg… Spri… B0M 1X0
5 224 100201X 15130651 William Street S/C 10 Wil… Sydn… B1N 1R4
6 289 150226X 15242738 198 Pitt Street <NA> Flor… B0C 1J0
7 214 612301 90212812 Trinity Place 3 Trin… Yarm… B5A 1P3
8 143 350801 20178653 Young St. Lodge 130 Yo… Truro B2N 3X4
9 104 740101 60187283 Ritcey's Cove Mano… 3809 H… Rive… B0J 2W0
10 174 430901 40177982 Dr. Prince Manor (… 3792 N… Hali… B3K 3G5
# ℹ 332 more rows
# ℹ 16 more variables: number_of_floors <dbl>, residential_units <dbl>,
# housing_authority <chr>, county <chr>, elevator <chr>, oil_heat <chr>,
# electric_heat <chr>, public_water <chr>, well <chr>, sewer <chr>,
# onsite_septic <chr>, municipality <chr>, x_coordina <dbl>,
# y_coordina <dbl>, location <lgl>, response_id <chr>
We have now converted an XML document into an easy-to-use, tidy dataset! Here is the entire process that we just went through in a single chunk of code.
xml = read_xml("https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml")
list = as_list(xml)
public_housing = as_tibble(list)
public_housing = unnest_longer(public_housing, col = response)
public_housing = unnest_wider(public_housing, col = response)
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = type_convert(public_housing)
public_housing
3.5.6 Combine data frames
Sometimes the data will come in separate files so you will have to combine the pieces into a single and tidy dataset.
3.5.6.1 bind_rows
The bind_rows() function concatenates data frames vertically, adding new columns when column names differ between data frames.
# Create two tibbles with student info
students1 <- tibble(student_id = c(1,2,3),
                    name = c("Francis","Sam","Amy"))

students2 <- tibble(name = c("Sue","Bill","Lucy"),
                    email = c("Sue123@fakemail.com","bill@info620.com","lucy@lucy.org"))
# Apply bind_rows function
bind_rows(students1, students2)
# A tibble: 6 × 3
student_id name email
<dbl> <chr> <chr>
1 1 Francis <NA>
2 2 Sam <NA>
3 3 Amy <NA>
4 NA Sue Sue123@fakemail.com
5 NA Bill bill@info620.com
6 NA Lucy lucy@lucy.org
3.5.6.2 bind_cols
The bind_cols() function concatenates entire data frames horizontally. The number of rows in the two data frames must be the same.
# Create a tibble with student info
students3 <- tibble(student_id = c(4,5,6),
                    name = c("Bill","Lucy","Sue"))

# Apply bind_cols function
bind_cols(students3, students2)
# A tibble: 3 × 4
student_id name...2 name...3 email
<dbl> <chr> <chr> <chr>
1 4 Bill Sue Sue123@fakemail.com
2 5 Lucy Bill bill@info620.com
3 6 Sue Lucy lucy@lucy.org
The result is a tibble with four columns: the two columns from the students3 tibble and the two columns from the students2 tibble. So the function did its job. However, notice how the order of the students was not the same in both tibbles? Also notice how there are now two columns with the names of the students? That’s because bind_cols() does nothing more than stick the two tibbles together without making any attempt to connect the data in one tibble to the data in the other tibble (e.g., matching the names so that the student_id and the emails are accurately paired). This also means that bind_cols() requires that both tibbles have the same number of rows; otherwise we get an error.
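To see that row-count requirement in action, here is a minimal sketch (the students4 tibble and its values are made up for illustration):

```r
library(dplyr)

students1 <- tibble(student_id = c(1, 2, 3),
                    name = c("Francis", "Sam", "Amy"))

# A hypothetical tibble with only two rows
students4 <- tibble(email = c("a@fakemail.com", "b@fakemail.com"))

# bind_cols() stops with an error because the row counts differ (3 vs. 2);
# tryCatch() lets us capture that error instead of halting the script.
result <- tryCatch(bind_cols(students1, students4),
                   error = function(e) "error: row counts differ")
print(result)
```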
An often better way of bringing together two tibbles is to use joins.
3.5.7 Join data frames
The join family of functions is used to combine rows and columns of data frames based on a matching criteria. This means that the number of rows and their order in the joined tibbles does not matter.
Let’s create two new tibbles to illustrate how the joins work.
instructors_schools <- tibble(name = c("Colin","Sam","Dominika","Alana"),
                              school = c("SIM","RSB","SPA","SRES"))

instructors_emails <- tibble(name = c("Colin", "Sam", "Sandy", "Lindsey"),
                             email = c("cc@dalhousie.uni","st@dalhousie.uni","ss@dalhousie.uni","lm@dalhousie.uni"))
3.5.7.1 left_join
The left_join() function returns all records from the first data frame and only the records from the second data frame for which the matching condition is TRUE. Here is an example where we start with a list of instructors and their schools (instructors_schools tibble) and want to obtain their emails (instructors_emails tibble).
left_join(instructors_schools, instructors_emails, by="name")
# A tibble: 4 × 3
name school email
<chr> <chr> <chr>
1 Colin SIM cc@dalhousie.uni
2 Sam RSB st@dalhousie.uni
3 Dominika SPA <NA>
4 Alana SRES <NA>
Notice how the name column is not duplicated? That is because the name is the key used for the matching of the two tibbles.
3.5.7.2 right_join
The right_join() function returns all records from the second data frame and only the records from the first data frame for which the matching condition is TRUE. Here is an example:
right_join(instructors_schools, instructors_emails, by="name")
# A tibble: 4 × 3
name school email
<chr> <chr> <chr>
1 Colin SIM cc@dalhousie.uni
2 Sam RSB st@dalhousie.uni
3 Sandy <NA> ss@dalhousie.uni
4 Lindsey <NA> lm@dalhousie.uni
Since we used a right_join() here, R kept all rows from the right table (the second one listed in the function call), and kept only the names and schools of the instructors that were found in the instructors_emails tibble.
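A handy way to think about it (a small sketch, not from the examples above): right_join(x, y) keeps the same rows as left_join(y, x); only the column order differs.

```r
library(dplyr)

instructors_schools <- tibble(name = c("Colin","Sam","Dominika","Alana"),
                              school = c("SIM","RSB","SPA","SRES"))
instructors_emails <- tibble(name = c("Colin", "Sam", "Sandy", "Lindsey"),
                             email = c("cc@dalhousie.uni","st@dalhousie.uni","ss@dalhousie.uni","lm@dalhousie.uni"))

# Same four rows as right_join(instructors_schools, instructors_emails, by = "name"),
# but with the email column appearing before the school column
left_join(instructors_emails, instructors_schools, by = "name")
```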
3.5.7.3 inner_join
The inner_join() function returns only the records for which the matching condition is TRUE. It excludes all records from both tibbles that were not matched. Here is an example:
inner_join(instructors_schools, instructors_emails, by="name")
# A tibble: 2 × 3
name school email
<chr> <chr> <chr>
1 Colin SIM cc@dalhousie.uni
2 Sam RSB st@dalhousie.uni
Colin and Sam were the only instructors for which we had both a school and an email.
3.5.7.4 anti_join
The anti_join() function is useful if you want to exclude observations (rows) from your tibble when the matching condition is TRUE. It returns the rows of the first data frame that have no match in the second.
anti_join(instructors_schools, instructors_emails, by="name")
# A tibble: 2 × 2
name school
<chr> <chr>
1 Dominika SPA
2 Alana SRES
3.5.7.5 full_join
The full_join() function is essentially a combination of the left_join() and right_join() functions. It will keep all the rows from both tibbles that are being joined.
full_join(instructors_schools, instructors_emails, by="name")
# A tibble: 6 × 3
name school email
<chr> <chr> <chr>
1 Colin SIM cc@dalhousie.uni
2 Sam RSB st@dalhousie.uni
3 Dominika SPA <NA>
4 Alana SRES <NA>
5 Sandy <NA> ss@dalhousie.uni
6 Lindsey <NA> lm@dalhousie.uni
Note: You can use the join functions without specifying the column(s) to use for the matching. In this case, all columns with the same names will be used for the matching. Because the name column is the only one that exists in both tibbles used in this example, the result of the following code is the same as in our inner_join() example above.
inner_join(instructors_schools, instructors_emails)
# A tibble: 2 × 3
name school email
<chr> <chr> <chr>
1 Colin SIM cc@dalhousie.uni
2 Sam RSB st@dalhousie.uni
3.5.8 Rename columns
For various reasons, we may want to rename columns in our tibble. We can do this with the rename() function. Let’s capitalize the first letter of the column names from our instructors_schools tibble.
Important: The syntax of the rename() function goes new_column_name = original_column_name.
instructors_schools <- rename(instructors_schools, Name = name, School = school)
print(instructors_schools)
# A tibble: 4 × 2
Name School
<chr> <chr>
1 Colin SIM
2 Sam RSB
3 Dominika SPA
4 Alana SRES
3.5.9 Select columns
The select() function is used to retrieve a subset of columns from a data frame or tibble.
# Select the School column from the instructors_schools tibble.
select(instructors_schools, School)
# A tibble: 4 × 1
School
<chr>
1 SIM
2 RSB
3 SPA
4 SRES
We can select and rename columns at the same time, like this:
# Select the School column from the instructors_schools tibble
# and rename it to Department.
select(instructors_schools, Department = School)
# A tibble: 4 × 1
Department
<chr>
1 SIM
2 RSB
3 SPA
4 SRES
3.5.10 Filter rows
The filter() function is used to retrieve a subset of rows from a data frame or tibble based on some criteria. For example, we can filter the instructors_schools tibble to keep only the rows where the School is "RSB".
filter(instructors_schools, School == "RSB")
# A tibble: 1 × 2
Name School
<chr> <chr>
1 Sam RSB
We can also use the %in% operator to select observations for which the value is in a vector.
filter(instructors_schools, School %in% c("RSB","SPA"))
# A tibble: 2 × 2
Name School
<chr> <chr>
1 Sam RSB
2 Dominika SPA
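filter() can also take multiple conditions, combined with & (and) or | (or). Here is a small sketch of the & case (this combination is not in the examples above):

```r
library(dplyr)

instructors_schools <- tibble(Name = c("Colin","Sam","Dominika","Alana"),
                              School = c("SIM","RSB","SPA","SRES"))

# Both conditions must be TRUE for a row to be kept
filter(instructors_schools, School == "RSB" & Name == "Sam")
```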
3.5.11 Modify or create new variables
The mutate() function is used to create a new variable or update an existing one.
# This adds a column called Faculty that contains the string "Faculty of Management".
mutate(instructors_schools, Faculty = "Faculty of Management")
# A tibble: 4 × 3
Name School Faculty
<chr> <chr> <chr>
1 Colin SIM Faculty of Management
2 Sam RSB Faculty of Management
3 Dominika SPA Faculty of Management
4 Alana SRES Faculty of Management
# This updates the already existing variable Faculty with the new value "FoM".
mutate(instructors_schools, Faculty = "FoM")
# A tibble: 4 × 3
Name School Faculty
<chr> <chr> <chr>
1 Colin SIM FoM
2 Sam RSB FoM
3 Dominika SPA FoM
4 Alana SRES FoM
Note that you can also compute the value assigned to a variable within the mutate() function.
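For instance, here is a minimal sketch of a computed value (the years_at_dal column and its values are made up for illustration):

```r
library(dplyr)

# Hypothetical tibble with a numeric column
instructors <- tibble(Name = c("Colin", "Sam"),
                      years_at_dal = c(10, 4))

# months_at_dal is calculated from the existing years_at_dal column
mutate(instructors, months_at_dal = years_at_dal * 12)
```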
You can also use conditional statements in the mutate() function. For example, say we want to add a new column with the names of the schools spelled out. Of course, the values in that new column will depend on the value in the School column. Since there are more than two schools, we should use the case_when() function, like in the following example.
instructors_schools <- mutate(instructors_schools,
                              school_name = case_when(School == "RSB" ~ "Rowe School of Business",
                                                      School == "SIM" ~ "School of Information Management",
                                                      School == "SPA" ~ "School of Public Administration",
                                                      School == "SRES" ~ "School for Resources and Environmental Studies"))
We can also use ifelse(condition, value if TRUE, value if FALSE) when there are only two possible outcomes. For example, say some of our instructors are on leave, and we want to add a column to our instructors_schools tibble that indicates whether the instructor is on leave or not. We could do it this way.
# Vector of people on leave
instructors_on_leave <- c("Dominika","Colin")

# Creates a variable called on_leave that is 1 if the person is on leave and 0 otherwise
instructors_schools <- mutate(instructors_schools,
                              on_leave = ifelse(Name %in% instructors_on_leave, 1, 0))
print(instructors_schools)
# A tibble: 4 × 4
Name School school_name on_leave
<chr> <chr> <chr> <dbl>
1 Colin SIM School of Information Management 1
2 Sam RSB Rowe School of Business 0
3 Dominika SPA School of Public Administration 1
4 Alana SRES School for Resources and Environmental Studies 0
3.6 Export data
Now that you have a nice tidy data set, you may want to export it so you can save it for future use and not have to repeat the data collection and tidying process, or maybe because you want to share it with others.
3.6.1 Export data frames
Exporting data frames is very easy in R (the hardest part was getting there). The readr package has a bunch of functions to write files. The most common are:
- write_csv() for comma-separated files (and its cousin write_csv2(), which uses the European format: commas instead of dots for decimals, and semicolons as delimiters);
- write_tsv() for tab-delimited files;
- write_delim(), which allows you to specify the separator (it can be anything you want, even the word “banana”);
- write_xlsx(), from the writexl package, which can be used to write Excel files.
Here’s how we would use these functions to write data to a file. Note that, in the examples, we use the col_names = TRUE argument to ensure that the first line of the created files will contain the column names. The default setting of write_xlsx() bolds and centers the column names. In the example below, we chose to disable this feature with the format_headers = FALSE argument.
library(readr)
write_csv(my_tibble, file = "my_data.csv", col_names = TRUE)
write_csv2(my_tibble, file = "my_data.csv", col_names = TRUE)
write_tsv(my_tibble, file = "my_data.tsv", col_names = TRUE)
write_delim(my_tibble, file = "my_data.txt", delim = "\t", col_names = TRUE)
library(writexl)
write_xlsx(my_tibble, path = "my_data.xlsx", col_names = TRUE, format_headers = FALSE)
The write_xlsx() function can also write multiple tibbles into different sheets of a single Excel file, which can be quite useful. To do that, we need to provide a list of tibbles rather than a single tibble as the input to the function. Assuming that we have already created three tibbles named my_tibble1, my_tibble2, and my_tibble3, here’s how we would create a list containing them and export it into an Excel file.
# Create a list of tibbles
my_tibble_list <- list(my_tibble1, my_tibble2, my_tibble3)
write_xlsx(my_tibble_list, path="my_data.xlsx", col_names = TRUE, format_headers = FALSE)
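One more handy detail: if you name the elements of the list, write_xlsx() uses those names as the sheet names in the Excel file. A small sketch with made-up tibbles:

```r
library(dplyr)
library(writexl)

# Hypothetical tibbles for illustration
my_tibble1 <- tibble(x = 1:3)
my_tibble2 <- tibble(y = 4:6)

# The list names ("Students" and "Grades") become the Excel sheet names
my_tibble_list <- list(Students = my_tibble1, Grades = my_tibble2)
write_xlsx(my_tibble_list, path = "my_data.xlsx", col_names = TRUE, format_headers = FALSE)
```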
3.6.2 JSON
Writing JSON files from a data frame with the jsonlite package’s toJSON() function is just as easy as reading them with the fromJSON() function. The following code reads a CSV file containing the Titanic dataset, and then exports the data to a JSON file. Note that the pretty = TRUE argument spaces out the JSON file so it is more readable for humans.
# Reads the CSV file into a data frame.
titanic <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")
# Writes the data frame to a JSON file.
write_file(toJSON(titanic, pretty=TRUE), "titanic.json")
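To go the other way, fromJSON() accepts a file path (or a URL) and returns a data frame. A quick sketch, assuming the titanic.json file created above exists in the working directory:

```r
library(jsonlite)

# Reads the JSON file back into a data frame.
titanic_back <- fromJSON("titanic.json")
head(titanic_back)
```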
3.7 Writing readable code
There are many ways of writing code to achieve the same process. You can structure the data in different ways and use different functions from different packages, but you can also use the exact same functions in exactly the same way with very different code. Most of our examples so far were relatively simple and the processes did not involve a large number of steps. When they did, most of the steps were coded sequentially, one statement at a time. This makes all the steps explicit and clear, but it also makes the code longer than it needs to be. We can also write our code so that the entire process is in a single statement that contains other statements, which contain other statements, and so on. Let’s use an example to compare the two approaches. Say we want to use the Titanic dataset to create a CSV file that contains only the name, sex, and age of the survivors. The steps would be:
- Load the Titanic dataset.
- Convert it into a tibble (this step is not actually necessary, we use it to add an extra step to the process).
- Filter the dataset to include only the survivors.
- Select only the Name, Sex, and Age columns.
- Write the tibble to a .csv file.
We can write this in a single statement, like this.
write_csv(select(filter(as_tibble(read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")),Survived == 1), Name,Sex,Age), file = "titanic_survivors.csv", col_names = TRUE)
Just like in mathematical equations, the operations are completed from the inside out. So the first step, read_csv(), is in the middle, and it is inside the statement for the second step, as_tibble(), which is inside the third statement, filter(, Survived == 1), and so on. The outer statement is the write_csv() function, with its arguments, file = "titanic_survivors.csv" and col_names = TRUE, appearing at the very end of the code.
This may be efficient in terms of the number of characters needed to write the code, but perhaps a little difficult to read.
Now let’s do the same thing by writing a distinct statement for each step.
# Load the Titanic dataset.
data <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")

# Convert it into a tibble.
data <- as_tibble(data)

# Filter the dataset to include only the survivors.
data <- filter(data, Survived == 1)

# Select only the Name, Sex, and Age columns.
data <- select(data, Name, Sex, Age)
# Write the tibble to a .csv file.
write_csv(data, "titanic_survivors.csv")
This is very clear but perhaps not so efficient in terms of characters (the name of the tibble gets repeated many times). Also, all five statements will need to be executed one by one, or we will need to select all the code to run it at once. This is perhaps not such a big deal, but the single-statement version does have an edge there.
3.7.1 The pipe
The dplyr and magrittr packages (automatically loaded with the tidyverse) include an extremely useful and popular operator called the pipe, which looks like this in R: %>%. The keyboard shortcut is CTRL+SHIFT+M. The pipe is a wonderful tool for building lean code that is easier to read and to debug. It allows you to write multiple steps sequentially in a single statement. It combines the readability of the step-by-step approach with the efficiency of the single “embedded statement” approach. Here’s the same process written with the pipe.
read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv") %>%
as_tibble() %>%
filter(Survived == 1) %>%
select(Name, Sex, Age) %>%
write_csv("titanic_survivors.csv", col_names = TRUE)
When reading this code, I like to read the %>% as “and then”. So this would read: read the CSV, and then make it a tibble, and then limit the tibble to rows for which the Survived column contains 1, and then select only the Name, Sex, and Age columns, and then write the tibble to a file named “titanic_survivors.csv”.
As you can see, this is a lean and easily readable process which doesn’t require us to create an object and explicitly read, process, and overwrite it at each step of the process. Note that having each step on a separate line is common and good practice, but it is not required. In fact, R does not care about extra spaces, indentations, or line breaks in the code. We could have written this statement in a single line like this.
read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv") %>% as_tibble() %>% filter(Survived == 1) %>% select(Name, Sex, Age) %>% write_csv("titanic_survivors.csv", col_names = TRUE)
3.8 Additional resources
There are many cheatsheets available for R packages that can be very useful to quickly see what the different functions of the packages can do.