3  Reading and tidying data

3.1 Learning objectives

  • Import data in R
  • Make data tidy with the tidyverse
  • Export data
  • Use the pipe to write clearer code

3.2 Tidy data

Data is stored in all kinds of places, can be accessed in many different ways, and comes in all kinds of shapes and forms. Therefore, much of a data scientist’s work involves collecting, processing, and cleaning data to get it ready for analysis.

Tidy data is a set of principles adapted from the relational model (those of you who took my data management course will be familiar with the relational model). According to Wickham (2014), those principles are:

  1. Each variable forms a column.

  2. Each observation forms a row.

  3. Each type of observational unit forms a table.

While it may be implicit in principles 1 and 2, it is perhaps worth adding as a fourth principle that each cell should contain a single value. For example, comma-separated strings are not tidy.

Here’s an example of what tidy data looks like, taken from the Titanic dataset.

PassengerId Survived Pclass Sex Age Ticket Fare
1 0 3 male 22 A/5 21171 7.2500
2 1 1 female 38 PC 17599 71.2833
3 1 3 female 26 STON/O2. 3101282 7.9250
4 1 1 female 35 113803 53.1000
5 0 3 male 35 373450 8.0500
6 0 3 male NA 330877 8.4583

We can see that each row is a single observation (a passenger), that each column is a single variable, and that no cell contains multiple values.

Now let’s take a look at a dataset that is not tidy.

course students grades
INFO6270 Francis, Sam, Amy 92 (A+), 86 (A), 84 (A-)
INFO6540 Francis, Sam, Amy 77 (B+), 100 (A+), 74(B)
INFO5500 Francis, Sam, Amy 86 (A), 70 (B-), 99 (A+)

We can see that for each course, the students are listed in a single cell, and so are their grades. To make this data tidy, the first thing we might want to do is separate the students and their grades so that each occupies its own row, and each cell contains only one student or grade.

course students grades
INFO6270 Francis 92 (A+)
INFO6270 Francis 86 (A)
INFO6270 Francis 84 (A-)
INFO6270 Sam 92 (A+)
INFO6270 Sam 86 (A)
INFO6270 Sam 84 (A-)
INFO6270 Amy 92 (A+)
INFO6270 Amy 86 (A)
INFO6270 Amy 84 (A-)
INFO6540 Francis 77 (B+)
INFO6540 Francis 100 (A+)
INFO6540 Francis 74(B)
INFO6540 Sam 77 (B+)
INFO6540 Sam 100 (A+)
INFO6540 Sam 74(B)
INFO6540 Amy 77 (B+)
INFO6540 Amy 100 (A+)
INFO6540 Amy 74(B)
INFO5500 Francis 86 (A)
INFO5500 Francis 70 (B-)
INFO5500 Francis 99 (A+)
INFO5500 Sam 86 (A)
INFO5500 Sam 70 (B-)
INFO5500 Sam 99 (A+)
INFO5500 Amy 86 (A)
INFO5500 Amy 70 (B-)
INFO5500 Amy 99 (A+)

This is much better, although we still have the issue of the numeric and letter grades being lumped together in a cell. To make this data truly tidy, we want to separate the numeric and the letter grades, like this:

course students grade_numeric grade_letter
INFO6270 Francis 92 A+
INFO6270 Francis 86 A
INFO6270 Francis 84 A-
INFO6270 Sam 92 A+
INFO6270 Sam 86 A
INFO6270 Sam 84 A-
INFO6270 Amy 92 A+
INFO6270 Amy 86 A
INFO6270 Amy 84 A-
INFO6540 Francis 77 B+
INFO6540 Francis 100 A+
INFO6540 Francis 74 B
INFO6540 Sam 77 B+
INFO6540 Sam 100 A+
INFO6540 Sam 74 B
INFO6540 Amy 77 B+
INFO6540 Amy 100 A+
INFO6540 Amy 74 B
INFO5500 Francis 86 A
INFO5500 Francis 70 B-
INFO5500 Francis 99 A+
INFO5500 Sam 86 A
INFO5500 Sam 70 B-
INFO5500 Sam 99 A+
INFO5500 Amy 86 A
INFO5500 Amy 70 B-
INFO5500 Amy 99 A+

That’s it, now we have a tidy dataset of grades! You can read more about tidy data in the R for Data Science book by Wickham and Grolemund (2016), or in the journal article by Wickham (2014).
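As a preview, the tidying steps above can be sketched with tidyr functions that are introduced later in this chapter. This is just a sketch: the untidy table is re-created here so the code is self-contained.

```r
library(tidyverse)

# Re-create the untidy grades table from above.
grades <- tibble(
  course   = c("INFO6270", "INFO6540", "INFO5500"),
  students = rep("Francis, Sam, Amy", 3),
  grades   = c("92 (A+), 86 (A), 84 (A-)",
               "77 (B+), 100 (A+), 74 (B)",
               "86 (A), 70 (B-), 99 (A+)")
)

# One student per row, then one grade per row (this cross-joins students
# and grades, reproducing the 27-row intermediate table above).
grades <- separate_rows(grades, students, sep = ", ")
grades <- separate_rows(grades, grades, sep = ", ")

# Split "92 (A+)" into a numeric grade and a letter grade.
grades <- separate(grades, grades,
                   into = c("grade_numeric", "grade_letter"), sep = " ")
grades <- mutate(grades,
                 grade_numeric = as.numeric(grade_numeric),
                 grade_letter  = str_remove_all(grade_letter, "[()]"))
```

Don’t worry about the details yet; each of these functions is covered in section 3.5.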

3.3 The tidyverse

The tidyverse is a collection of R packages and functions designed to help you make data tidy and work with tidy data. You can read more about the tidyverse and its packages here: https://www.tidyverse.org. The code below loads the tidyverse and returns the list of packages it includes.

library(tidyverse)
tidyverse_packages()
 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"    

In the next sections, we will use a few of the tidyverse packages (though not exclusively) to import data into R from multiple types of sources. Because the tidy format is fairly standard and not R-specific, you will often find that the datasets you work with already respect the tidy principles. But you’ll also come across other data structures and formats, so we’ll learn how to tidy up data.

3.4 Import data

The following sections show how to load data into R from different sources and formats.

Important note

Some of the processes and code in this section are beyond the level of R proficiency that you are expected to have at this point, or even at the end of the course. They are included here so that:

  1. You can gain awareness of the different ways in which data can be accessed through R.
  2. You can gain awareness of the different file formats and data structures that R can handle.
  3. You can have access to working pieces of code that you can use to get whatever data you need into R.

So if you do not feel like you thoroughly understand all the code and processes shown in section 3.4 and its subsections, do not worry, and focus on the general tasks that the code accomplishes.

3.4.1 Delimited file formats

The readr package (https://readr.tidyverse.org/) has a few useful functions for reading delimited file formats like comma-separated values (.csv), tab-separated values (.tsv), or any other type of delimiter. Here are a few examples (if you want to run the examples on your computer, you can download the Titanic dataset in different formats here).

# Imports a comma-separated file and saves it into a data frame called titanic
titanic <- read_csv("titanic.csv")

# Same, but with a tab-separated file (works with tab-separated .txt files also)
titanic <- read_tsv("titanic.tsv")

# Same, but with a txt file in which the columns are separated with a vertical bar.
titanic <- read_delim("titanic.txt", delim = "|") 

# Same, but reading the file directly from a URL.
titanic <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")

As you can see from the third example above, you can specify any delimiter using the delim argument of the read_delim() function. You should also note that tab-delimited text files (.txt) are extremely common. You can read these files with the read_tsv() function even if the file has the .txt extension. Alternatively, you can use the read_delim() function with delim = "\t".
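To see that the two approaches are equivalent, here is a small self-contained sketch that writes a tiny tab-delimited .txt file to a temporary location and reads it back both ways:

```r
library(tidyverse)

# Write a small tab-delimited .txt file so this example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("PassengerId\tSurvived", "1\t0", "2\t1"), path)

# These two calls produce the same data frame for a tab-delimited file:
titanic <- read_tsv(path)
titanic <- read_delim(path, delim = "\t")
```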

Often, you will see examples where the path to a file is first stored in an object, so the object name, rather than the whole path, can be used in a function. Like this:

path <- "https://pmongeon.github.io/info6270/files/data/titanic.csv"
titanic <- read_csv(path)

Another useful package is readxl (https://readxl.tidyverse.org/), which has functions to help you read data from Excel files, such as read_xlsx().

library(readxl)

path <- "https://pmongeon.github.io/info6270/files/data/halifax_weather.xlsx"

# This will read the first sheet of the Excel file
halifax_weather <- read_xlsx(path)

# This will read the second sheet of the Excel file
halifax_weather <- read_xlsx(path, sheet = 2)

# This will also read the second sheet of the Excel file 
halifax_weather <- read_xlsx(path, sheet = "out")

# If you don't know what the names of the sheets are, you can read them like this 
excel_sheets(path)

3.4.2 JSON files

The JSON format is very popular for exchanging information on the web, and it is the typical format of the data retrieved from APIs (covered in the next section). However, the process for reading a JSON file and reading JSON data from an application programming interface (API) are slightly different. This is how to convert a JSON file into a data frame using the fromJSON() function from the jsonlite package (included in the tidyverse).

library(jsonlite)
data <- fromJSON("https://pmongeon.github.io/info6270/files/data/public_housing_ns.json")

Important note: importing and working with JSON files with simple structures is relatively easy. However, reading more complex JSON files might require a little more work. Reading complex JSON structures is beyond the scope of this chapter.
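To illustrate what a simple structure looks like, here is a sketch with a small, invented JSON array of objects supplied as a string rather than a file:

```r
library(jsonlite)

# A small, invented JSON array of objects, passed as a string.
json <- '[{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]'

# For simple structures like this, fromJSON() returns a data frame directly,
# with one row per object and one column per key.
people <- fromJSON(json)
```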

3.4.3 Application programming interface (API)

APIs allow you to interact with computers, servers, or software by making different requests, such as sending or retrieving data through the web. Some APIs can be used freely and anonymously; others might require that you identify yourself with your email, for instance. Finally, some APIs require that you create an account to obtain an API key to use in your code. Fully understanding how APIs work and being proficient with them is beyond the scope of this course, but you should at least know that they exist and that they can be used to collect data for your data science projects. The R package that helps you work with APIs is httr. Below is an example of a request to retrieve data from a free, anonymous API.

Important note: You do not need to understand how all of the code works. I mainly want you to have a working piece of code that you can use as a template if you ever need to retrieve data from an API.

# Load the httr package
library(httr)

# Make a request to an API to GET data
data = GET("https://openlibrary.org/api/books?bibkeys=OLID:OL22123296M&format=json")

# Isolate the content part of the information received from the API. This step is
# necessary because the GET request returns additional information about the request 
# along with the requested data 
data = content(data)

# Load the data.table package for the rbindlist() function used in the code below.
# You may need to install the package first if you don't have it installed already.
library(data.table)

# Converts the data retrieved from the API call to a tibble
data = as_tibble(rbindlist(data))

3.4.4 XML files

The xml2 package (also included in the tidyverse) provides a set of functions to work with data in the XML format. Converting XML data into a data frame is not so straightforward because of the nested structure of XML. If you need to import XML data into R, the following code performs a series of steps that convert an XML file from the data.novascotia.ca website into a tibble. We will explore this code in more detail later in this chapter, when we look more closely at the unnest() function.

Important note: the goal here is mainly to provide you with a piece of code that works, so that you can use it as a template to read XML files if you need to and convert them into data frames.

library(xml2)

path <- "https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml"

public_housing_xml = as_list(read_xml(path))
public_housing = as_tibble(public_housing_xml) 
public_housing = unnest_longer(public_housing, colnames(public_housing)[1])
public_housing = unnest_wider(public_housing, colnames(public_housing)[1]) 
public_housing = unnest(public_housing, cols = names(public_housing))  
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = type_convert(public_housing)

3.4.5 Connections

Important note: This section is a little bit advanced, so please do not feel like you need to master database and file connections at this point in your R journey. Again, I include this here to raise your awareness of connection and give you some working pieces of code you can use if you ever want or need to access data stored in a large file or database.

3.4.5.1 Connect to a file

If you are trying to read a very large file, you may run into issues because the objects in your environment are stored in your computer’s memory (RAM). So if your computer has 4GB of RAM, then files larger than 4GB can’t be imported into a data frame. One solution is to create a connection to the file using the file() function, and then process the data a certain number of lines at a time with the readLines() function. Here is an example.

# store the path to a text file containing data
path <- "https://pmongeon.github.io/info6270/files/data/titanic.txt"

# open a connection to the file (note that connections are stored as objects)
con <- file(path, open="rb")

# Read the first 5 lines and print them
readLines(con, n = 5)
[1] "PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked"               
[2] "1|0|3|Braund, Mr. Owen Harris|male|22|1|0|A/5 21171|7.25|NA|S"                                 
[3] "2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Thayer)|female|38|1|0|PC 17599|71.2833|C85|C"
[4] "3|1|3|Heikkinen, Miss. Laina|female|26|0|0|STON/O2. 3101282|7.925|NA|S"                        
[5] "4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35|1|0|113803|53.1|C123|S"           
# Read the NEXT 5 lines and print them.
readLines(con, n=5) 
[1] "5|0|3|Allen, Mr. William Henry|male|35|0|0|373450|8.05|NA|S"                              
[2] "6|0|3|Moran, Mr. James|male|NA|0|0|330877|8.4583|NA|Q"                                    
[3] "7|0|1|McCarthy, Mr. Timothy J|male|54|0|0|17463|51.8625|E46|S"                            
[4] "8|0|3|Palsson, Master. Gosta Leonard|male|2|3|1|349909|21.075|NA|S"                       
[5] "9|1|3|Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)|female|27|0|2|347742|11.1333|NA|S"
# Close the connection to the file.
close(con)

If you wanted to go through the entire file, 50 lines at a time, and do something with each chunk, you could use a loop like this:

# open the connection to a text file containing data.
path <- "https://pmongeon.github.io/info6270/files/data/titanic.txt"
con <- file(path, open="rb")

# this will read 50 lines and print them until the end of the file has been reached.
repeat { 
  x <- readLines(con, n = 50)
  if (is_empty(x)) {
    break
  } else { 
    print(x)
  }
}

# Close the connection to the file.
close(con)

3.4.5.2 Connect to a database

You can also connect to practically any type of local or remote database or server, as long as you have the required credentials. For example, the RMySQL package is great for working with MySQL databases. You can connect to a MySQL database (provided that you have a username, a password, and the other information required by the dbConnect() function). With the RSQLite package, you can also work with local SQL databases stored in a single file on your computer with SQLite. The code below shows you how to connect to MySQL or SQLite databases in R.

## This creates a connection to a MySQL database for which you would need 
## access credentials.

library(RMySQL)
con = dbConnect(MySQL(),
               user = "username",
               password = "password",
               host = "host",
               port = port_number, 
               dbname = "database name")

# This opens a connection called con to the info6270 SQLite database.
# If the database doesn't exist, this code will also create it.
library(RSQLite)
con <- dbConnect(drv = SQLite(), dbname= "C:/info6270.db")

Once the connection is established, you can interact with the database with a set of functions, some of which are presented in the following code:

# lists the tables in the database accessed through the con connection.
dbListTables(con)

# lists the fields of a table
dbListFields(con,"table_name")

# import the results of a SQL query 
data <- dbGetQuery(con, "SQL query")

# import all the data from a table
data <- dbReadTable(con, "table name")

# upload a data frame to a table (change the options as needed)
dbWriteTable(con, "table name", data, row.names = FALSE, overwrite = FALSE, append = TRUE)

3.5 Tidy your data

Now we know how to import data into R. However, not all data comes in a nice and tidy shape, even if the data is already shaped like a table. In this section, we’ll explore how to change the structure of your data frames with tidyr (https://tidyr.tidyverse.org/).

3.5.1 Reshape data

The reshape functions include pivot_longer() and its opposite, pivot_wider(). Let’s look at how they work. First we create a tibble with fictional amounts of funding received from the Canadian Tri-Council by some Nova Scotian universities.

# This creates a funding tibble with 4 columns and some data.
funding <- tibble(university = as.character(c("DAL","SMU","SFX")),
                 SSHRC = as.numeric(sample(1:100,3)),
                 NSERC = as.numeric(sample(1:100,3)),
                 CIHR = as.numeric(sample(1:100,3)))

# Note: the sample(1:100,3) function randomly chooses three values between 1 and 100.
funding
# A tibble: 3 × 4
  university SSHRC NSERC  CIHR
  <chr>      <dbl> <dbl> <dbl>
1 DAL           70    96    44
2 SMU            8    12   100
3 SFX           59    97    79

3.5.1.1 pivot_longer

The pivot_longer() function makes your data longer by stacking multiple columns into two: one holding the column names and one holding the values.

funding = pivot_longer(funding,
                       cols = c("SSHRC", "NSERC", "CIHR"),
                       names_to = "funder",
                       values_to = "funding_amount")

funding
# A tibble: 9 × 3
  university funder funding_amount
  <chr>      <chr>           <dbl>
1 DAL        SSHRC              70
2 DAL        NSERC              96
3 DAL        CIHR               44
4 SMU        SSHRC               8
5 SMU        NSERC              12
6 SMU        CIHR              100
7 SFX        SSHRC              59
8 SFX        NSERC              97
9 SFX        CIHR               79

3.5.1.2 pivot_wider

The pivot_wider() function does the opposite of pivot_longer(): it takes one column of names and another of values, creates a new column for each name, and stores the corresponding values in it. The following example will perhaps make this clearer.

funding = pivot_wider(funding, names_from = funder, values_from = funding_amount)

funding
# A tibble: 3 × 4
  university SSHRC NSERC  CIHR
  <chr>      <dbl> <dbl> <dbl>
1 DAL           70    96    44
2 SMU            8    12   100
3 SFX           59    97    79

3.5.2 Expand tables

Let’s create a new funding tibble to explore some more tidyr functions.

funding <- tibble(university = as.character(c("DAL","DAL","DAL","SMU","SMU","SFX","SFX")),
                 funder = as.character(c("SSHRC","NSERC","CIHR","SSHRC","CIHR","NSERC","CIHR")),
                 n_grants = as.numeric(sample(1:100, 7))) 

print(funding)
# A tibble: 7 × 3
  university funder n_grants
  <chr>      <chr>     <dbl>
1 DAL        SSHRC        78
2 DAL        NSERC        58
3 DAL        CIHR         95
4 SMU        SSHRC        82
5 SMU        CIHR         30
6 SFX        NSERC        71
7 SFX        CIHR         43

3.5.2.1 expand

The expand() function creates all possible combinations of the values in the two or more columns indicated in its arguments and drops the other columns. The example below returns the possible combinations of universities and funders in the funding tibble we created. You can see that the SMU - NSERC combination and the SFX - SSHRC combination appear even though they are not present in the original funding tibble.

expand(funding, university, funder)
# A tibble: 9 × 2
  university funder
  <chr>      <chr> 
1 DAL        CIHR  
2 DAL        NSERC 
3 DAL        SSHRC 
4 SFX        CIHR  
5 SFX        NSERC 
6 SFX        SSHRC 
7 SMU        CIHR  
8 SMU        NSERC 
9 SMU        SSHRC 

3.5.2.2 complete

The complete() function does the same thing as the expand() function but it keeps all the columns of the original tibble.

funding = complete(funding, university, funder)

print(funding)
# A tibble: 9 × 3
  university funder n_grants
  <chr>      <chr>     <dbl>
1 DAL        CIHR         95
2 DAL        NSERC        58
3 DAL        SSHRC        78
4 SFX        CIHR         43
5 SFX        NSERC        71
6 SFX        SSHRC        NA
7 SMU        CIHR         30
8 SMU        NSERC        NA
9 SMU        SSHRC        82

3.5.3 Handle missing values

Sometimes your dataset is incomplete for one reason or another. Such a reason could be a participant to a survey that did not complete the survey or answer all the questions. What to do with the missing values or incomplete records depends on the nature of the data and the goal of your analysis.

3.5.3.1 drop_na

The drop_na() function is useful for removing incomplete observations from the data (i.e., rows for which one of the columns contains no value). The code below includes a statement that removes all incomplete observations, and one where we specify the column names that should be scanned for NA values (rows won’t be deleted if there are NAs in the other columns).

# This removes all incomplete observations
drop_na(funding) 
# A tibble: 7 × 3
  university funder n_grants
  <chr>      <chr>     <dbl>
1 DAL        CIHR         95
2 DAL        NSERC        58
3 DAL        SSHRC        78
4 SFX        CIHR         43
5 SFX        NSERC        71
6 SMU        CIHR         30
7 SMU        SSHRC        82
# This would remove rows where the n_grants column is NA and leave NA values in other
# columns.
drop_na(funding, n_grants)
# A tibble: 7 × 3
  university funder n_grants
  <chr>      <chr>     <dbl>
1 DAL        CIHR         95
2 DAL        NSERC        58
3 DAL        SSHRC        78
4 SFX        CIHR         43
5 SFX        NSERC        71
6 SMU        CIHR         30
7 SMU        SSHRC        82

3.5.3.2 replace_na

Instead of deleting rows with missing values, we may wish to replace NAs with some other value (e.g., 0). We can do that with replace_na(). You have to provide, in a list, the columns for which you want to replace NAs and the values to replace them with.

replace_na(funding, list(n_grants = 0))
# A tibble: 9 × 3
  university funder n_grants
  <chr>      <chr>     <dbl>
1 DAL        CIHR         95
2 DAL        NSERC        58
3 DAL        SSHRC        78
4 SFX        CIHR         43
5 SFX        NSERC        71
6 SFX        SSHRC         0
7 SMU        CIHR         30
8 SMU        NSERC         0
9 SMU        SSHRC        82

If we want to replace NAs in multiple columns, we simply add more column names and replacement values to the list. The code below replaces the NAs with 0 in two columns of a hypothetical tibble.

replace_na(some_tibble, list(some_column = 0, some_other_column = 0))

Caution

Replacing null values with zeros for a numerical variable may seem logical, and it may be appropriate for your intended purpose. However, NA and 0 are not logically equivalent: NA implies the absence of data (something that we did not or could not observe), while 0 implies an observed or measured value of 0.

3.5.4 Merge and split cells

Let’s create a simple tibble with some names.

my_tibble <- tibble(first_name = c("Jos","May"),
                 last_name = c("Louis","West"))

print(my_tibble)
# A tibble: 2 × 2
  first_name last_name
  <chr>      <chr>    
1 Jos        Louis    
2 May        West     

3.5.4.1 unite

In some scenarios, it may be more useful to have the first and last names combined into a full_name column. This can be achieved with the unite() function.

my_tibble = unite(my_tibble,
                  "first_name","last_name", # columns to unite.
                  col = "full_name", # name of the new column.
                  sep = " ") # Tells R to separate the first and last name with a space.

print(my_tibble)
# A tibble: 2 × 1
  full_name
  <chr>    
1 Jos Louis
2 May West 

3.5.4.2 separate

In the opposite scenario where you have a full_name column but would rather have a first_name and a last_name column, you can use the separate() function to do that.

my_tibble = separate(my_tibble,
                     full_name, # column to separate.
                     sep = " ", # character string to use as separator.
                     into = c("first_name", "last_name")) # names of the new columns.

print(my_tibble)
# A tibble: 2 × 2
  first_name last_name
  <chr>      <chr>    
1 Jos        Louis    
2 May        West     

3.5.4.3 separate_rows

The separate_rows() function is similar to the separate() function, but instead of creating new columns, it creates new rows. In the following example, we have a tibble containing two articles, with a title column and an authors column in which the authors are separated by a semi-colon.

# This creates a tibble containing the names of authors in a single cell with a separator.
my_tibble = tibble(title = c("awesome article","boring article"),
                   authors = c("Toze, S.; Brown, A.","Smith, J.; Roberts, J."))

print(my_tibble)
# A tibble: 2 × 2
  title           authors               
  <chr>           <chr>                 
1 awesome article Toze, S.; Brown, A.   
2 boring article  Smith, J.; Roberts, J.

This is not tidy data, so we want to separate the authors so that there is only one author per cell. This will create a tidy associative table between articles and authors. Let’s use separate_rows() to do that.

my_tibble = separate_rows(my_tibble, authors, sep = "; ")

print(my_tibble)
# A tibble: 4 × 2
  title           authors    
  <chr>           <chr>      
1 awesome article Toze, S.   
2 awesome article Brown, A.  
3 boring article  Smith, J.  
4 boring article  Roberts, J.

Choose your separator carefully

Notice how our separator in the code above was not just a semi-colon, but a semi-colon followed by a space. That is because in the original tibble, there were two characters between the end of one author’s name and the beginning of the next: a semi-colon and a space. If we had used just the semi-colon as our separator in separate_rows(), the space would have been kept as the first character of every author’s name except the first.
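To see the difference, here is a small sketch that re-creates one row of the tibble above and compares the two separators:

```r
library(tidyverse)

my_tibble <- tibble(title = "awesome article",
                    authors = "Toze, S.; Brown, A.")

# Separator ";" alone: every author after the first keeps a leading space.
wrong <- separate_rows(my_tibble, authors, sep = ";")
# wrong$authors contains "Toze, S." and " Brown, A." (note the leading space)

# Separator "; ": the names come out clean.
right <- separate_rows(my_tibble, authors, sep = "; ")
# right$authors contains "Toze, S." and "Brown, A."
```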

3.5.5 Nested data

The XML file import mentioned earlier is a good opportunity to explore the process of tidying nested data. Converting an XML file to the tidy format requires quite a few steps. Here is a fairly standard, step-by-step process that you can follow whenever you are dealing with XML files.

# load the xml2 package
library(xml2)

# read an xml file 
xml = read_xml("https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml")

The xml object stores the data in its original XML format. You can verify that the data is indeed XML by writing the xml object to a file and looking at it.

write_xml(xml, file="c:/data/xml.xml")

The xml2 package provides a function, as_list(), to convert XML documents into an equivalent R list. Beware: as_list() should not be confused with as.list(), which is a completely different base R function.

list <- as_list(xml)

The next step is converting that list into a tibble. So let’s do that.

public_housing <- as_tibble(list)
print(public_housing)
# A tibble: 1 × 1
  response          
  <named list>      
1 <named list [342]>

At the moment it has only one column, called “response”, of the named list type, containing a single cell with 342 named lists. This is a nested structure, where a single cell contains a list that contains other lists, which in turn contain other lists.

3.5.5.1 unnest_longer

So, let’s unnest the data in the response column using the unnest_longer() function. This function makes the tibble longer by expanding it vertically (adding rows) with the unnested data.

public_housing <- unnest_longer(public_housing, response)

print(public_housing)
# A tibble: 342 × 2
   response          response_id
   <named list>      <chr>      
 1 <named list [22]> row        
 2 <named list [22]> row        
 3 <named list [21]> row        
 4 <named list [22]> row        
 5 <named list [22]> row        
 6 <named list [21]> row        
 7 <named list [22]> row        
 8 <named list [22]> row        
 9 <named list [22]> row        
10 <named list [22]> row        
# ℹ 332 more rows

Now we’re making progress. We have 342 observations, each occupying one row. This satisfies one of the tidy criteria! However, the columns are not right. We can see that there are two columns. The first one appears to contain 342 lists of 22 elements (21 for rows with a missing value), which are probably the 22 variables that we want as our columns.

3.5.5.2 unnest_wider

Let’s unnest these 342 lists using the unnest_wider() function. This function makes the tibble wider by expanding it horizontally (adding columns) with the unnested data.

public_housing <- unnest_wider(public_housing, response)

print(public_housing)
# A tibble: 342 × 23
   id         property_project pid        name       address city   postal_code
   <list>     <list>           <list>     <list>     <list>  <list> <list>     
 1 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 2 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 3 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <NULL>     
 4 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 5 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 6 <list [1]> <list [1]>       <list [1]> <list [1]> <NULL>  <list> <list [1]> 
 7 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 8 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
 9 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
10 <list [1]> <list [1]>       <list [1]> <list [1]> <list>  <list> <list [1]> 
# ℹ 332 more rows
# ℹ 16 more variables: number_of_floors <list>, residential_units <list>,
#   housing_authority <list>, county <list>, elevator <list>, oil_heat <list>,
#   electric_heat <list>, public_water <list>, well <list>, sewer <list>,
#   onsite_septic <list>, municipality <list>, x_coordina <list>,
#   y_coordina <list>, location <lgl>, response_id <chr>

We can now see that each cell is a list of one element.

3.5.5.3 unnest

Now let’s unnest the data contained in each cell with the unnest() function.

public_housing = unnest(public_housing, cols = c(id, property_project, pid, name, address, city, postal_code, number_of_floors, residential_units, housing_authority, county, elevator, oil_heat, electric_heat, public_water, well, sewer, onsite_septic, municipality, x_coordina, y_coordina))

This seems silly… there has to be a leaner way of writing this. Yes, there is. The names() function returns a vector of the column names of a data frame. So we can use this much leaner code:

#create the vector with the column names
colnames = names(public_housing)

#use that vector for the cols argument of the unnest() function.
public_housing = unnest(public_housing, cols = colnames)

print(public_housing)
# A tibble: 342 × 23
   id    property_project pid   name  address city  postal_code number_of_floors
   <lis> <list>           <lis> <lis> <list>  <lis> <list>      <list>          
 1 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 2 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 3 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <NULL>      <chr [1]>       
 4 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 5 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 6 <chr> <chr [1]>        <chr> <chr> <NULL>  <chr> <chr [1]>   <chr [1]>       
 7 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 8 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
 9 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
10 <chr> <chr [1]>        <chr> <chr> <chr>   <chr> <chr [1]>   <chr [1]>       
# ℹ 332 more rows
# ℹ 15 more variables: residential_units <list>, housing_authority <list>,
#   county <list>, elevator <list>, oil_heat <list>, electric_heat <list>,
#   public_water <list>, well <list>, sewer <list>, onsite_septic <list>,
#   municipality <list>, x_coordina <list>, y_coordina <list>, location <lgl>,
#   response_id <chr>

We now have a tibble with 342 observations of 23 variables. Looks good! But the values in the cells still don’t look right. There’s just one more round of unnesting to apply to all the cells, using the same code as in the last step.

public_housing = unnest(public_housing, cols = colnames)

print(public_housing)
# A tibble: 342 × 23
   id    property_project pid   name  address city  postal_code number_of_floors
   <chr> <chr>            <chr> <chr> <chr>   <chr> <chr>       <chr>           
 1 117   330101X          6503… Rive… 19/27/… Pict… B0K 1S0     2               
 2 324   200214X          5006… Bell… 10 Gra… Gran… B0E 1L0     2               
 3 247   100215X          1511… Sydn… 77 Ain… Sydn… <NA>        1               
 4 337   390201X          2522… Elgi… 22 Elg… Spri… B0M 1X0     1               
 5 224   100201X          1513… Will… 10 Wil… Sydn… B1N 1R4     2               
 6 289   150226X          1524… 198 … <NA>    Flor… B0C 1J0     2               
 7 214   612301           9021… Trin… 3 Trin… Yarm… B5A 1P3     2               
 8 143   350801           2017… Youn… 130 Yo… Truro B2N 3X4     2               
 9 104   740101           6018… Ritc… 3809 H… Rive… B0J 2W0     2               
10 174   430901           4017… Dr. … 3792 N… Hali… B3K 3G5     9               
# ℹ 332 more rows
# ℹ 15 more variables: residential_units <chr>, housing_authority <chr>,
#   county <chr>, elevator <chr>, oil_heat <chr>, electric_heat <chr>,
#   public_water <chr>, well <chr>, sewer <chr>, onsite_septic <chr>,
#   municipality <chr>, x_coordina <chr>, y_coordina <chr>, location <lgl>,
#   response_id <chr>

Wow. Now it looks really good! But wait… there’s one more thing. Look at the data types of the columns. They are all characters, which doesn’t seem right because some of our columns appear to contain only numerical data. The readr package comes to the rescue with its easy-to-use type_convert() function.

public_housing = type_convert(public_housing)

print(public_housing)
# A tibble: 342 × 23
      id property_project      pid name                address city  postal_code
   <dbl> <chr>               <dbl> <chr>               <chr>   <chr> <chr>      
 1   117 330101X          65032708 Riverton Heights I  19/27/… Pict… B0K 1S0    
 2   324 200214X          50067362 Bellevue            10 Gra… Gran… B0E 1L0    
 3   247 100215X          15113962 Sydney Senior Dupl… 77 Ain… Sydn… <NA>       
 4   337 390201X          25227687 Elgin Street Villa  22 Elg… Spri… B0M 1X0    
 5   224 100201X          15130651 William Street S/C  10 Wil… Sydn… B1N 1R4    
 6   289 150226X          15242738 198 Pitt Street     <NA>    Flor… B0C 1J0    
 7   214 612301           90212812 Trinity Place       3 Trin… Yarm… B5A 1P3    
 8   143 350801           20178653 Young St. Lodge     130 Yo… Truro B2N 3X4    
 9   104 740101           60187283 Ritcey's Cove Mano… 3809 H… Rive… B0J 2W0    
10   174 430901           40177982 Dr. Prince Manor (… 3792 N… Hali… B3K 3G5    
# ℹ 332 more rows
# ℹ 16 more variables: number_of_floors <dbl>, residential_units <dbl>,
#   housing_authority <chr>, county <chr>, elevator <chr>, oil_heat <chr>,
#   electric_heat <chr>, public_water <chr>, well <chr>, sewer <chr>,
#   onsite_septic <chr>, municipality <chr>, x_coordina <dbl>,
#   y_coordina <dbl>, location <lgl>, response_id <chr>

We have now converted an XML document into an easy to use tidy dataset! Here is the entire process that we just went through in a single chunk of code.

xml = read_xml("https://data.novascotia.ca/api/views/2d4m-9e6x/rows.xml")
list = as_list(xml)
public_housing = as_tibble(list) 
public_housing = unnest_longer(public_housing, col = response)
public_housing = unnest_wider(public_housing, col = response) 
public_housing = unnest(public_housing, cols = names(public_housing))  
public_housing = unnest(public_housing, cols = names(public_housing))
public_housing = type_convert(public_housing)

3.5.6 Combine data frames

Sometimes the data will come in separate files, so you will have to combine the pieces into a single, tidy dataset.

3.5.6.1 bind_rows

The bind_rows() function concatenates data frames vertically, adding new columns when column names differ between data frames.

# Create two tibbles with student info
students1 <- tibble(student_id = c(1,2,3),
                   name = c("Francis","Sam","Amy"))

students2 <- tibble(name = c("Sue","Bill","Lucy"),
                    email = c("Sue123@fakemail.com","bill@info620.com","lucy@lucy.org"))

# Apply bind_rows function
bind_rows(students1, students2)
# A tibble: 6 × 3
  student_id name    email              
       <dbl> <chr>   <chr>              
1          1 Francis <NA>               
2          2 Sam     <NA>               
3          3 Amy     <NA>               
4         NA Sue     Sue123@fakemail.com
5         NA Bill    bill@info620.com   
6         NA Lucy    lucy@lucy.org      

3.5.6.2 bind_cols

The bind_cols() function concatenates data frames horizontally. The number of rows in the two data frames must be the same.

# Create a tibble with student info
students3 <- tibble(student_id = c(4,5,6),
                   name = c("Bill","Lucy","Sue"))


# Apply bind_cols function
bind_cols(students3, students2)
# A tibble: 3 × 4
  student_id name...2 name...3 email              
       <dbl> <chr>    <chr>    <chr>              
1          4 Bill     Sue      Sue123@fakemail.com
2          5 Lucy     Bill     bill@info620.com   
3          6 Sue      Lucy     lucy@lucy.org      

The result is a tibble with four columns: the two columns from the students3 tibble and the two columns from the students2 tibble. So the function did its job. However, notice how the order of the students was not the same in both tibbles? Also notice how there are now two columns with the names of the students? That’s because bind_cols() does nothing more than stick the two tibbles together, without making any attempt to connect the data from one tibble to the data in the other (e.g. matching the name so that the student_id and the emails are accurately paired). This also means that bind_cols() requires that both tibbles have the same number of rows, otherwise we get an error.
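Here is a minimal sketch of that last point, using two small hypothetical tibbles with different numbers of rows:

```r
library(dplyr)

# Two tibbles with different numbers of rows (3 vs 2).
a <- tibble(x = 1:3)
b <- tibble(y = 1:2)

# bind_cols() refuses to combine them; tryCatch() lets us capture the error
# instead of stopping the script.
result <- tryCatch(bind_cols(a, b),
                   error = function(e) "error: row counts differ")
print(result)
```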

An often better way of bringing together two tibbles is to use joins.

3.5.7 Join data frames

The join family of functions is used to combine rows and columns of data frames based on a matching criteria. This means that the number of rows and their order in the joined tibbles does not matter.

Let’s create two new tibbles to illustrate how the joins work.

instructors_schools <- tibble(name = c("Colin","Sam","Dominika","Alana"),
                              school = c("SIM","RSB","SPA","SRES"))
  
instructors_emails <- tibble(name = c("Colin", "Sam", "Sandy", "Lindsey"),
                             email = c("cc@dalhousie.uni","st@dalhousie.uni","ss@dalhousie.uni","lm@dalhousie.uni")) 

3.5.7.1 left_join

The left_join() function returns all records from the first data frame and only records from the second data frame for which the matching condition is TRUE. Here is an example where we start with a list of instructors and their school (instructors_schools tibble) and want to obtain their email (instructors_emails tibble).

left_join(instructors_schools, instructors_emails, by="name")
# A tibble: 4 × 3
  name     school email           
  <chr>    <chr>  <chr>           
1 Colin    SIM    cc@dalhousie.uni
2 Sam      RSB    st@dalhousie.uni
3 Dominika SPA    <NA>            
4 Alana    SRES   <NA>            

Notice how the name column is not duplicated? That is because the name is the key used for the matching of the two tibbles.

3.5.7.2 right_join

The right_join() function returns all records from the second data frame and only the records from the first data frame for which the matching condition is TRUE. Here is an example:

right_join(instructors_schools, instructors_emails, by="name")
# A tibble: 4 × 3
  name    school email           
  <chr>   <chr>  <chr>           
1 Colin   SIM    cc@dalhousie.uni
2 Sam     RSB    st@dalhousie.uni
3 Sandy   <NA>   ss@dalhousie.uni
4 Lindsey <NA>   lm@dalhousie.uni

Since we used a right_join() here, R kept all rows from the right table (the second one passed to the function), and the school column is filled in only for the instructors that also appear in the instructors_schools tibble.

3.5.7.3 inner_join

The inner_join() function returns only records for which the matching condition is TRUE. It excludes all records that were not matched from both tibbles. Here is an example:

inner_join(instructors_schools, instructors_emails, by="name")
# A tibble: 2 × 3
  name  school email           
  <chr> <chr>  <chr>           
1 Colin SIM    cc@dalhousie.uni
2 Sam   RSB    st@dalhousie.uni

Colin and Sam were the only instructors for which we had both a school and an email.

3.5.7.4 anti_join

The anti_join() function is useful if you want to exclude observations (rows) in your tibble when the matching condition is TRUE.

anti_join(instructors_schools, instructors_emails, by="name")
# A tibble: 2 × 2
  name     school
  <chr>    <chr> 
1 Dominika SPA   
2 Alana    SRES  

3.5.7.5 full_join

The full_join() function is essentially a combination of the left_join() and right_join() functions. It will keep all the rows from both tibbles that are being joined.

full_join(instructors_schools, instructors_emails, by="name")
# A tibble: 6 × 3
  name     school email           
  <chr>    <chr>  <chr>           
1 Colin    SIM    cc@dalhousie.uni
2 Sam      RSB    st@dalhousie.uni
3 Dominika SPA    <NA>            
4 Alana    SRES   <NA>            
5 Sandy    <NA>   ss@dalhousie.uni
6 Lindsey  <NA>   lm@dalhousie.uni

Note: You can use the join functions without specifying the column(s) to use for the matching. In this case, all columns with the same name in both tibbles will be used for the matching. Because the name column is the only one that exists in both tibbles used in this example, the result of the following code is the same as in our inner_join() example above.

inner_join(instructors_schools, instructors_emails)
# A tibble: 2 × 3
  name  school email           
  <chr> <chr>  <chr>           
1 Colin SIM    cc@dalhousie.uni
2 Sam   RSB    st@dalhousie.uni

3.5.8 Rename columns

For various reasons, we may want to rename columns in our tibble. We can do this with the rename() function. Let’s capitalize the first letter of the column names from our instructors_schools tibble.

Important: The syntax of the rename() function goes new_column_name = original_column_name.

instructors_schools <- rename(instructors_schools, Name = name, School = school)

print(instructors_schools)
# A tibble: 4 × 2
  Name     School
  <chr>    <chr> 
1 Colin    SIM   
2 Sam      RSB   
3 Dominika SPA   
4 Alana    SRES  

3.5.9 Select columns

The select() function is used to retrieve a subset of columns from a data frame or tibble.

# Select the School column from the instructors_schools tibble.
select(instructors_schools, School)
# A tibble: 4 × 1
  School
  <chr> 
1 SIM   
2 RSB   
3 SPA   
4 SRES  

We can select and rename columns at the same time, like this:

# Select the School column from the instructors_schools tibble 
# and rename it to Department.
select(instructors_schools, Department = School)
# A tibble: 4 × 1
  Department
  <chr>     
1 SIM       
2 RSB       
3 SPA       
4 SRES      

3.5.10 Filter rows

The filter() function is used to retrieve a subset of rows from a data frame or tibble based on some criteria. For example, we can filter the instructors_schools tibble to keep only the rows where the School column equals "RSB".

filter(instructors_schools, School == "RSB")
# A tibble: 1 × 2
  Name  School
  <chr> <chr> 
1 Sam   RSB   

We can also use the %in% operator to select observations for which the value is in a vector.

filter(instructors_schools, School %in% c("RSB","SPA"))
# A tibble: 2 × 2
  Name     School
  <chr>    <chr> 
1 Sam      RSB   
2 Dominika SPA   

3.5.11 Modify or create new variables

The mutate() function is used to create a new variable or update an existing one.

# This adds a column called Faculty that contains the string "Faculty of Management".
mutate(instructors_schools, Faculty = "Faculty of Management")
# A tibble: 4 × 3
  Name     School Faculty              
  <chr>    <chr>  <chr>                
1 Colin    SIM    Faculty of Management
2 Sam      RSB    Faculty of Management
3 Dominika SPA    Faculty of Management
4 Alana    SRES   Faculty of Management
# This updates the already existing variable Faculty with the new value "FoM".
mutate(instructors_schools, Faculty = "FoM")
# A tibble: 4 × 3
  Name     School Faculty
  <chr>    <chr>  <chr>  
1 Colin    SIM    FoM    
2 Sam      RSB    FoM    
3 Dominika SPA    FoM    
4 Alana    SRES   FoM    

Note that you can also compute the value assigned to a variable within the mutate() function.
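For example, here is a small sketch that computes a value for each row (the name_length column is hypothetical, and a fresh tibble is created so the example is self-contained):

```r
library(dplyr)

instructors <- tibble(Name = c("Colin", "Sam"),
                      School = c("SIM", "RSB"))

# nchar() counts the characters in each name, so the computed
# value differs from row to row.
instructors <- mutate(instructors, name_length = nchar(Name))
print(instructors)
```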

You can also use conditional statements in the mutate() function. For example, say we want to add a new column with the name of the schools spelled out. Of course, the values in that new column will depend on the value in the School column. Since there are more than two schools, we should use the case_when() function, like in the following example.

instructors_schools <- mutate(instructors_schools, 
                              school_name = case_when(School == "RSB" ~ "Rowe School of Business",
                                                      School == "SIM" ~ "School of Information Management",
                                                      School == "SPA" ~ "School of Public Administration",
                                                      School == "SRES" ~ "School for Resources and Environmental Studies"))

We can also use ifelse(condition, value if TRUE, value if FALSE) when there are only two possible outcomes. For example, say some of our instructors are on leave, and we want to add a column to our instructors_schools tibble that indicates whether the instructor is on leave or not. We could do it this way.

# vector of people on leave
instructors_on_leave <- c("Dominika","Colin")

# Creates a variable called on_leave that is 1 if the person is on leave and 0 otherwise
instructors_schools <- mutate(instructors_schools, 
                              on_leave = ifelse(Name %in% instructors_on_leave,1,0))

print(instructors_schools)
# A tibble: 4 × 4
  Name     School school_name                                    on_leave
  <chr>    <chr>  <chr>                                             <dbl>
1 Colin    SIM    School of Information Management                      1
2 Sam      RSB    Rowe School of Business                               0
3 Dominika SPA    School of Public Administration                       1
4 Alana    SRES   School for Resources and Environmental Studies        0

3.6 Export data

Now that you have a nice tidy data set, you may want to export it so you can save it for future use and not have to repeat the data collection and tidying process, or maybe because you want to share it with others.

3.6.1 Export data frames

Exporting data frames is very easy in R (the hardest part was getting there). The readr package has a bunch of functions to write files. The most common are:

  • write_csv() for comma-separated files (and its cousin write_csv2(), which uses the European format: commas instead of dots for decimals, and semicolons as delimiters);

  • write_tsv() for tab-delimited files;

  • write_delim(), which allows you to specify the delimiter (it can be anything you want, even the word “banana”);

  • write_xlsx() from the writexl package can be used to write Excel files.

Here’s how we would use these functions to write data to a file. Note that, in the examples, we use the col_names = TRUE argument to ensure that the first line of the created files contains the column names. The default setting of write_xlsx() bolds and centers the column names; in the example below, we chose to disable this feature with the format_headers = FALSE argument.

library(readr)
write_csv(my_tibble, file = "my_data.csv", col_names = TRUE)

write_csv2(my_tibble, file = "my_data2.csv", col_names = TRUE) 

write_tsv(my_tibble, file = "my_data.tsv", col_names = TRUE)

write_delim(my_tibble, file = "my_data.txt", delim = "\t", col_names = TRUE)

library(writexl)
write_xlsx(my_tibble, path = "my_data.xlsx", col_names = TRUE, format_headers = FALSE)

The write_xlsx() function can also write multiple tibbles into different sheets of a single Excel file, which can be quite useful. To do that, we need to provide a list of tibbles rather than a single tibble as the input to the function.

Assume that we have already created three tibbles named my_tibble1, my_tibble2, and my_tibble3. Here’s how we would create a list containing the three tibbles and export it to an Excel file.

# Create a list of tibbles
my_tibble_list <- list(my_tibble1, my_tibble2, my_tibble3)

write_xlsx(my_tibble_list, path="my_data.xlsx", col_names = TRUE, format_headers = FALSE)

3.6.2 JSON

Writing JSON files from a data frame with the toJSON() function is just as easy as reading them with the fromJSON() function. Both functions come from the jsonlite package. The following code reads a CSV file containing the Titanic dataset and then exports the data to a JSON file. Note that the pretty = TRUE argument spaces out the JSON file so it is more readable for humans.

library(jsonlite)

# Reads the CSV file into a data frame.
titanic <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")

# Writes the data frame to a JSON file.
write_file(toJSON(titanic, pretty=TRUE), "titanic.json") 
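To illustrate the reading side, here is a minimal sketch that parses a small made-up JSON string into a data frame with fromJSON():

```r
library(jsonlite)

# A JSON array of objects maps naturally to the rows of a data frame.
json_string <- '[{"Name":"Francis","Age":30},{"Name":"Sam","Age":25}]'
passengers <- fromJSON(json_string)
print(passengers)
```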

3.7 Writing readable code

There are many ways of writing code to achieve the same result. You can structure the data in different ways and use different functions from different packages, but you can also use the exact same functions in exactly the same way with very different code. Most of our examples so far were relatively simple and the processes did not involve a large number of steps. When they did, most of the steps were coded sequentially, one statement at a time. This makes all the steps explicit and clear, but it also makes the code longer than it needs to be. We can also write our code so that the entire process is in a single statement that contains other statements, which contain other statements, and so on. Let’s use an example to compare the two approaches. Say we want to use the Titanic dataset to create a CSV file that contains only the name, sex, and age of the survivors. The steps would be:

  1. Load the Titanic dataset.
  2. Convert it into a tibble (this step is not actually necessary, we use it to add an extra step to the process).
  3. Filter the dataset to include only the survivors.
  4. Select only the Name, Sex, and Age columns.
  5. Write the tibble to a .csv file.

We can write this in a single statement, like this.

write_csv(select(filter(as_tibble(read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")),Survived == 1), Name,Sex,Age), file = "titanic_survivors.csv", col_names = TRUE)

Just like in mathematical equations, the operations are completed from the inside out. So the first step, read_csv(), is in the middle, and it is inside the statement for the second step, as_tibble(), which is inside the third statement, filter(, Survived == 1), and so on. The outer statement is the write_csv() function, with its arguments, file = "titanic_survivors.csv" and col_names = TRUE, appearing at the very end of the code.

This may be efficient in terms of the number of characters needed to write the code, but perhaps a little difficult to read.

Now lets do the same thing by writing distinct statements for each step.

# Load the Titanic dataset.
data <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")

# Convert it into a tibble
data <- as_tibble(data)

# Filter the dataset to include only the survivors.
data <- filter(data, Survived == 1)

# Select only the Name, Sex, and Age columns.
data <- select(data, Name, Sex, Age)

# Write the tibble to a .csv file.
write_csv(data, "titanic_survivors.csv")

This is very clear but perhaps not so efficient in terms of characters (the name of the tibble gets repeated many times). Also, all five statements will need to be executed one by one, or we will need to select all the code to run it at once. This is perhaps not such a big deal, but the single-statement version does have an edge there.

3.7.1 The pipe

The dplyr and magrittr packages (automatically loaded with the tidyverse) include an extremely useful and popular operator called the pipe, which looks like this in R: %>%. The keyboard shortcut is CTRL+SHIFT+M. The pipe is a wonderful tool for building lean code that is easier to read and to debug. It allows you to write multiple steps sequentially in a single statement, combining the readability of the step-by-step approach with the efficiency of the single “embedded statement” approach. Here’s the same process written with the pipe.

read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv") %>% 
  as_tibble() %>% 
  filter(Survived == 1) %>% 
  select(Name, Sex, Age) %>% 
  write_csv("titanic_survivors.csv", col_names = TRUE)

When reading this code, I like to read the %>% as “and then”. So this would read, read the CSV, and then make it a tibble, and then limit the tibble to rows for which the Survived column contains 1, and then select only the Name, Sex, and Age columns, and then write the tibble to a file named “titanic_survivors.csv”.

As you can see, this is a lean and easily readable process that doesn’t require us to create an object and explicitly read, process, and overwrite it at each step. Note that having each step on a separate line is common and good practice, but it is not required. In fact, R does not care about extra spaces, indentations, or line breaks in the code. We could have written this statement in a single line like this.

read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv") %>% as_tibble() %>% filter(Survived == 1) %>% select(Name, Sex, Age) %>% write_csv("titanic_survivors.csv", col_names = TRUE)

3.8 Additional resources

There are many cheatsheets available for R packages; they are very useful for quickly seeing what a package’s functions can do.