1 Thinking about data

1.1 Learning objectives

Understanding the concepts of data, dataset, database, data management, and data management systems.
Understanding the difference between unstructured, semi-structured, and structured data.

1.2 What is data?

The Merriam-Webster online dictionary provides three definitions of the word data:

factual information (such as measurements or statistics) used as a basis for reasoning, discussion, or calculation.

information in digital form that can be transmitted or processed

information output by a sensing device or organ that includes both useful and irrelevant or redundant information and must be processed to be meaningful

— Merriam-Webster dictionary

Together, these definitions offer us a set of key elements from which we can build a broad understanding of the concept of data. The first key term is factual information or fact suggesting that data is objective and, like the rest of the definition shows, is used for a given purpose, such as discussing reasoning or decision-making.

The second definition is related to using the word data in a computational or communicational sense, where data is the “thing” that is being stored, transmitted, received, processed, etc.

While the first definition suggests that humans and machines use data for processes such as decisions and calculations, the third definition highlights that data does not only exist in nature but can also be created by humans and machines, either purposefully or not.

While we often think of data as things found in spreadsheets and stored in computers or filing cabinets, data is much more than that. Data is everywhere around us all the time in the form of energy and sound or light waves, for instance. Our sensory organs are data captors that pick up data from our environment. Our brains process, structure, and possibly store the data so we can consciously or unconsciously use it now or later as a basis for decisions and actions. That said, in this course, we will not concern ourselves with this kind of data and process. Instead, we will focus on digitally recorded data, the kind that we can store in a computer.

1.3 Datasets

We often encounter the term “dataset” on the web or in our workplaces, and I think it is worth writing a few lines to relate the terms to the other terms we will use in this course. The terms data and dataset will often be used interchangeably since dataset literally means a set of data, and data is the plural of datum. One difference, in principle, is that datasets are usually assembled for a given purpose. In research, for instance, a dataset will be the exact collection of data collected for the analysis. In supervised machine learning, we distinguish between training and testing datasets. When a professor sends you an excel file with data to work with for an assignment, that’s a dataset. You find datasets when you browse websites like kaggle.com, zenodo.org, or dataverse.org. Datasets are also static, whereas databases can be dynamic.

1.4 Databases

What is a database? According to the Merriam-Webster dictionary, a database is “a usually large collection of data organized especially for rapid search and retrieval (as by a computer)”. The keyword here is organized, highlighting that databases are both products and tools for data management.

Databases are usually created and managed for some purposes. These purposes may be specific (e.g. keeping track of a store’s inventory) or broad (tracking socioeconomic trends). Depending on their purposes, databases can vary in size and complexity. Any organized data collection could be considered a database, even if it is as basic as an Excel spreadsheet with the names and addresses of your friends or your to-do list.

A database can contain or be used to create multiple datasets, but a dataset would typically not contain multiple databases. Of course, this does not mean that datasets are always drawn from databases. For example, datasets can be created by surveying or interviewing people or recording observations of natural phenomena.

Note, however, that those differences are not hard truths, as some datasets may serve a greater variety of users and purposes than some databases.

1.5 Database management systems (DBMS)

A Database Management System (DBMS) is software that supports the development, maintenance, security, and use of databases. You will often come across the DBMS acronym with different suffixes attached to it, such as RDBMS (Relational DBMS), OODBMS (Object-Oriented DBMS), or ORDBMS (Object-Relational DBMS). Note that all these DBMS generally offer the same basic features. The main difference is that they work with different data types and structures. In this course you will use one RDBMS called pgAdmin to put implement your database design on a live server. You will also be introduced to the two popular DBMSs: MongoDB and Neo4j for document databases and graph databases, respectively.

Most DBMS are pretty easy to get started with, and some will provide standard templates for tables and data structures. But as Hernandez (2013) writes in the introduction of their popular database design book, diving in without proper preparation and relying too much on templates is an easy way to end up with a design that is inefficient or does not work. A bad database design will lead to data problems such as missing, incorrect, or mismatched data. It will fail to play its role in supporting the organization or serving the users.

1.6 Data management

So far in this chapter, we explored the concepts of data and its different levels of structure, datasets, databases and data management systems. Aside from related to data, what do all these concepts have in common? They are what data managers work with. Data managers unlock the potential of data for a given purpose, individual, group or organization, by developing and implementing data strategies and processes such as data retrieval, processing, cleaning, storage and analysis.

The value of data depends on the purpose it serves. Thus, good data management requires a good understanding of both the data and the needs of its users so that optimal data strategies can be developed and implemented. In this course, you will get familiar with several tools to manage data and data-related processes and design a database that addresses the needs of an organization or group.

1.7 From unstructured to structured data

Data can take many forms that we can situate along a continuum with unstructured data at one extreme, structured data at the other, and semi-structured data in between. Let us consider a simple example to better grasp the difference between structured and unstructured data and, at the same time, to get a sense of the process of structuring data so that it can be more effectively strored and analyzed.

Here is a brilliant piece of writing:

A data story
Philippe Mongeon
Chapter 1
Once upon a time, a random internet user created a short text and abandoned it in a sea of unstructured data.
Chapter 2
An algorithm passed by and decided to add XML tags to the text.
Chapter 3
A data manager stumbled upon the data and thought it would be relevant for the database users. So they decided to add it to their relational database.
The end.

That’s unstructured data (not entirely, the line breaks are a very basic structure). We humans who have read a book before can see the structure, but for a computer, this is just text. So now, let us give this text a little bit more structure by applying XML tags.

<book>
<title>A data story</title>
<author>Philippe Mongeon</author>
<h1>Chapter 1</h1>
<p>Once upon a time a short text was created by a random internet user and abandoned in a sea of unstructured data.</p>
<h1>Chapter 2</h1>
<p>An algorithm passed by and decided to add xml tags to the text.</p>
<h1>Chapter 3</h1>
<p>A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to include it in her relational database.</p>
<p>The end.</p>
</book>

The XML tags already help us (and the computer) identify different text components, such as the title, the author, headers and paragraphs. XML documents are a typical example of semi-structured data.

Another frequently used semi-structured format is JSON, which in this case would look like this:

{
  "title": "A data story",
  "author": "Philippe Mongeon",
  "Chapter 1": "Once upon a time a short text was created by a random internet user and abandoned in a sea of unstructured data.",
  "Chapter 2": "An algorithm passed by and decided to add xml tags to the text.",
  "Chapter 3": "A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to her relational database. The end."
}

Notice how I included the Chapter number in the tags here in this example? That was an arbitrary choice, and I could have chosen another structure for my file, such as this one:

{
  "title": "A data story",
  "author": "Philippe Mongeon",
  "Chapters": [
      {
        "title": "Chapter 1",
        "content": "Once upon a time a short text was created by a random internet user and abandoned in a sea of unstructured data."
      }
      {
        "title": "Chapter 2",
        "content": "An algorithm passed by and decided to add xml tags to the text."
      }
      {
        "title": "Chapter 3",
        "content": "A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to her relational database. The end."
      }
  ]
}

This structure (or semi-structure) is just as good as the previous one. That flexibility is one of the reasons why semi-structured data formats and databases are growing in popularity. Because the data structure is included directly in the document, the users do not need to follow a predefined structure and use predefined tags to store their data in the database. This means that, theoretically, every record could have a different structure, which would be fine, but most likely not optimal. We will explore this type of structure and a popular DBMSs that uses it (MongoDB) later in the course.

Structured data and databases are typically tabular (like an Excel spreadsheet). Each row is a record or entry, and each column is a field, feature, variable, etc. The structure is part of the database design (and not each database entry, as in the semi-structured format), and so every new record has to follow that same format. For example, here is the same work of art in a structured format.

Example of structured data
Title	Author	Section	Content
A data story	Philippe Mongeon	Chapter 1	Once upon a time a data was created by a random internet user and abandoned in a see of unstructured data.
A data story	Philippe Mongeon	Chapter 2	An algorithm passed by and decided to add xml tags to the text.
A data story	Philippe Mongeon	Chapter 3	A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to include it in her relational database. The end.

Here, the structure is determined by the columns I chose for my table (title, author, section, and content). But, again, this was an arbitrary choice, and I could have chosen a completely different structure (although the realm of possibilities is somewhat limited by the data and by common sense). The point is that there is rarely an absolute best way of structuring data, and the best structure is the one that best suits the needs of the users. Sometimes the same data may be duplicated and structured differently to suit different users and uses.

Perhaps you noticed that the title and author information is repeated in this last example, which is not a great database design. This course will teach you how to avoid such redundancies and design sound and efficient relational databases. We will talk about the relational model later in the course, but for now, note that relational databases are types of databases that follow a set of rules and principles that make them best suited for structured data.

1.8 Practice

Over the next few weeks of the course, you will design a relational database, store data into it, and write SQL queries to retrieve data. But since this course is intended to be accessible to students with little to no experience working with data, this week is dedicated to developing or polishing your Microsoft Excel skills.

While Excel has some data management capabilities, that’s not what it is designed and most commonly used for. You can still think of it as a very rudimentary form of DBMS. However, Excel is a flexible tool that you can use to quickly explore, manipulate and structure data before creating those structures in an actual DBMS. Excel will also always be relevant in your data management workflow because it remains one of the most accessible ways to work with data for downstream processes (before data gets stored in a DMBS) and upstream processes (after the data is retrieved from the DBMS).

1.8.0.1 Formatting exercise

Download the Easy_Excel.xlsx file (provided by Julie Marcoux, a data librarian at the Killam Library)
Follow the instructions in the instructions sheet.
The exercise sheet contains the data to format.
At the end of the exercise, your table should look similar to the one found in the Results sheet.
The Useful functions sheet contains tips on using some Excel formatting functions.

1.8.0.2 Analyzing exercise

Watch the demo on pivot tables in Excel below.
Create a pivot table to explore the data in the _The exercise sheet of the Easy_Excel.xlsx used for the first exercise.
Explore ways of structuring your pivot table and combining and filtering columns.
Which publisher has the most books in the dataset?
On average, are cloth-bound books more or less expensive than paper-bound books?