2 Data structures

2.1 From unstructured to structured data

Data can take many forms that we can situate along a continuum with unstructured data at one extreme, structured data at the other, and semi-structured data in between. In this section, we briefly explore what these concepts mean.

2.1.1 Unstructured data

Let us consider a simple example to better grasp the difference between structured and unstructured data and, at the same time, to get a sense of the process of structuring data so that it can be more effectively strored and analyzed.

Here is a brilliant piece of writing:

A data story
Philippe Mongeon
Chapter 1
Once upon a time, a random internet user created a short text and abandoned it in a sea of unstructured data.
Chapter 2
An algorithm passed by and decided to add XML tags to the text.
Chapter 3
A data manager stumbled upon the data and thought it would be relevant for the database users. So they decided to add it to their database.
The end.

That’s unstructured data (not entirely, the line breaks are a very basic structure). We humans who have read a book before can see the structure. We see a title, an author name, and three chapters that contain some text. This is our brains structuring the data that our eyes capture. The computer does not perform that structuring process by itself. For it, this is just a bunch of symbols. In fact, for a computer this is all 1s and 0s (binary data) because that’s the only type of data computers actually stores and understands. Because the computers we use are made by and for humans, we have programmed them to show us symbols we understand.

2.1.2 Semi-structured data

Another way to think about the data above is that it has an implicit structure. But humans have invented different tools to make the structure of data explicit. One of these tools is XML (Extensible Markup Language). It is very similar to the Hypertext Markup Language (HTML), which web browers use to interpret the structure of a webpage (such as the one you are currently viewing). These tagging systems use tags to allow humans and computer to differentiate between different parts of the data, and to indicate what it represents (e.g., a title, an author, a paragraph, etc.)

Here is how, for example, we can apply XML tags to give our data some structure.

<book>
  <title>A data story</title>
  <author>Philippe Mongeon</author>
  <chapter>
    <title>Chapter 1</title>
    <paragraph>Once upon a time a short text was created by a random internet user
    and abandoned in a sea of unstructured data.</paragraph>
  </chapter>
  <chapter>
    <title>Chapter 2</title>
    <paragraph>An algorithm passed by and decided to add xml tags to the text.</paragraph>
  </chapter>
  <chapter>
    <title>Chapter 3</title>
    <paragraph>A data manager stumbled upon the data, thought it would be relevant
    for the database users, and so she decided to add it to include it in her
    database.</paragraph>
    <paragraph>The end.</paragraph>
  </chapter>
</book>

As you can see, the XML tags already help us (and the computer) identify different text components, such as the title, the author, headers and paragraphs. XML documents are a typical example of semi-structured data.

Another frequently used semi-structured format is JSON. It is a name-value pair format. We can use it to structure our text by identifying the different parts (the names) and their content (the values).

{
  "title": "A data story",
  "author": "Philippe Mongeon",
  "Chapter 1": "Once upon a time a short text was created by a random internet user and abandoned in a sea of unstructured data.",
  "Chapter 2": "An algorithm passed by and decided to add xml tags to the text.",
  "Chapter 3": "A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to her database. The end."
}

Notice how I included the Chapter number in the tags in the JSON example above? That was an arbitrary choice, and I could have chosen another structure for my file, such as this one:

{
  "title": "A data story",
  "author": "Philippe Mongeon",
  "Chapters": [
      {
        "title": "Chapter 1",
        "content": "Once upon a time a short text was created by a random internet user and abandoned in a sea of unstructured data."
      }
      {
        "title": "Chapter 2",
        "content": "An algorithm passed by and decided to add xml tags to the text."
      }
      {
        "title": "Chapter 3",
        "content": "A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to her database. The end."
      }
  ]
}

This structure (or semi-structure) is just as good as the previous one. That flexibility is one of the reasons why semi-structured data formats and databases are growing in popularity. Because the data structure is included directly in the document, the users do not need to follow a predefined structure and use predefined tags to store their data in the database. This means that, theoretically, every record could have a different structure, which would be fine, but most likely not optimal for an end user trying to make sense of a dataset and extract useful knowledge from it.

2.1.3 Structured data

Structured data and databases are typically tabular (consisting of rows and columns, like an Excel spreadsheet). Each row is a record or entry, and each column is a field (also called features, or variable). The structure is part of the database design (and not each database entry, as in the semi-structured format), and so every new record has to follow that same format. For example, here is the same work of art in a structured format.

Example of structured data
Title	Author	Section	Content
A data story	Philippe Mongeon	Chapter 1	Once upon a time a data was created by a random internet user and abandoned in a see of unstructured data.
A data story	Philippe Mongeon	Chapter 2	An algorithm passed by and decided to add xml tags to the text.
A data story	Philippe Mongeon	Chapter 3	A data manager stumbled upon the data, thought it would be relevant for the database users, and so she decided to add it to include it in her database. The end.

Here, the structure is determined by the columns I chose for my table (title, author, section, and content). But, again, this was an arbitrary choice, and I could have chosen a completely different structure (although the realm of possibilities is somewhat limited by the data and by common sense). The point is that there is rarely an absolute best way of structuring data, and the best structure is the one that best suits the needs of the users. Sometimes the same data may be duplicated and structured differently to suit different users and uses. There are however data structure standards that you should follow when using certain established data models and DBMS, but these are beyond the scope of this course.

2.2 The power of structure

The more structured a dataset is, the more easily it will be to use to extract knowledge by an end-user. Unfortunately, data is not always found in easily exploitable structures and will often require some cleaning and processing. There is a well-known saying that data scientists spend 80% of their time cleaning their data. Clean data that fits a clear purpose is actually extremely easy to analyze… the problem is that the purpose isn’t always clear in the end of the user, and data does not always come in the exact format needed for that purpose.

In this course, we will not deal with extremely complex structures and very messy data, but we will get comfortable with a few basic data processing tasks in Excel that will help you make that most of the imperfect data that you are likely to encounter in your professional life.

2.3 Practice

Since this course is intended to be accessible to students with little to no experience working with data, this week is dedicated to developing or polishing your Microsoft Excel skills.

While Excel has some data management capabilities, that’s not what it is designed and most commonly used for. You can still think of it as a very rudimentary form of DBMS. However, Excel is a flexible tool that you can use to quickly explore, manipulate and structure data before creating those structures in an actual DBMS. Excel will also always be relevant in your data management workflow because it remains one of the most accessible ways to work with data for downstream processes (before data gets stored in a DMBS) and upstream processes (after the data is retrieved from the DBMS).

2.3.0.1 Formatting exercise

Download the Easy_Excel.xlsx file (provided by Julie Marcoux, a data librarian at the Killam Library)
Follow the instructions in the instructions sheet.
The exercise sheet contains the data to format.
At the end of the exercise, your table should look similar to the one found in the Results sheet.
The Useful functions sheet contains tips on using some Excel formatting functions.

2.3.0.2 Analyzing exercise

Watch the demo on pivot tables in Excel below.
Create a pivot table to explore the data in the _The exercise sheet of the Easy_Excel.xlsx used for the first exercise.
Explore ways of structuring your pivot table and combining and filtering columns.
Which publisher has the most books in the dataset?
On average, are cloth-bound books more or less expensive than paper-bound books?