3  Data Types

One of the most important piece of fundamental knowledge that you need to have if you are going to work efficiently with data is the existence of very basic data types: characters, numbers, dates, and logical (or boolean).

3.1 Why data types matter

Storage efficiency: This is not going to be a concern in the context of this course, but data, like any other objects stored on a computer, takes up memory space. Typically, numbers (especially small numbers) take up less memory space than text, for instance. When dealing with very large databases that support, for example, the operations of a large multinational organization, or a data intensive organization like Meta, using a sub-optimal data type for a column in the data can have significant implications for the data storage costs.

Data integrity: Because Excel is a very flexible tool, it does not do a good job at ensuring the integrity of the data. For example, a column can contain both text, numbers, and dates. For example, if you have a column called age and set the data type to number, you will still be able to enter the text Banana in that column, and Excel will automatically convert that cell, and only that cell, to the general (undefined)data type instead of throwing an error (see example below).

Name Age
Charlie Brown 12
Snow White Banana
Harry Potter 14

So when working with Excel documents, it is a good idea to explore the data set and try to spot and fix errors. It fact it is always a good idea to check the integrity of the data no matter what software is used to access it. Has the saying goes: garbage in, garbage out. We certainly don’t want to make important business or life decision based on erroneous data, as these errors depending on their frequency and nature, can drastically affect the results of analysis and have lead to decisions with devastating effects.

Data analysis: Data types have important implications for analysis, because you cannot perform the same operations on text and on numbers. When we learn about data analysis and data visualization later in the course, we will see that the data type is the main factor that determines how we can analyze or visualize the data to gain insights from it. For now, it is enough to understand that some operations are possible (or logical) for some data types but not for others.

3.2 Text

Text is the most simple data type to work with because it can contain any sequences of letters, numbers, or symbols.

Examples of the text data type
I love sour gummies
2 + 2 = 4
I have 2 cats and 1 gold fish.
thisisnotarealemail@somesite.com
2
=average(B2:4)

Notice that in the second to last example, the text cell only contains what we would understand as a number. If we have not specified that the cell should contain text, then Excel will automatically store 2 as a number. But if we have set the type of the cell to text then Excel will store the character (and not the number) 2.

The last example is an Excel formula, however, stored as text the it won’t be understood as a formula and will not perform the intended operation, which in this case would be to calculate the average of the values contained in cells B2 to B4 of the Excel spreadsheet.

Text data is very limited in terms in terms of the type of analysis that we can do with it. We can perform operation like counting the number of characters a cell contains, or testing whether a cell contains a specific character sequences (e.g., identify all the cells that contain the word data in a spreadsheet). The most frequent use of text data is calculating frequencies (counting the number of times a specific text appears in a spreadsheet. We will explore this in more detail later in the course, but for now, here is an quick example of a dataset of names of which we could calculate frequencies.

Name
John
Julia
John
Jim
Julia
Julia

Using this data, we can determine the frequency of each name, which essentially means counting the number of rows that contain each specific name in the dataset, which in this case would yield the following result:

Name Frequency (or count)
John 2
Jim 1
Julia 3

3.3 Numbers

The second most fundamental data type that every one needs to understand is numbers or numerical data. As we have seen, it is perfectly possible to store the symbol 2 as text, but by default or if specified by the user, Excel would treat the symbol 2 as a number.

There are two basic types of numbers: numbers without decimals (also know as integer) and numbers with decimals. In real life, we encounters numbers like temperature, prices. But what they represented does not matter for a computer. The number 7 is the number 7, whether it is dollars or degrees is irrelevant. However, Excel is a flexible tools that allows users to format and present numbers in ways that are meaningful to them, so you can choose to display a number as a monetary value, as a percentage, a fraction, or in scientific notation.

Unlike text data, a wide range of mathematical operations can be applied to numerical data, such as addition, substraction, multiplication, division. We can summarize numerical data by calculating things like averages, sum, minimum, maximum, standard deviation, and frequencies, whereas, as we learned, we can only calculate the frequencies for textual data.

3.4 Dates and times

Dates and times are a special data type because dates are represented as a day, a month, and a year, and times are represented as an hour, a minute, and a number of second. In fact, they are stored in the computer as simple numbers. For example, if you ask Excel to represent the number 1 as a date, you will get the first of January, 1900, or 1900-01-01. Times are stored by the computer as decimals. For example, the first of January goes from 12:00 am (1.0000) to 11:59:59 pm (1.99999).

While different systems may use different methods to store dates as numbers. The most important thing to remember is that what you see as a date in Excel or in other software is actually stored as a number. Why does that matter? Because it means that you can apply to dates the same operations as you would to numbers. For example, if you add 1 to the date 1900-01-31, you obtain 1900-02-01 (and not something weird like 1900-01-32, which would not be valid).

Dates and times can be decomposed into their different parts. We will revisit this in the data processing chapter, but for now here’s an example of a date and time (2012-03-21 8:35:00 AM) decomposed using excel functions (year, month, day, hour, minute, second) that we will learn more about later.

Formula Outcome
=YEAR(“2012-03-21 8:35:00 AM”) 2012
=MONTH(“2012-03-21 8:35:00 AM”) 3
=DAY(“2012-03-21 8:35:00 AM”) 21
=HOUR(“2012-03-21 8:35:00 AM”) 8
=MINUTE(“2012-03-21 8:35:00 AM”) 35
=SECOND(“2012-03-21 8:35:00 AM”) 0

3.5 Logical (boolean)

A logical data type, also known as a Boolean data type, is use o represent two possible values: TRUE or FALSE. These two values are stored in Excel as 1 (TRUE) or 0 (FALSE), although some programming languages will interpret 0 as FALSE and any other value as TRUE. To demonstrate this we can sum the logical values and see what the results are:

A B =A+B
FALSE FALSE 0
TRUE FALSE 1
FALSE TRUE 1
TRUE TRUE 2

The logical data type is often encountered in Excel as the result of a logical operation (a test) applied to some variable. We could ask, Excel, for example, to test whether the content of a cell is greater than the content of another cell. If that is true, then Excel will return the logical value TRUE, and if not it will return the logical value FALSE. Here’s an example:

A B =A<B (A is smaller than B)
72 21 FALSE
14 14 FALSE
12 21 TRUE
Apple Banana TRUE

We can see in the last example that logical tests can also be applied to text. In this case, Excel checks if Apple comes before Banana in the alphabetical order.

3.6 Summary

In this chapter, you learn that:

  • Data types are imporant for storage efficiency, data integrity, and data analysis
  • The two fundamental data types are text and numbers.
  • Numbers can be displayed in different ways such as monetary values, dates, and times.
  • A wide range of mathematical functions can be applied to numerical data, but not to textual data.
  • Dates and times are really stored as numbers
  • Dates and times can be decomposed in year, month, day, hour, minute, and second.
  • The logical (or boolean) data type takes two possible values (TRUE or FALSE), which are stored as 1 and 0, respectively.
  • The result of logical operations are stored as logical values.