4 Relational databases (Pt. 3)
4.1 Learning objectives
Concepts
- Data types
- Constraints
- Indices
- Documentation
Practical skills
- Set data types and constraints for your fields.
4.2 Data types
Every field you include in a table needs to have a data type. The data type is extremely important as it determines how the relational database management system (RDBMS) will interpret, store, and potentially transform the data inserted in the tables. The data type also determines the types of operations that can be performed with the data it contains.
Unfortunately, the data types that are available to use for your database are also determined by the type of database that you are using. In this course, we will be implementing our database designs on a PostgreSQL database, which is one type of SQL (Structured Query Language) database, others being MySQL, SQLite, and Microsoft SQL Server. These databases can vary in what data types they can handle, but they do share basic data types such as numbers, text, and dates.
In this course, we will limit the data types used to three basic kinds: character, numeric, and dates and times.
4.2.1 Character data types
Character data types are very flexible because they can contain any combination of letters, numbers or symbols. The most common data types are character(n) or char(n)
and character varying(n) or varchar(n)
. The n
sets the maximum number of characters that the field can contain. For example, a field with the data type varchar(5)
will not be able to store strings longer than 5 characters (such as the word “banana”). The term varying
indicates that the field is expected to contain strings of different length. You should use the character(n)
data type only when the length of the strings you will store will not vary (ex., country codes, phone numbers, postal codes), and the character varying
data type should be used when the length of the strings is expected to vary. PostgreSQL also has the text
datatype which is essentially the same as character varying with no limit specification.
More information on the string data types can be found in the PostgreSQL documentation.
Let’s say you are storing names of individuals in one of your database fields. You may be tempted to choose text of character varying without specifying a character limit, since you don’t know the length of the longest name that this field could ever store. However, there is in fact a limit of PostgresSQL 10,485,760 characters (which is about 1GB). Not restricting the length of the string might open the door to pranks, attacks, or errors. For example, one could enter a very long string in the field and blow up the storage requirement of your database. It is therefore recommended to set a limit to your character fields in all cases, but to make sure that the limit is large enough to accommodate all plausible values for the field.
4.2.2 Numeric data types
The numeric data types store… you guessed it: numbers! The main numeric data types available in PostgreSQL are listed in the table below.
Name | Storage Size | Description | Range |
---|---|---|---|
smallint |
2 bytes | small-range integer | -32768 to +32767 |
integer |
4 bytes | typical choice for integer | -2147483648 to +2147483647 |
bigint |
8 bytes | large-range integer | -9223372036854775808 to +9223372036854775807 |
decimal |
variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
numeric |
variable | user-specified precision, exact | up to 131072 digits before the decimal point; up to 16383 digits after the decimal point |
real |
4 bytes | variable-precision, inexact | 6 decimal digits precision |
double precision |
8 bytes | variable-precision, inexact | 15 decimal digits precision |
smallserial |
2 bytes | small autoincrementing integer | 1 to 32767 |
serial |
4 bytes | autoincrementing integer | 1 to 2147483647 |
bigserial |
8 bytes | large autoincrementing integer | 1 to 9223372036854775807 |
There are two basic types of numbers: integers (no decimals), and floating-point numbers (allow decimals).
For integers, the main decision you have to make is the storage size. To minimize costs and optimize performance, it is generally best to choose the smallest possible type that will be able to store all plausible values for the field.
Serials are integers that auto-increment, which means that no values need to be inserted in this field when a new entry is created in the table. This is often used for automatically generating unique primary keys.
For values with decimals, I recommend using the numeric
data type that allows you to fix the number of total digits and the number of decimal values. For instance, the numeric(5,2)
data type indicates that the field can contain values with 5 digits and 2 decimals so the range would be -999.99 to 999.99. This is particularly useful for financial data or other cases where having more than a certain number of decimals makes no sense (e.g., 2.333$). If you do not want to fix the precision and scale of your numeric field, you can use the data type numeric
without the parentheses, which will store data with any number of digits before and after th
More information on the numeric data types can be found in the PostgreSQL documentation.
4.2.3 Date and time
The different date and time data types that you can use in MySQL are presented in the following table.
Name | Storage Size | Description | Low Value | High Value | Resolution |
---|---|---|---|---|---|
timestamp without time zone |
8 bytes | both date and time (no time zone) | 4713 BC | 294276 AD | 1 microsecond |
timestamp with time zone |
8 bytes | both date and time, with time zone | 4713 BC | 294276 AD | 1 microsecond |
date |
4 bytes | date (no time of day) | 4713 BC | 5874897 AD | 1 day |
time without time zone |
8 bytes | time of day (no date) | 00:00:00 | 24:00:00 | 1 microsecond |
time with time zone |
12 bytes | time of day (no date), with time zone | 00:00:00+1559 | 24:00:00-1559 | 1 microsecond |
More details on the date and time formats can be found in the PostgreSQL documentation.
PostgreSQL supports a lot more data types (and structures) than those presented above, and the supported data types also vary for different RDBMS, but these systems are constantly evolving and expanding the data types they can handle. But the types presented here should be more than enough for most use cases, and in the context of this class.
4.3 Constraints
SQL constraints are used to specify rules for data in a table. They are used to preserve the integrity of the database, ensure data quality, and prevent data loss. While you do not need to use constraints other than the primary key
and the foreign key
ones for this course, it is helpful to know about constraints and what they do. Here is a list of constraints that are available in PostgreSQL.
Primary key
: As you already know, the primary key must be unique and cannot be empty. The RDBMS also automatically creates an index (explained below) on this field.Not null
: Specifies that this field must always contain a value and cannot be empty (null).Unique
: A field with a unique constraint cannot contain duplicate values.Foreign key
: The value inserted into a foreign key must exist in the primary key field of the parent table. Null values are allowed in foreign key fields.Check
: The check constraint is a very flexible constraint that allows you to specify the specific values that a field can contain. For example, you could set a check constraint on the number of gummies in an order specifying that the value must be greater than 0.Default
: This constraint defines a default value to be inserted in the field when a new record is created, and no value is provided for that specific field.
4.4 Indices
Creating an index for the values of a field can dramatically increase the speed of your queries. Just like an index at the end of a book telling you on what page(s) you can find information about a particular concept or topic instead of having to look at each page to find those sections, the index creates a list of records where each of the different values of that field can be found and stores that list into its memory, so that those records can be retrieved much faster. So, for example, if you have 50 different gummy records in your gummy table, 12 of which are red, 25 are blue, and 13 are green, you could create an index on the gummy_color table. The index would have one record for each colour (red., blue, green), each containing a list of the gummy_id that has that colour.
For example, if you ask the database to retrieve blue gummies from the gummy table and you do not have an index on the colour column, it has to scan each row for “blue.” If the table has 1 million rows, then this verification has to be done 1 million times..
No index on the colour field of the gummy table
With an index on the colour field of the gummy table
If you ask the database to retrieve blue gummies from the gummy table and you have an index on the colour column, the database scans the index until it finds the colour “blue” and then fetches the rows of that are “blue.”.
4.5 Documenting your database
Now that you have a complete database design, with tables and fields with all the appropriate data types and constraints, you are ready to move to the next step: documentation.
Adequately documenting your database is crucial because it ensures that the users will be able to understand what tables contain, why they exist, and how to use your database. You already know (and have applied) best practices for table and field names. That likely reduces the need for your database users to refer to the documentation all the time, but it does not eliminate the need for it. Describing your tables and fields also helps ensure they are needed and well-designed. If you cannot describe the purpose of a table, it may be that it has, in fact, no purpose.
Here are a few simple rules to follow:
Clearly and concisely describe the tables and fields.
For tables only, your description should provide some indication of its purpose (why it is necessary).
Focus on the content of the tables or fields rather than how they will specifically be used or accessed and by whom.
Do not use examples in your descriptions.