6  Summarizing data

6.1 Learning objectives

  • Frequency tables
  • Descriptive statistics

6.2 Introduction

In this chapter, we use numbers and tables to summarize our dataset. When the goals of our analysis are descriptive, these summaries can sometimes be enough to fulfill them. The summaries also help us (and our readers) get to know our data and check that it is adequate for the intended analyses. First we will consider what types of variables we have in our dataset (this is related to, but not the same as, the R data types), and then we will go through the process of generating useful summaries for variables of different types.

6.3 Types of variable

6.3.1 Categorical variables

Categorical variables place observations into groups or categories. They can be represented by characters or numbers.

  • Nominal variables represent categories or groups with no logical order between them (e.g., gender, occupation, course, program, university).

  • Ordinal variables represent categories or groups that have a logical order (e.g., age groups).
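In R, an ordinal variable can be stored as an ordered factor so that the categories keep their logical order. The sketch below uses made-up age groups (the age_group vector and its levels are illustrative and do not come from any dataset used in this chapter).

```r
# Hypothetical age groups, for illustration only
age_group <- c("18-29", "50+", "30-49", "18-29", "30-49")

# Declare the logical order of the categories explicitly
age_group <- factor(age_group,
                    levels = c("18-29", "30-49", "50+"),
                    ordered = TRUE)

levels(age_group)   # categories, in their logical order
age_group < "50+"   # comparisons respect that order
table(age_group)    # frequencies are listed in level order
```

A nominal variable would use factor() without ordered = TRUE, since comparing its categories with < has no meaning.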

6.3.2 Numerical variables

Numerical variables are represented by numbers.

  • Discrete numerical variables can only take specific, separate values (like the numbers on a die or the number of pets a person has). Another way to think about them is that they are things that can be counted (number of cars, number of students, number of pets).

  • Continuous variables can be measured and can theoretically take any value (e.g., the weight or height of a person, the distance between two cities, a price). They are things that can be measured.

Categories represented with numbers

It is important to look at your data to understand what the values represent. Sometimes you may have groups that are represented with numbers. When deciding what type of statistical analysis is adequate for such a variable, you will most likely want to treat it as categorical and not numerical.
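As a minimal sketch of this point, suppose a survey stores a region as the codes 1, 2 and 3 (the codes and labels here are invented for illustration). Converting the codes to a factor makes R treat them as categories rather than quantities.

```r
# Hypothetical numeric codes for a categorical variable
region_code <- c(1, 3, 2, 2, 1, 3, 3)

# The mean of the codes runs without error but has no real meaning
mean(region_code)

# Convert the codes to labelled categories
region <- factor(region_code,
                 levels = c(1, 2, 3),
                 labels = c("North", "South", "West"))

# A frequency table is the appropriate summary
table(region)
```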

The following code creates a table that describes the variables included in the mpg dataset.

library(tidyverse)  # tibble(), the pipe, and the mpg dataset (via ggplot2)
library(kableExtra) # kbl() and kable_classic()

tibble(
  Variable = colnames(mpg),
  Type = c("Nominal",
           "Nominal",
           "Continuous",
           "Discrete",
           "Discrete",
           "Nominal",
           "Nominal",
           "Continuous",
           "Continuous",
           "Nominal",
           "Nominal"),
  Description = c("Manufacturer name",
                  "Model name",
                  "Engine displacement, in litres",
                  "Year of manufacture",
                  "Number of cylinders",
                  "Type of transmission",
                  "Type of drive train",
                  "City miles per gallon",
                  "Highway miles per gallon",
                  "Fuel type",
                  "Type of car")
) %>%
  kbl(
    caption = "Variables in the mpg dataset.",
    align = c("l", "l", "l")
  ) %>%
  kable_classic()
Variables in the mpg dataset.
Variable Type Description
manufacturer Nominal Manufacturer name
model Nominal Model name
displ Continuous Engine displacement, in litres
year Discrete Year of manufacture
cyl Discrete Number of cylinders
trans Nominal Type of transmission
drv Nominal Type of drive train
cty Continuous City miles per gallon
hwy Continuous Highway miles per gallon
fl Nominal Fuel type
class Nominal Type of car

6.4 Summarizing categorical data

There is not a lot that you can do with a single categorical variable other than reporting the frequency (number) and relative frequency (percentage) of observations for each category.

6.4.1 Frequency

We can use the summarize() and n() functions of the dplyr package (included in the tidyverse) to create a table of frequencies for the manufacturer variable. The group_by() function specifies which categorical variable we want to summarize. If we don’t use group_by(), we obtain the number of observations (rows) in the whole tibble.

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  kbl() %>% 
  kable_classic()
manufacturer freq
audi 18
chevrolet 19
dodge 37
ford 25
honda 9
hyundai 14
jeep 8
land rover 4
lincoln 3
mercury 4
nissan 13
pontiac 5
subaru 14
toyota 34
volkswagen 27
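As a side note, dplyr also provides count(), which combines group_by() and summarize() with n() in one step. The sketch below assumes the dplyr and ggplot2 packages are installed (ggplot2 supplies the mpg dataset) and checks that both approaches give the same frequencies. The object names freq_long, freq_short and n_rows are only illustrative.

```r
library(dplyr)
library(ggplot2) # provides the mpg dataset

# Frequency table built with group_by() and summarize(), as in this section
freq_long <- mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n())

# Same table with count(); name = "freq" sets the name of the count column
freq_short <- mpg %>%
  count(manufacturer, name = "freq")

# Without group_by(), n() counts the rows of the whole tibble
n_rows <- mpg %>%
  summarize(freq = n())

identical(freq_long$freq, freq_short$freq) # do both approaches agree?
n_rows
```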

6.4.2 Relative frequency

The relative frequency is simply the frequency expressed as a proportion of the total rather than as a count. It is obtained by first computing the frequency and then dividing each count by the sum of the counts: mutate(rel_freq = freq/sum(freq)).

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(rel_freq = freq/sum(freq)) %>% 
  kbl() %>% 
  kable_classic()
manufacturer freq rel_freq
audi 18 0.0769231
chevrolet 19 0.0811966
dodge 37 0.1581197
ford 25 0.1068376
honda 9 0.0384615
hyundai 14 0.0598291
jeep 8 0.0341880
land rover 4 0.0170940
lincoln 3 0.0128205
mercury 4 0.0170940
nissan 13 0.0555556
pontiac 5 0.0213675
subaru 14 0.0598291
toyota 34 0.1452991
volkswagen 27 0.1153846
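The same calculation can be checked on a small vector with base R. The sketch below uses a made-up vector of fuel codes (not the mpg data): table() gives the frequencies and prop.table() divides each count by the total.

```r
# Hypothetical categorical values, for illustration only
fuel <- c("p", "r", "r", "d", "r", "p", "r", "r")

freq <- table(fuel)           # frequency of each category
rel_freq <- prop.table(freq)  # each count divided by sum(freq)

freq
rel_freq
sum(rel_freq)                 # relative frequencies add up to 1
```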

6.4.2.1 Rounding the values

When we calculate the relative frequency, we obtain numbers with a lot of decimals. We can use the round() function to specify the number of decimals we want. The syntax is round(value, number of decimals).

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(rel_freq = round(freq/sum(freq),3)) %>% 
  kbl() %>% 
  kable_classic()
manufacturer freq rel_freq
audi 18 0.077
chevrolet 19 0.081
dodge 37 0.158
ford 25 0.107
honda 9 0.038
hyundai 14 0.060
jeep 8 0.034
land rover 4 0.017
lincoln 3 0.013
mercury 4 0.017
nissan 13 0.056
pontiac 5 0.021
subaru 14 0.060
toyota 34 0.145
volkswagen 27 0.115

6.4.2.2 Converting the relative frequency to percentages

Another thing we might want to do is show the relative frequency as a percentage. This can be done by multiplying the relative frequency by 100.

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  kbl() %>% 
  kable_classic()
manufacturer freq pct
audi 18 7.7
chevrolet 19 8.1
dodge 37 15.8
ford 25 10.7
honda 9 3.8
hyundai 14 6.0
jeep 8 3.4
land rover 4 1.7
lincoln 3 1.3
mercury 4 1.7
nissan 13 5.6
pontiac 5 2.1
subaru 14 6.0
toyota 34 14.5
volkswagen 27 11.5
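If a percent sign is wanted in the table, one option is to format the values as text with the base R function sprintf(). This is a sketch on the audi count from the tables in this section (18 of 234 cars). Note that the result is a character string, so it is best created only once all calculations are done.

```r
freq <- 18    # number of audi cars in mpg
total <- 234  # number of rows in mpg

pct <- round(freq / total * 100, 1)                 # numeric percentage
pct_label <- sprintf("%.1f%%", freq / total * 100)  # text with a percent sign

pct
pct_label
```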

6.4.2.3 Ordering the categories

We can use arrange() to reorder the table alphabetically or by frequency. The syntax is arrange(x, variable to use for ordering). Rows are sorted in ascending order by default. To sort in descending order, wrap the variable in desc(), like this: arrange(x, desc(variable to use for ordering)). arrange() works with both numeric and character variables.

6.4.2.3.1 Order by frequency (ascending)
mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  arrange(freq) %>% 
  kbl() %>% 
  kable_classic()
manufacturer freq pct
lincoln 3 1.3
land rover 4 1.7
mercury 4 1.7
pontiac 5 2.1
jeep 8 3.4
honda 9 3.8
nissan 13 5.6
hyundai 14 6.0
subaru 14 6.0
audi 18 7.7
chevrolet 19 8.1
ford 25 10.7
volkswagen 27 11.5
toyota 34 14.5
dodge 37 15.8
6.4.2.3.2 Order by frequency (descending)
mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  arrange(desc(freq)) %>% 
  kbl() %>% 
  kable_classic()
manufacturer freq pct
dodge 37 15.8
toyota 34 14.5
volkswagen 27 11.5
ford 25 10.7
chevrolet 19 8.1
audi 18 7.7
hyundai 14 6.0
subaru 14 6.0
nissan 13 5.6
honda 9 3.8
jeep 8 3.4
pontiac 5 2.1
land rover 4 1.7
mercury 4 1.7
lincoln 3 1.3
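Several categories in these tables share the same frequency (for example, hyundai and subaru). arrange() accepts more than one ordering variable, and later variables are used to break ties. A small sketch, using a hand-made tibble with a few of the counts shown in this section and assuming dplyr is installed:

```r
library(dplyr)

# A few rows copied from the frequency table in this section
freq_table <- tibble(
  manufacturer = c("subaru", "mercury", "hyundai", "land rover", "dodge"),
  freq = c(14, 4, 14, 4, 37)
)

# Sort by decreasing frequency, then alphabetically within ties
sorted <- freq_table %>%
  arrange(desc(freq), manufacturer)

sorted
```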

6.4.3 Adding a total

In chapter 3 we learned about the bind_rows() function, which appends one tibble to another. The process for adding a new row with a total is:

  1. Store your frequency table in an object.
  2. Create a new tibble that contains the totals. (Important: make sure that the column names of this tibble are exactly the same as those of your frequency table; otherwise, bind_rows() will create new columns.)
  3. Use bind_rows() to add the tibble with the total to the frequency table.
# Store your frequency table in an object
freq_table <- mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>%
  arrange(desc(freq))

# Create a new tibble that contains the totals.
totals <- freq_table %>%
  summarize(freq = sum(freq),
            pct = sum(pct)) %>%
  mutate(manufacturer = "Total") %>% # label for the total row
  select(manufacturer, freq, pct)    # same column order as the frequency table

# Use bind_rows() to combine the tibbles
freq_table %>%
  bind_rows(totals) %>%
  kbl() %>%
  kable_classic()
manufacturer freq pct
dodge 37 15.8
toyota 34 14.5
volkswagen 27 11.5
ford 25 10.7
chevrolet 19 8.1
audi 18 7.7
hyundai 14 6.0
subaru 14 6.0
nissan 13 5.6
honda 9 3.8
jeep 8 3.4
pontiac 5 2.1
land rover 4 1.7
mercury 4 1.7
lincoln 3 1.3
Total 234 99.9
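The total of the pct column is 99.9 rather than 100 because the percentages were rounded before being summed. One way around this, sketched below under the assumption that dplyr and ggplot2 are installed, is to set the total percentage directly instead of adding up the rounded values.

```r
library(dplyr)
library(ggplot2) # provides the mpg dataset

freq_table <- mpg %>%
  count(manufacturer, name = "freq") %>%
  mutate(pct = round(freq / sum(freq) * 100, 1))

# Summing rounded percentages accumulates rounding error
sum(freq_table$pct)

# The total row covers all observations, so its percentage is 100 by definition
totals <- tibble(
  manufacturer = "Total",
  freq = sum(freq_table$freq),
  pct = 100
)

bind_rows(freq_table, totals) %>%
  tail(3) # show the last rows, including the total
```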

6.5 Summarizing numerical data

Summarizing numerical data is not done with frequency tables but with statistical summaries that include various measures that can be divided into three groups:

  1. Measures of centrality
  2. Measures of dispersion
  3. Measures of symmetry

6.5.1 Measures of centrality

Statistic | Description | Formula | R function
Mean | The sum of the values divided by the number of observations | \[ \overline{x} = \frac{\sum{x_i}}{n} \] | mean(x)
Median | The middle value of the variable once sorted in ascending or descending order | If n is odd: \[ M_x = x_{(n+1)/2} \] If n is even: \[ M_x = \frac{x_{n/2} + x_{(n/2)+1}}{2} \] | median(x)
Mode | Most frequent value(s) of a variable | N/A | N/A (see below)
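The two median formulas can be checked by hand on small vectors. The sketch below uses invented values: it sorts them, picks the middle position(s), and compares the result with median().

```r
# Odd number of observations (n = 5)
x_odd <- sort(c(7, 2, 9, 4, 5))
n <- length(x_odd)
median_odd <- x_odd[(n + 1) / 2]  # the single middle value

# Even number of observations (n = 6)
x_even <- sort(c(7, 2, 9, 4, 5, 10))
n <- length(x_even)
median_even <- (x_even[n / 2] + x_even[n / 2 + 1]) / 2  # mean of the two middle values

median_odd == median(x_odd)
median_even == median(x_even)
```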

Base R has no built-in function to calculate the statistical mode (the mode() function returns the storage type of an object instead), but you can create your own function. Source: https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode

modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

modes(c(1,2,3,4,4,5,5,6,7,8,9))
[1] 4 5
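Because it relies only on unique() and match(), the function also works on categorical data. The sketch below repeats the definition so that it runs on its own and applies it to a made-up character vector.

```r
# Same function as in this section, repeated so the example is self-contained
modes <- function(x) {
  ux <- unique(x)                # distinct values
  tab <- tabulate(match(x, ux))  # how many times each distinct value occurs
  ux[tab == max(tab)]            # every value that reaches the highest count
}

# Hypothetical drive-train codes, for illustration only
modes(c("f", "4", "f", "r", "4", "f"))
```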

6.5.2 Measures of dispersion

Statistic | Definition | Formula | R function
Variance (Var) | Average squared deviation from the mean, which measures how far the values spread around it. var() computes the sample variance, dividing by n - 1 | \[ Var = \frac{\sum{(x_i-\overline{x})^2}}{n-1} \] | var(x)
Standard deviation (SD) | Square root of the variance | \[ SD = \sqrt{\frac{\sum{(x_i-\overline{x})^2}}{n-1}} \] | sd(x)
Minimum (Min) | Minimum value of a variable | N/A | min(x)
Maximum (Max) | Maximum value of a variable | N/A | max(x)
Quartiles | The values under which 25% (Q1), 50% (Q2, also the median), and 75% (Q3) of the data points are found when arranged in increasing order | N/A | Q1 = quantile(x, 0.25); Q2 = quantile(x, 0.5); Q3 = quantile(x, 0.75)
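R's var() and sd() compute the sample versions of these statistics, which divide by n - 1. The sketch below reproduces both by hand on invented values, and also shows the population variance (dividing by n) for comparison.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # made-up values
n <- length(x)

squared_dev <- (x - mean(x))^2  # squared deviations from the mean

var_sample <- sum(squared_dev) / (n - 1)  # what var(x) returns
var_population <- sum(squared_dev) / n    # divides by n instead

all.equal(var_sample, var(x))
all.equal(sqrt(var_sample), sd(x))
var_population

quantile(x, c(0.25, 0.5, 0.75))  # Q1, Q2 (median) and Q3
```

With many observations the two versions are nearly identical. The difference matters mostly for small samples.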

6.5.3 Measures of symmetry

The psych package includes two functions to calculate the skewness (skew()) and the kurtosis (kurtosi()). These measures describe how the shape of the distribution deviates from a normal distribution. As a rule of thumb, a skewness above 1 indicates a distribution skewed to the right, and a skewness below -1 a distribution skewed to the left. Similarly, a kurtosis above 1 indicates a distribution that is too peaked, and a kurtosis below -1 one that is too flat.
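To make the idea concrete, here is a sketch of moment-based skewness and excess kurtosis written in base R. The helper names skewness_sketch() and kurtosis_sketch() are hypothetical, and the formulas are one common definition. Packaged functions such as those in psych may apply slightly different small-sample adjustments, so their values need not match these exactly.

```r
# Hypothetical helper: third standardized moment
skewness_sketch <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^(3 / 2)
}

# Hypothetical helper: fourth standardized moment minus 3 (the value for a normal distribution)
kurtosis_sketch <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric <- c(1, 2, 3, 4, 5)
right_skewed <- c(1, 1, 2, 2, 3, 10)  # long tail to the right

skewness_sketch(symmetric)     # zero for perfectly symmetric values
skewness_sketch(right_skewed)  # positive
kurtosis_sketch(symmetric)     # negative: flatter than a normal distribution
```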

6.6 Creating a descriptive statistics summary

We can easily create a table with the descriptive statistics summary for as many numerical variables as we want using the pivot_longer(), group_by() and summarize() functions that you already learned about. In the code below, I reduce the size of the fonts in my table with kable_styling(font_size = 10) so that the table can fit on the page.

library(psych) # Load the psych library for the skew() and kurtosi() functions

mpg %>%
  pivot_longer(c("displ","hwy","cty"), # this is where we specify which variables to include
               names_to = "variable", 
               values_to = "value") %>% 
  group_by(variable) %>% 
  summarize(n = n(),
            mean = mean(value),
            sd = sd(value),
            var = var(value),
            q1 = quantile(value,0.25),
            median = median(value),
            q3 = quantile(value,0.75),
            min = min(value),
            max = max(value),
            skew = skew(value),
            kurtosis = kurtosi(value)
            ) %>% 
  kbl() %>% 
  kable_styling(font_size = 10)
variable n mean sd var q1 median q3 min max skew kurtosis
cty 234 16.858974 4.255946 18.113074 14.0 17.0 19.0 9.0 35 0.7863773 1.4305385
displ 234 3.471795 1.291959 1.669158 2.4 3.3 4.6 1.6 7 0.4386361 -0.9105615
hwy 234 23.440171 5.954643 35.457779 18.0 24.0 27.0 12.0 44 0.3645158 0.1369447
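To see what pivot_longer() contributes to this pipeline, it can help to run it on a tiny example. The sketch below uses an invented three-row tibble (the column names a and b are placeholders) and assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# A small made-up tibble with two numerical variables
wide <- tibble(
  a = c(1, 2, 3),
  b = c(10, 20, 60)
)

# One row per (variable, value) pair instead of one column per variable
long <- wide %>%
  pivot_longer(c("a", "b"),
               names_to = "variable",
               values_to = "value")

# group_by() can now treat each original column as a group
stats <- long %>%
  group_by(variable) %>%
  summarize(n = n(),
            mean = mean(value),
            median = median(value))

long
stats
```

Because every statistic is computed once per group, adding a variable to the summary only requires adding its name to the pivot_longer() call.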

6.7 Summary

In this chapter, we learned how to produce clear and well-presented tables that summarize data, help us and our readers understand it, and ensure that it is adequate for the intended analyses.

6.8 Homework

The homework for this week is lab 4, in which you will summarize a dataset of your choice.