6 Summarizing data

6.1 Learning objectives

Frequency tables
Descriptive statistics

6.2 Introduction

In this chapter, we use numbers and tables to summarize our dataset. When the goals of our analysis are descriptive, these summaries can sometimes be sufficient on their own. They also help us (and our readers) get to know the data and check that it is adequate for the intended analyses. First, we will consider what types of variables we have in our dataset (this is related to, but not the same as, the R data types), and then we will go through the process of generating useful summaries for variables of different types.

6.3 Types of variable

6.3.1 Categorical variable

Categorical variables represent groups or categories. They can be stored as characters or numbers.

Nominal variables represent categories or groups that have no logical order (e.g., gender, occupation, course, program, university).
Ordinal variables represent categories or groups that have a logical order (e.g., age groups).
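
In R, one way to record the logical order of an ordinal variable is an ordered factor. The sketch below uses made-up age groups; the values and level names are assumptions chosen only for illustration.

```r
# Hypothetical age groups for five respondents (illustrative values only)
age_group <- factor(
  c("18-29", "45+", "30-44", "18-29", "30-44"),
  levels = c("18-29", "30-44", "45+"),  # listed from lowest to highest
  ordered = TRUE
)

levels(age_group)   # the categories, in their logical order
age_group < "45+"   # comparisons are meaningful for ordered factors
```

A nominal variable would be stored the same way but without ordered = TRUE, since comparing its categories makes no sense.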

6.3.2 Numerical variables

Numerical variables are represented by numbers.

Discrete variables can only take a certain number of values (like the numbers on a die or the number of pets a person has). Another way to think about them is that they are things that can be counted (number of cars, number of students, number of pets).
Continuous variables can theoretically take any value within a range (e.g., the weight or height of a person, the distance between two cities, a price). They are things that can be measured.

Categories represented with numbers

It is important to look at your data to understand what the values represent. Sometimes you may have groups that are represented with numbers. When deciding what type of statistical analysis is appropriate for such a variable, you will most likely want to treat it as categorical, not numerical.
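
As a minimal sketch of this situation, the example below uses a made-up vector of numeric codes (the codes and their labels are assumptions, not taken from any dataset in this chapter) and converts it to a factor so that R treats it as categorical.

```r
# Hypothetical codes: 1 = front-wheel, 2 = rear-wheel, 3 = four-wheel drive
drive_code <- c(1, 2, 2, 3, 1, 3, 3)

mean(drive_code)  # R will compute this, but the result has no useful meaning

# Converting to a factor tells R (and the reader) that these are categories
drive <- factor(drive_code,
                levels = c(1, 2, 3),
                labels = c("front", "rear", "four-wheel"))

table(drive)  # frequencies per category
```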

The following code creates a table that describes the variables included in the mpg dataset.

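library(tidyverse)  # tibble(), the pipe, and the mpg dataset (via ggplot2)
library(kableExtra) # kbl() and kable_classic()

tibble(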
    Variable = colnames(mpg),
    Type = c("Nominal",
             "Nominal",
             "Continuous",
             "Discrete",
             "Discrete",
             "Nominal",
             "Nominal",
             "Continuous",
             "Continuous",
             "Nominal",
             "Nominal"),
    Description = c("Manufacturer name",
             "Model name",
             "Engine displacement, in litres",
             "Year of manufacture",
             "Number of cylinders",
             "Type of transmission",
             "Type of drive train",
             "City miles per gallon",
             "Highway miles per gallon",
             "Fuel type",
             "Type of car")
  ) %>% 
  kbl(
    caption = "Variables in the mpg dataset.",
    align = c("l","l","l")
  ) %>% 
  kable_classic()

Variables in the mpg dataset.
Variable	Type	Description
manufacturer	Nominal	Manufacturer name
model	Nominal	Model name
displ	Continuous	Engine displacement, in litres
year	Discrete	Year of manufacture
cyl	Discrete	Number of cylinders
trans	Nominal	Type of transmission
drv	Nominal	Type of drive train
cty	Continuous	City miles per gallon
hwy	Continuous	Highway miles per gallon
fl	Nominal	Fuel type
class	Nominal	Type of car

6.4 Summarizing categorical data

There is not a lot that you can do with a single categorical variable other than reporting the frequency (number) and relative frequency (percentage) of observations for each category.

6.4.1 Frequency

We can use the summarize() and n() functions of the dplyr package (included in the tidyverse) to create a table of frequencies for the manufacturer variable. The group_by() function specifies which categorical variable we want to summarize. Without group_by(), we obtain the number of observations (rows) in the whole tibble.

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  kbl() %>% 
  kable_classic()

manufacturer	freq
audi	18
chevrolet	19
dodge	37
ford	25
honda	9
hyundai	14
jeep	8
land rover	4
lincoln	3
mercury	4
nissan	13
pontiac	5
subaru	14
toyota	34
volkswagen	27
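
As an aside, dplyr also provides count(), which combines the grouping and counting steps. The sketch below should produce the same frequencies as the code above; the object name freq_table is our own choice.

```r
library(tidyverse)  # loads dplyr and ggplot2 (which provides mpg)

freq_table <- mpg %>%
  count(manufacturer, name = "freq")  # one row per manufacturer, counts in freq

freq_table
```

We keep using group_by() and summarize() in the rest of the chapter because the same pattern extends to other summaries, not only counts.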

6.4.2 Relative frequency

The relative frequency is the frequency expressed as a proportion of the total rather than as a count. It is obtained by first computing the frequency and then dividing each count by the sum of the counts: mutate(rel_freq = freq/sum(freq)).

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(rel_freq = freq/sum(freq)) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	rel_freq
audi	18	0.0769231
chevrolet	19	0.0811966
dodge	37	0.1581197
ford	25	0.1068376
honda	9	0.0384615
hyundai	14	0.0598291
jeep	8	0.0341880
land rover	4	0.0170940
lincoln	3	0.0128205
mercury	4	0.0170940
nissan	13	0.0555556
pontiac	5	0.0213675
subaru	14	0.0598291
toyota	34	0.1452991
volkswagen	27	0.1153846

6.4.2.1 Rounding the values

When we calculate the relative frequency, we obtain numbers with a lot of decimals. We can use the round() function to specify the number of decimals we want. The syntax is round(value, number of decimals).

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(rel_freq = round(freq/sum(freq),3)) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	rel_freq
audi	18	0.077
chevrolet	19	0.081
dodge	37	0.158
ford	25	0.107
honda	9	0.038
hyundai	14	0.060
jeep	8	0.034
land rover	4	0.017
lincoln	3	0.013
mercury	4	0.017
nissan	13	0.056
pontiac	5	0.021
subaru	14	0.060
toyota	34	0.145
volkswagen	27	0.115
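
As a small standalone check of the round(value, number of decimals) syntax, the sketch below rounds a single relative frequency (audi, 18 of the 234 cars) to different numbers of decimals. The object name rel_freq is our own.

```r
rel_freq <- 18 / 234   # audi: 18 of the 234 cars in mpg

rel_freq               # many decimals
round(rel_freq, 3)     # three decimals
round(rel_freq, 1)     # one decimal
```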

6.4.2.2 Converting the relative frequency to percentages

Another thing we might want to do is show the relative frequency as a percentage. This can be done by multiplying the relative frequency by 100.

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	pct
audi	18	7.7
chevrolet	19	8.1
dodge	37	15.8
ford	25	10.7
honda	9	3.8
hyundai	14	6.0
jeep	8	3.4
land rover	4	1.7
lincoln	3	1.3
mercury	4	1.7
nissan	13	5.6
pontiac	5	2.1
subaru	14	6.0
toyota	34	14.5
volkswagen	27	11.5

6.4.2.3 Ordering the categories

We can use arrange() to reorder the table alphabetically or by frequency. The syntax of the arrange function is arrange(x, variable to use for ordering), where x is the tibble (when we use the pipe, x is supplied for us). Rows are sorted in ascending order by default. To sort in descending order, wrap the variable in desc(), like this: arrange(x, desc(variable to use for ordering)). You can use arrange() with numeric or character variables.

6.4.2.3.1 Order by frequency (ascending)

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  arrange(freq) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	pct
lincoln	3	1.3
land rover	4	1.7
mercury	4	1.7
pontiac	5	2.1
jeep	8	3.4
honda	9	3.8
nissan	13	5.6
hyundai	14	6.0
subaru	14	6.0
audi	18	7.7
chevrolet	19	8.1
ford	25	10.7
volkswagen	27	11.5
toyota	34	14.5
dodge	37	15.8

6.4.2.3.2 Order by frequency (descending)

mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  arrange(desc(freq)) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	pct
dodge	37	15.8
toyota	34	14.5
volkswagen	27	11.5
ford	25	10.7
chevrolet	19	8.1
audi	18	7.7
hyundai	14	6.0
subaru	14	6.0
nissan	13	5.6
honda	9	3.8
jeep	8	3.4
pontiac	5	2.1
land rover	4	1.7
mercury	4	1.7
lincoln	3	1.3

6.4.3 Adding a total

In chapter 3, we learned about the bind_rows() function, which appends the rows of one tibble to another. The process for adding a new row with a total is:

Store your frequency table in an object.
Create a new tibble that contains the totals (important: make sure that the column names of this tibble are exactly the same as those of your frequency table; otherwise, bind_rows() will create new columns).
Use bind_rows() to add the tibble with the total to the frequency table.

# Store your frequency table in an object
table <- mpg %>% 
  group_by(manufacturer) %>% 
  summarize(freq = n()) %>% # n() counts the number of observations for each group.
  mutate(pct = round(freq/sum(freq)*100,1)) %>% 
  arrange(desc(freq))

# Create a new tibble that contains the totals.
totals <- table %>% 
  summarize(freq = sum(freq),
         pct = sum(pct)) %>% 
  mutate(manufacturer = "Total") %>% 
  select(manufacturer, freq, pct)
  
# Use bind_rows() to combine the tibbles
  
table %>% 
  bind_rows(totals) %>% 
  kbl() %>% 
  kable_classic()

manufacturer	freq	pct
dodge	37	15.8
toyota	34	14.5
volkswagen	27	11.5
ford	25	10.7
chevrolet	19	8.1
audi	18	7.7
hyundai	14	6.0
subaru	14	6.0
nissan	13	5.6
honda	9	3.8
jeep	8	3.4
pontiac	5	2.1
land rover	4	1.7
mercury	4	1.7
lincoln	3	1.3
Total	234	99.9
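
The total percentage in the table above is 99.9 rather than 100 because each percentage was rounded before being summed. One possible way around this, sketched below, is to compute the total percentage from the counts instead of from the rounded column. The object names freq_table and totals are our own.

```r
library(tidyverse)  # loads dplyr and ggplot2 (which provides mpg)

freq_table <- mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%
  mutate(pct = round(freq / sum(freq) * 100, 1)) %>%
  arrange(desc(freq))

# Compute the total percentage from the counts, not from the rounded column
totals <- freq_table %>%
  summarize(freq = sum(freq)) %>%
  mutate(manufacturer = "Total",
         pct = round(freq / sum(freq) * 100, 1)) %>%
  select(manufacturer, freq, pct)

bind_rows(freq_table, totals)
```

Whether to report 99.9 or 100 is a presentation choice; if you keep the sum of the rounded values, it can help to tell readers that the difference is due to rounding.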

6.5 Summarizing numerical data

Summarizing numerical data is not done with frequency tables but with statistical summaries that include various measures that can be divided into three groups:

Measures of centrality
Measures of dispersion
Measures of skewness

6.5.1 Measures of centrality

Statistic	Description	Formula	R function
Mean	The sum of the values divided by the number of observations	\[ \overline{x} = \frac{\sum{x_i}}{n} \]	`mean(x)`
Median	The middle value of the variable once sorted in ascending or descending order	If n is odd: \[ M_x = x_{(n+1)/2} \] If n is even: \[ M_x = \frac{x_{(n/2)} + x_{(n/2)+1}}{2} \]	`median(x)`
Mode	Most frequent value(s) of a variable	N/A	N/A (see below)

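Base R has no built-in function to calculate the statistical mode (the mode() function returns the storage mode of an object, not its most frequent value), but you can create your own function. Source: https://stackoverflow.com/questions/2547402/how-to-find-the-statistical-mode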

modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

modes(c(1,2,3,4,4,5,5,6,7,8,9))

[1] 4 5
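
To connect the formulas for the mean and the median with their R functions, the sketch below applies both to the same vector that was passed to modes() above. The object name x is our own.

```r
x <- c(1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9)

sum(x) / length(x)   # mean computed from the formula
mean(x)              # same result with the built-in function

# n = 11 is odd, so the median is the value in position (n + 1) / 2 = 6
sort(x)[(length(x) + 1) / 2]
median(x)
```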

6.5.2 Measures of dispersion

Statistic	Definition	Formula	R function
Variance (Var)	Average squared deviation from the mean; measures how far the values spread around the average. R computes the sample variance, which divides by n - 1.	\[ Var = \frac{\sum{(x_i-\overline{x})^2}}{n-1} \]	`var(x)`
Standard deviation (SD)	Square root of the variance.	\[ SD = \sqrt{\frac{\sum{(x_i-\overline{x})^2}}{n-1}} \]	`sd(x)`
Minimum (Min)	Minimum value of a variable	N/A	`min(x)`
Maximum (Max)	Maximum value of a variable	N/A	`max(x)`
Quartiles	The values below which 25% (Q1), 50% (Q2, also the median), and 75% (Q3) of the data points are found when arranged in increasing order.	N/A	`Q1 = quantile(x, 0.25)` `Q2 = quantile(x, 0.5)` `Q3 = quantile(x, 0.75)`
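
R's var() and sd() compute the sample versions of these statistics, which divide by n - 1 rather than by the number of observations. The sketch below checks this by hand on a small vector (the vector and the object names are our own) and also shows that quantile() accepts several probabilities at once.

```r
x <- c(1, 2, 3, 4, 4, 5, 5, 6, 7, 8, 9)
n <- length(x)

# Sample variance computed by hand, dividing by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

manual_var
var(x)            # matches the manual calculation

sqrt(manual_var)
sd(x)             # the standard deviation is the square root of the variance

# The three quartiles in a single call
quantile(x, c(0.25, 0.5, 0.75))
```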

6.5.3 Measures of symmetry

The psych package includes two functions to calculate the skewness (skew()) and the kurtosis (kurtosi()). These measures tell you whether the distribution of the values deviates from the normal distribution. A skewness above 1 or below -1 indicates a distribution that is skewed to the right or to the left, respectively. A kurtosis above 1 or below -1 indicates a distribution that is too peaked or too flat, respectively.
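
To get a feel for these two measures, the sketch below compares a symmetric vector with one that has a long right tail. Both vectors are made up for illustration, and the exact values returned depend on the defaults of skew() and kurtosi().

```r
library(psych)  # provides skew() and kurtosi()

# Two made-up vectors (illustrative values only)
symmetric    <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
right_skewed <- c(1, 1, 1, 2, 2, 3, 4, 10, 25)

skew(symmetric)       # close to 0: no skew
skew(right_skewed)    # positive: long tail on the right

kurtosi(symmetric)    # negative: flatter than a normal distribution
kurtosi(right_skewed)
```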

6.6 Creating a descriptive statistics summary

We can easily create a table with the descriptive statistics summary for as many numerical variables as we want using the pivot_longer(), group_by() and summarize() functions that you already learned about. In the code below, I reduce the size of the fonts in my table with kable_styling(font_size = 10) so that the table can fit on the page.

library(psych) # Load the psych library for the skew() and kurtosi() functions

mpg %>%
  pivot_longer(c("displ","hwy","cty"), # this is where we specify which variables to include
               names_to = "variable", 
               values_to = "value") %>% 
  group_by(variable) %>% 
  summarize(n = n(),
            mean = mean(value),
            sd = sd(value),
            var = var(value),
            q1 = quantile(value,0.25),
            median = median(value),
            q3 = quantile(value,0.75),
            min = min(value),
            max = max(value),
            skew = skew(value),
            kurtosis = kurtosi(value)
            ) %>% 
  kbl() %>% 
  kable_styling(font_size = 10)

variable	n	mean	sd	var	q1	median	q3	min	max	skew	kurtosis
cty	234	16.858974	4.255946	18.113074	14.0	17.0	19.0	9.0	35	0.7863773	1.4305385
displ	234	3.471795	1.291959	1.669158	2.4	3.3	4.6	1.6	7	0.4386361	-0.9105615
hwy	234	23.440171	5.954643	35.457779	18.0	24.0	27.0	12.0	44	0.3645158	0.1369447
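
For a quick look at the same variables, the psych package also offers describe(), which returns a set of descriptive statistics in one call. The sketch below is an alternative to, not a replacement for, the table above: it reports a different selection of statistics (for example, it does not include the quartiles), and the conversion with as.data.frame() is a precaution we add ourselves. The object name quick_summary is our own.

```r
library(tidyverse)  # loads dplyr and ggplot2 (which provides mpg)
library(psych)      # provides describe()

quick_summary <- mpg %>%
  select(displ, hwy, cty) %>%   # the numerical variables to describe
  as.data.frame() %>%           # plain data frame, as a precaution
  describe()

quick_summary
```

Building the table yourself with summarize() takes more code, but it lets you choose exactly which statistics to report and how to name them.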

6.7 Summary

In this chapter, we learned how to produce clear, well-presented tables that summarize data, help us and our readers understand it, and let us check that it is adequate for the intended analyses.

6.8 Homework

The homework for this week is lab 4, in which you will summarize a dataset of your choice.