In this chapter, we use numbers and tables to summarize our dataset. When the goals of our analysis are descriptive, these summaries can sometimes be sufficient on their own. They also help us (and our readers) get to know our data and check that it is adequate for the intended analyses. First we will consider what types of variables we have in our dataset (these are related to, but not the same as, the R data types), and then we will go through the process of generating useful summaries for variables of different types.
6.3 Types of variable
6.3.1 Categorical variable
Categorical variables represent groups or categories. They can be represented by characters or numbers.
Nominal variables represent categories or groups that have no logical order (e.g., gender, occupation, course, program, university).
Ordinal variables represent categories or groups that have a logical order (e.g., age groups).
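In R, one common way to record that a categorical variable is ordinal is an ordered factor. The sketch below uses invented age-group values (they are not part of any dataset in this chapter) to show the idea.

```r
# Hypothetical age-group values, invented for illustration.
age_group <- c("18-24", "45-64", "25-44", "18-24", "65+")

# An ordered factor records that the categories have a logical order.
age_group <- factor(
  age_group,
  levels = c("18-24", "25-44", "45-64", "65+"),
  ordered = TRUE
)

levels(age_group)    # the categories, in their logical order
age_group < "45-64"  # comparisons respect that order
```

A nominal variable would use a plain factor() without ordered = TRUE, because comparing its categories with < or > has no meaning.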
6.3.2 Numerical variables
Numerical variables are represented by numbers.
Discrete numerical variables can only take a countable set of values (like the numbers on a die or the number of pets a person has). Another way to think about them is that they are things that can be counted (number of cars, number of students, number of pets).
Continuous variables can theoretically take any value within a range (e.g., the weight or height of a person, the distance between two cities, a price). They are things that are measured rather than counted.
Categories represented with numbers
It is important to look at your data to understand what the values represent. Sometimes groups are represented with numbers (numeric codes, for example). When deciding what type of statistical analysis is appropriate for such a variable, you will most likely want to treat it as categorical, not numerical.
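As a minimal sketch of this situation, assume a small survey tibble in which a region is stored as a numeric code. The data, the column names, and the labels are all hypothetical. Converting the column to a factor tells R to treat the codes as categories.

```r
library(dplyr)

# Hypothetical survey data: `region` is stored as a numeric code.
survey <- tibble(
  respondent = 1:6,
  region = c(1, 3, 2, 1, 3, 3)
)

# Convert the codes to a factor so R treats them as categories.
# The labels are invented for this example.
survey <- survey %>%
  mutate(region = factor(region,
                         levels = c(1, 2, 3),
                         labels = c("East", "Central", "West")))

# A frequency table now makes sense; a mean of the codes would not.
survey %>% count(region)
```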
The following code creates a table that describes the variables included in the mpg dataset.
library(tidyverse)   # tibble(), the pipe, and the mpg dataset (from ggplot2)
library(kableExtra)  # kbl() and kable_classic()

tibble(
  Variable = colnames(mpg),
  Type = c("Nominal", "Nominal", "Continuous", "Discrete", "Discrete", "Nominal",
           "Nominal", "Continuous", "Continuous", "Nominal", "Nominal"),
  Description = c("Manufacturer name", "Model name", "Engine displacement, in litres",
                  "Year of manufacture", "Number of cylinders", "Type of transmission",
                  "Type of drive train", "City miles per gallon", "Highway miles per gallon",
                  "Fuel type", "Type of car")
) %>%
  kbl(caption = "Variables in the mpg dataset.", align = c("l", "l", "l")) %>%
  kable_classic()
Variables in the mpg dataset.

| Variable | Type | Description |
| --- | --- | --- |
| manufacturer | Nominal | Manufacturer name |
| model | Nominal | Model name |
| displ | Continuous | Engine displacement, in litres |
| year | Discrete | Year of manufacture |
| cyl | Discrete | Number of cylinders |
| trans | Nominal | Type of transmission |
| drv | Nominal | Type of drive train |
| cty | Continuous | City miles per gallon |
| hwy | Continuous | Highway miles per gallon |
| fl | Nominal | Fuel type |
| class | Nominal | Type of car |
6.4 Summarizing categorical data
There is not a lot that you can do with a single categorical variable other than reporting the frequency (number) and relative frequency (percentage) of observations for each category.
6.4.1 Frequency
We can use the summarize() and n() functions of the dplyr package (included in the tidyverse) to create a table of frequencies for the manufacturer variable. The group_by() function specifies which categorical variable we want to summarize. If we don't use group_by(), we obtain the number of observations (rows) in the whole tibble.
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  kbl() %>%
  kable_classic()
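| manufacturer | freq |
| --- | --- |
| audi | 18 |
| chevrolet | 19 |
| dodge | 37 |
| ford | 25 |
| honda | 9 |
| hyundai | 14 |
| jeep | 8 |
| land rover | 4 |
| lincoln | 3 |
| mercury | 4 |
| nissan | 13 |
| pontiac | 5 |
| subaru | 14 |
| toyota | 34 |
| volkswagen | 27 |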
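The same frequency table can also be produced in one step with dplyr's count() function, which combines the grouping and the counting. The sketch below omits the kbl() formatting and stores the result in an object named freq_count, a name chosen only for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# count() groups by manufacturer and counts the rows in each group.
# name = "freq" sets the name of the count column (the default is "n").
freq_count <- mpg %>%
  count(manufacturer, name = "freq")

freq_count
```

Using group_by() with summarize() is more flexible when you want several summaries at once; count() is a shortcut for when you only need the counts.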
6.4.2 Relative frequency
The relative frequency is simply the frequency expressed as a proportion of the total rather than as a count. It is obtained by first computing the frequency and then dividing each count by the sum of the counts: mutate(rel_freq = freq/sum(freq)).
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(rel_freq = freq / sum(freq)) %>%
  kbl() %>%
  kable_classic()
| manufacturer | freq | rel_freq |
| --- | --- | --- |
| audi | 18 | 0.0769231 |
| chevrolet | 19 | 0.0811966 |
| dodge | 37 | 0.1581197 |
| ford | 25 | 0.1068376 |
| honda | 9 | 0.0384615 |
| hyundai | 14 | 0.0598291 |
| jeep | 8 | 0.0341880 |
| land rover | 4 | 0.0170940 |
| lincoln | 3 | 0.0128205 |
| mercury | 4 | 0.0170940 |
| nissan | 13 | 0.0555556 |
| pontiac | 5 | 0.0213675 |
| subaru | 14 | 0.0598291 |
| toyota | 34 | 0.1452991 |
| volkswagen | 27 | 0.1153846 |
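To check the arithmetic behind one row, here is a small sketch in base R. It uses table() and prop.table() instead of the dplyr pipeline, purely as a cross-check; the object names counts and rel_freq are chosen for this example.

```r
library(ggplot2)  # provides the mpg dataset

counts <- table(mpg$manufacturer)  # frequencies
rel_freq <- prop.table(counts)     # each count divided by the total

# The audi row: 18 observations out of 234.
counts[["audi"]] / sum(counts)

# Relative frequencies always add up to 1.
sum(rel_freq)
```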
6.4.2.1 Rounding the values
When we calculate the relative frequency, we obtain numbers with a lot of decimals. We can use the round() function to specify the number of decimals we want. The syntax is round(value, number of decimals).
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(rel_freq = round(freq / sum(freq), 3)) %>%
  kbl() %>%
  kable_classic()
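| manufacturer | freq | rel_freq |
| --- | --- | --- |
| audi | 18 | 0.077 |
| chevrolet | 19 | 0.081 |
| dodge | 37 | 0.158 |
| ford | 25 | 0.107 |
| honda | 9 | 0.038 |
| hyundai | 14 | 0.060 |
| jeep | 8 | 0.034 |
| land rover | 4 | 0.017 |
| lincoln | 3 | 0.013 |
| mercury | 4 | 0.017 |
| nissan | 13 | 0.056 |
| pontiac | 5 | 0.021 |
| subaru | 14 | 0.060 |
| toyota | 34 | 0.145 |
| volkswagen | 27 | 0.115 |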
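A quick way to see what the second argument of round() does is to apply it to a single value. The sketch below uses the audi relative frequency (18 out of 234) as the example value.

```r
x <- 18 / 234  # the unrounded relative frequency for audi

round(x, 3)  # keep three decimals
round(x, 1)  # keep one decimal
round(x)     # the number of decimals defaults to 0
```

It is usually safest to round as the last step, once all calculations are done, so that rounding errors do not carry over into later computations.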
6.4.2.2 Converting the relative frequency to percentages
Another thing we might want to do is show the relative frequency as a percentage. This can be done by multiplying the relative frequency by 100.
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(pct = round(freq / sum(freq) * 100, 1)) %>%
  kbl() %>%
  kable_classic()
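| manufacturer | freq | pct |
| --- | --- | --- |
| audi | 18 | 7.7 |
| chevrolet | 19 | 8.1 |
| dodge | 37 | 15.8 |
| ford | 25 | 10.7 |
| honda | 9 | 3.8 |
| hyundai | 14 | 6.0 |
| jeep | 8 | 3.4 |
| land rover | 4 | 1.7 |
| lincoln | 3 | 1.3 |
| mercury | 4 | 1.7 |
| nissan | 13 | 5.6 |
| pontiac | 5 | 2.1 |
| subaru | 14 | 6.0 |
| toyota | 34 | 14.5 |
| volkswagen | 27 | 11.5 |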
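If you want the percent sign to appear in the table, one option is to build a text label with base R's sprintf(). This is a sketch of one possible approach, and pct_label and pct_table are names made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

pct_table <- mpg %>%
  count(manufacturer, name = "freq") %>%
  # "%.1f" formats the number with one decimal; "%%" prints a literal % sign.
  mutate(pct_label = sprintf("%.1f%%", freq / sum(freq) * 100))

pct_table
```

Note that pct_label is a character column, so it sorts as text. Keep a numeric column as well if you plan to order the table by percentage.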
6.4.2.3 Ordering the categories
I can use arrange() to reorder my table alphabetically or by frequency. The syntax of the arrange function is arrange(x, variable to use for ordering). Rows are sorted in ascending order by default. To sort in descending order, wrap the variable in desc(), like this: arrange(x, desc(variable to use for ordering)). You can use arrange() with numeric values or characters.
6.4.2.3.1 Order by frequency (ascending)
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(pct = round(freq / sum(freq) * 100, 1)) %>%
  arrange(freq) %>%
  kbl() %>%
  kable_classic()
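| manufacturer | freq | pct |
| --- | --- | --- |
| lincoln | 3 | 1.3 |
| land rover | 4 | 1.7 |
| mercury | 4 | 1.7 |
| pontiac | 5 | 2.1 |
| jeep | 8 | 3.4 |
| honda | 9 | 3.8 |
| nissan | 13 | 5.6 |
| hyundai | 14 | 6.0 |
| subaru | 14 | 6.0 |
| audi | 18 | 7.7 |
| chevrolet | 19 | 8.1 |
| ford | 25 | 10.7 |
| volkswagen | 27 | 11.5 |
| toyota | 34 | 14.5 |
| dodge | 37 | 15.8 |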
6.4.2.3.2 Order by frequency (descending)
mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(pct = round(freq / sum(freq) * 100, 1)) %>%
  arrange(desc(freq)) %>%
  kbl() %>%
  kable_classic()
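| manufacturer | freq | pct |
| --- | --- | --- |
| dodge | 37 | 15.8 |
| toyota | 34 | 14.5 |
| volkswagen | 27 | 11.5 |
| ford | 25 | 10.7 |
| chevrolet | 19 | 8.1 |
| audi | 18 | 7.7 |
| hyundai | 14 | 6.0 |
| subaru | 14 | 6.0 |
| nissan | 13 | 5.6 |
| honda | 9 | 3.8 |
| jeep | 8 | 3.4 |
| pontiac | 5 | 2.1 |
| land rover | 4 | 1.7 |
| mercury | 4 | 1.7 |
| lincoln | 3 | 1.3 |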
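Some manufacturers have the same frequency (for example, hyundai and subaru both have 14). arrange() accepts several variables, and later variables are used to break ties in earlier ones. The sketch below sorts by descending frequency and then alphabetically within ties; sorted is a name chosen for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

sorted <- mpg %>%
  count(manufacturer, name = "freq") %>%
  # First sort by descending frequency, then by name within equal frequencies.
  arrange(desc(freq), manufacturer)

sorted
```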
6.4.3 Adding a total
In chapter 3 we learned about the bind_rows() function, which appends one tibble to another. The process for adding a new row with a total is:
Store your frequency table in an object.
Create a new tibble that contains the totals. (Important: make sure that the column names of this tibble are exactly the same as those of your frequency table; otherwise, bind_rows() will create new columns.)
Use bind_rows() to add the tibble with the total to the frequency table.
# Store your frequency table in an object
table <- mpg %>%
  group_by(manufacturer) %>%
  summarize(freq = n()) %>%  # n() counts the number of observations for each group.
  mutate(pct = round(freq / sum(freq) * 100, 1)) %>%
  arrange(desc(freq))

# Create a new tibble that contains the totals.
totals <- table %>%
  summarize(freq = sum(freq), pct = sum(pct)) %>%
  mutate(manufacturer = "Total") %>%
  select(manufacturer, freq, pct)

# Use bind_rows() to combine the tibbles
table %>%
  bind_rows(totals) %>%
  kbl() %>%
  kable_classic()
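| manufacturer | freq | pct |
| --- | --- | --- |
| dodge | 37 | 15.8 |
| toyota | 34 | 14.5 |
| volkswagen | 27 | 11.5 |
| ford | 25 | 10.7 |
| chevrolet | 19 | 8.1 |
| audi | 18 | 7.7 |
| hyundai | 14 | 6.0 |
| subaru | 14 | 6.0 |
| nissan | 13 | 5.6 |
| honda | 9 | 3.8 |
| jeep | 8 | 3.4 |
| pontiac | 5 | 2.1 |
| land rover | 4 | 1.7 |
| mercury | 4 | 1.7 |
| lincoln | 3 | 1.3 |
| Total | 234 | 99.9 |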
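The total percentage shows 99.9 rather than 100 because the rounded percentages are being added together. One possible way around this, sketched below, is to keep the percentages unrounded until after the total row has been added. The object names freq_table and with_total are chosen for this example, and the kbl() formatting is left out.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Frequency table with unrounded percentages.
freq_table <- mpg %>%
  count(manufacturer, name = "freq") %>%
  mutate(pct = freq / sum(freq) * 100) %>%
  arrange(desc(freq))

# Totals computed from the unrounded values.
totals <- freq_table %>%
  summarize(manufacturer = "Total", freq = sum(freq), pct = sum(pct))

# Combine, then round as the very last step.
with_total <- freq_table %>%
  bind_rows(totals) %>%
  mutate(pct = round(pct, 1))

with_total
```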
6.5 Summarizing numerical data
Summarizing numerical data is not done with frequency tables but with statistical summaries that include various measures that can be divided into three groups:
Measures of centrality
Measures of dispersion
Measures of skewness
6.5.1 Measures of centrality
| Statistic | Description | Formula | R function |
| --- | --- | --- | --- |
| Mean | The sum of values divided by the number of observations | \(\overline{X} = \frac{\sum{X}}{n}\) | mean(x) |
| Median | The middle value of the variable once sorted in ascending or descending order | | median(x) |
| Quartiles | The values under which 25% (Q1), 50% (Q2, also the median), and 75% (Q3) of the data points are found when arranged in increasing order | | Q1 = quantile(x, 0.25); Q2 = quantile(x, 0.5); Q3 = quantile(x, 0.75) |
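As a short illustration, the functions above can be applied directly to the hwy variable of the mpg dataset. The object name quartiles is chosen for this example.

```r
library(ggplot2)  # provides the mpg dataset

mean(mpg$hwy)    # sum of the values divided by the number of observations
median(mpg$hwy)  # middle value once sorted

# quantile() accepts several probabilities at once.
quartiles <- quantile(mpg$hwy, probs = c(0.25, 0.5, 0.75))
quartiles
```

Passing a vector of probabilities returns Q1, Q2, and Q3 in a single call, and the middle value matches the result of median().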
6.5.3 Measures of symmetry
The psych package includes two functions to calculate the skewness (skew()) and the kurtosis (kurtosi()). These measures tell you if the values deviate from the normal distribution. A skewness above 1 or below -1 indicates a distribution skewed to the right or to the left, respectively. A kurtosis above 1 or below -1 indicates a distribution that is too peaked or too flat, respectively.
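To build some intuition for what skewness measures, here is a minimal moment-based sketch in base R. The function skew_sketch() is written for this example only and is not part of psych. Skewness can be estimated in more than one way, so values from skew() may differ slightly from this sketch.

```r
# Moment-based skewness: the average cubed deviation from the mean,
# scaled by the cubed standard deviation.
# skew_sketch() is a hypothetical helper written for illustration.
skew_sketch <- function(x) {
  x <- x[!is.na(x)]  # drop missing values
  mean((x - mean(x))^3) / sd(x)^3
}

skew_sketch(c(1, 2, 3, 4, 5))      # symmetric values: skewness of 0
skew_sketch(c(1, 1, 2, 2, 3, 10))  # long right tail: positive skewness
```

Cubing the deviations keeps their sign, so a few large values on the right push the result above zero and a few large values on the left push it below zero.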
6.6 Creating a descriptive statistics summary
We can easily create a table with the descriptive statistics summary for as many numerical variables as we want using the pivot_longer(), group_by() and summarize() functions that you already learned about. In the code below, I reduce the size of the fonts in my table with kable_styling(font_size = 10) so that the table can fit on the page.
library(psych)  # Load the psych library for the skew() and kurtosi() functions

mpg %>%
  pivot_longer(
    c("displ", "hwy", "cty"),  # this is where we specify which variables to include
    names_to = "variable",
    values_to = "value"
  ) %>%
  group_by(variable) %>%
  summarize(
    n = n(),
    mean = mean(value),
    sd = sd(value),
    var = var(value),
    q1 = quantile(value, 0.25),
    median = median(value),
    q3 = quantile(value, 0.75),
    min = min(value),
    max = max(value),
    skew = skew(value),
    kurtosis = kurtosi(value)
  ) %>%
  kbl() %>%
  kable_styling(font_size = 10)
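| variable | n | mean | sd | var | q1 | median | q3 | min | max | skew | kurtosis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cty | 234 | 16.858974 | 4.255946 | 18.113074 | 14.0 | 17.0 | 19.0 | 9.0 | 35 | 0.7863773 | 1.4305385 |
| displ | 234 | 3.471795 | 1.291959 | 1.669158 | 2.4 | 3.3 | 4.6 | 1.6 | 7 | 0.4386361 | -0.9105615 |
| hwy | 234 | 23.440171 | 5.954643 | 35.457779 | 18.0 | 24.0 | 27.0 | 12.0 | 44 | 0.3645158 | 0.1369447 |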
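Another possible route to a similar summary, sketched here as an alternative and not as the method used in this chapter, is dplyr's across() function. It applies a list of functions to several columns without reshaping the data first. The result has one row and one column per variable and statistic, and wide_summary is a name chosen for this example.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

wide_summary <- mpg %>%
  summarize(across(
    c(displ, hwy, cty),          # the variables to summarize
    list(mean = mean, sd = sd),  # the statistics to compute
    .names = "{.col}_{.fn}"      # column names such as displ_mean
  ))

wide_summary
```

The pivot_longer() approach gives one row per variable, which is usually easier to read in a report. The across() version is more compact to write but produces a very wide table as more statistics are added.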
6.7 Summary
In this chapter, we learned how to produce clear and well-presented tables that summarize our data, help us and our readers understand it, and confirm that it is adequate for the intended analyses.
6.8 Homework
The homework for this week is lab 4, in which you will summarize a dataset of your choice.