9  Logistic regression

9.1 Learning objectives

  • Logistic regression

9.2 Introduction

Regression is a method used to determine the relationship between a dependent variable (the variable we want to predict) and one or more independent variables (the predictors available to make the prediction). There are a wide variety of regression methods, but in this course we will learn two: logistic regression, which is used to predict a categorical dependent variable, and linear regression, which is used to predict a continuous dependent variable.

In this chapter, we focus on binomial logistic regression (we will refer to it as logistic regression or simply regression in the rest of the chapter), which means that our dependent variable is dichotomous (e.g., yes or no, pass vs. fail). Ordinal logistic regression (for ordinal dependent variables) and multinomial logistic regression (for nominal variables with more than two categories) are beyond the scope of the course.

9.3 Logistic regression

9.3.1 Logistic function

Logistic regression is called this way because it fits a logistic function (an s-shaped curve) to the data to model the relationship between the predictors and a categorical outcome. More specifically, it models the probability of an outcome (the dependent categorical variable) based on the value of the independent variable. Here’s an example:

The model gives us, for each value of the independent variable (Copper content, in this example), the probability that the painting is an original. The point where the logistic curve crosses .5 (50%) on the y-axis is where the cut-off happens: the model predicts that any painting with a copper content above that point is an original.
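
To give a concrete idea of what such a plot involves, here is a minimal sketch that simulates some entirely made-up copper content data (the Copper and Original variables below are invented for illustration, not taken from a real dataset) and draws the fitted logistic curve:

# Simulated data for illustration only: copper content and whether each painting is an original.
library(tidyverse)
set.seed(123)
paintings <- tibble(Copper = runif(200, min = 0, max = 10)) %>% 
  mutate(Original = rbinom(200, size = 1, prob = plogis(Copper - 5)))

# Plot the simulated observations and the fitted logistic (S-shaped) curve.
ggplot(paintings, aes(Copper, Original)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "glm", method.args = list(family = binomial), se = FALSE) +
  labs(x = "Copper content", y = "Probability that the painting is an original")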

9.3.2 Odds

We can convert probabilities into odds by dividing the probability p of one outcome by the probability of the other outcome; because there are only two outcomes, odds = p / (1 - p). For example, say we have a bag of 10 balls, 2 red and 8 black. If we draw a ball at random, we have an 8/10 = 80% chance of drawing a black ball. The odds of drawing a black ball are thus 0.8/(1 - 0.8) = 0.8/0.2 = 4: there is a 4 to 1 chance that we'll draw a black ball rather than a red one.

However, the output of the logistic regression model is the natural logarithm of the odds: log odds = ln(p / (1 - p)), so it is not so easily interpreted.
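
We can check these conversions with a few lines of base R, using the black ball example above:

p <- 0.8              # probability of drawing a black ball
odds <- p / (1 - p)   # 0.8 / 0.2 = 4: the odds of black versus red
log_odds <- log(odds) # the natural logarithm of the odds (the scale of the model's output)
exp(log_odds)         # converts the log odds back to odds (4)
plogis(log_odds)      # converts the log odds back to a probability (0.8)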

9.4 Logistic regression example

9.4.1 Loading and preparing data

library(tidyverse)
library(kableExtra) # provides kbl() and kable_classic(), used to format the tables below
data <- read_csv("https://pmongeon.github.io/info6270/files/data/titanic.csv")
head(data) %>% 
  kbl() %>% 
  kable_classic()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NA S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NA S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NA S
6 0 3 Moran, Mr. James male NA 0 0 330877 8.4583 NA Q

9.4.1.1 Choose a set of predictors (independent variables)

Looking at our dataset, we can identify some variables that we think might affect the probability that a passenger survived. In our model, we will choose Sex, Age, Pclass, and Fare. The SibSp and Parch variables represent, respectively, the combined number of siblings and spouses and the combined number of parents and children a passenger had on board. We will add them together to create a fifth predictor called FamilySize.

data <- data %>% 
  mutate(FamilySize = SibSp + Parch)

We can now remove the variables that we don’t need in our model by selecting the ones we want to keep.

data <- data %>% 
  select(Survived, Sex, Age, Fare, FamilySize, Pclass)

head(data) %>% 
  kbl() %>% 
  kable_classic()
Survived Sex Age Fare FamilySize Pclass
0 male 22 7.2500 1 3
1 female 38 71.2833 1 1
1 female 26 7.9250 0 3
1 female 35 53.1000 1 1
0 male 35 8.0500 0 3
0 male NA 8.4583 0 3

9.4.1.2 Set categorical variables as factors

It is good practice to set categorical variables as factors before fitting regression models in R, so that R treats them as categories rather than as numbers or plain text. In our case there are three: Sex, Survived, and Pclass.

data <- data %>% 
  mutate(Sex = as_factor(Sex),
         Survived = as_factor(Survived),
         Pclass = as_factor(Pclass))
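
If we want to confirm that the conversion worked, a quick way is to inspect the structure of the data, where the three variables should now be listed as factors:

str(data) # Sex, Survived, and Pclass should now be shown as factors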

9.4.1.3 Dealing with missing data

We can count the number of empty (NA) cells in each variable to see if some data is missing.

sum(is.na(data$Survived))
[1] 0
sum(is.na(data$Sex))
[1] 0
sum(is.na(data$Age))
[1] 177
sum(is.na(data$Fare))
[1] 0
sum(is.na(data$FamilySize))
[1] 0
sum(is.na(data$Pclass))
[1] 0
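
Instead of checking each variable one by one, we could also count the missing values in every column at once; the line below is an equivalent shortcut:

colSums(is.na(data)) # number of NA values in each column of the dataset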

Once we have identified that some columns contain missing data, we have two choices: we can do nothing, in which case these cases will be left out of the regression model, or we can fill the empty cells in some way (this is called imputation). We have many missing values (177 out of 891 observations is quite a large share), and leaving out these observations could negatively affect the performance of our regression model. Therefore, we will assign the average age to all 177 missing Age values, a typical imputation approach in which NA values are replaced with an estimate based on the available data.

# We use na.rm = TRUE otherwise mean(Age) would return NA due to the missing values.
data <- data %>% 
  mutate(Age = replace_na(Age, round(mean(Age, na.rm=TRUE),0)))

head(data) %>% 
  kbl() %>% 
  kable_classic()
Survived Sex Age Fare FamilySize Pclass
0 male 22 7.2500 1 3
1 female 38 71.2833 1 1
1 female 26 7.9250 0 3
1 female 35 53.1000 1 1
0 male 35 8.0500 0 3
0 male 30 8.4583 0 3

9.4.2 Visualizing the relationships

To explore the relationship between variables, we can visualize the distribution of independent variable values for each value of the dependent variable. We can use box plots or violin plots for continuous independent variables and bar charts for the categorical variables. To make the process faster, let's briefly untidy the data and use the gather() function to create key-value pairs for each observation of the dependent variable.

data_untidy <- gather(data, key = "Variable", value = "Value",
                        -Survived) # Creates key-value pairs for all columns except Survived
head(data_untidy) %>% 
  kbl() %>% 
  kable_classic()
Survived Variable Value
0 Sex male
1 Sex female
1 Sex female
1 Sex female
0 Sex male
0 Sex male

We can now easily create box plots showing the distribution of each continuous independent variable for each value of the outcome.

data_untidy %>% 
  filter(Variable != "Pclass" & Variable != "Sex") %>%
  ggplot(aes(Survived, as.numeric(Value))) +
  facet_wrap(~ Variable, scales = "free_y") +
  geom_boxplot() # the median and quartiles are drawn by default; draw_quantiles is a geom_violin() argument

And now we make bar charts for our categorical independent variables, using position = "fill" so that each bar shows the proportion of passengers who survived.

data_untidy %>%
  filter(Variable == "Pclass" | Variable == "Sex") %>%
  ggplot(aes(Value, fill = Survived)) +
  facet_wrap(~ Variable, scales = "free_x") +
  geom_bar(position = "fill")

9.4.3 Creating the model

The following code generates our logistic regression model using the glm() function (glm stands for generalized linear model). The syntax is glm(predicted variable ~ predictor1 + predictor2 + predictor3 ..., data, family), where data is our dataset and family is the type of regression model we want to create. In our case, the family is binomial.

model <- glm(Survived ~ Sex + Age + Fare + FamilySize + Pclass,
             data = data,
             family = binomial)

9.4.3.1 Model summary

Now that we have created our model, we can look at the coefficients (estimates), which tell us about the relationship between our predictors and the predicted variable. The Pr(>|z|) column contains the p-value, which indicates whether the observed effect is statistically significant. It is common to use 0.05 as the threshold for statistical significance, so all the effects in our model are statistically significant (p < 0.05) except for the fare (p > 0.05).

summary(model)$coefficients %>% 
  kbl() %>% 
  kable_classic()
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0377996 0.3932615 2.638956 0.0083162
Sexfemale 2.7763022 0.1985984 13.979481 0.0000000
Age -0.0388614 0.0078206 -4.969087 0.0000007
Fare 0.0032132 0.0024551 1.308796 0.1906036
FamilySize -0.2435093 0.0676841 -3.597733 0.0003210
Pclass2 -1.0021830 0.2929527 -3.420972 0.0006240
Pclass3 -2.1318527 0.2891435 -7.372993 0.0000000

9.4.3.2 Converting log odds to odds ratio

As we mentioned above, the coefficients produced by the model are log odds, which are difficult to interpret. We can convert them to odds ratios, which are easier to interpret, with the exp() function. We can now see that, according to our model, the odds of survival for female passengers were about 16 times the odds for male passengers.

bind_rows(exp(model$coefficients)) %>% 
  kbl() %>% 
  kable_classic()
(Intercept) Sexfemale Age Fare FamilySize Pclass2 Pclass3
2.822998 16.05953 0.961884 1.003218 0.7838722 0.3670772 0.1186173

9.4.3.3 Adding confidence intervals

The confidence intervals are an estimation of the precision of the odds ratios. In the example below, we use a 95% confidence interval, which means that we estimate, with 95% confidence, that the true odds ratio for each predictor lies between the lower (2.5%) and upper (97.5%) bounds reported in the table. If we were using a sample to make claims about a population, which does not really apply here due to the unique case of the Titanic, we could then think of the confidence interval as indicating a 95% probability that the true coefficient for the entire population lies between those two values.

odds_ratio <- cbind(Odds_Ratio = exp(model$coefficients), 
                    exp(confint(model, level = .95)))

odds_ratio %>% 
  kbl() %>% 
  kable_classic()
Odds_Ratio 2.5 % 97.5 %
(Intercept) 2.8229984 1.3083312 6.1328871
Sexfemale 16.0595260 10.9723363 23.9218257
Age 0.9618840 0.9469679 0.9764876
Fare 1.0032184 0.9987147 1.0086269
FamilySize 0.7838722 0.6827452 0.8907643
Pclass2 0.3670772 0.2060871 0.6511201
Pclass3 0.1186173 0.0670531 0.2089431

9.4.3.4 Model predictions

First we add to our data the probability that the passenger survived as calculated by the model.

data <- tibble(data,
               Probability = model$fitted.values)

head(data) %>% 
  kbl() %>% 
  kable_classic()
Survived Sex Age Fare FamilySize Pclass Probability
0 male 22 7.2500 1 3 0.1025490
1 female 38 71.2833 1 1 0.9107566
1 female 26 7.9250 0 3 0.6675927
1 female 35 53.1000 1 1 0.9153720
0 male 35 8.0500 0 3 0.0810373
0 male 30 8.4583 0 3 0.0968507

Then we obtain the prediction by creating a new variable called Prediction and setting its value to 1 if the calculated probability of survival is greater than 50% and to 0 otherwise.

data <- data %>% 
  mutate(Prediction = if_else(Probability > 0.5,1,0))

head(data) %>% 
  kbl() %>% 
  kable_classic()
Survived Sex Age Fare FamilySize Pclass Probability Prediction
0 male 22 7.2500 1 3 0.1025490 0
1 female 38 71.2833 1 1 0.9107566 1
1 female 26 7.9250 0 3 0.6675927 1
1 female 35 53.1000 1 1 0.9153720 1
0 male 35 8.0500 0 3 0.0810373 0
0 male 30 8.4583 0 3 0.0968507 0

Then we can print a table comparing the model's predictions to the actual outcomes.

table(Survived = data$Survived, 
      Predicted = data$Prediction)
        Predicted
Survived   0   1
       0 476  73
       1 101 241

Finally, we can calculate how well our model fits the data (how good it was at predicting the dependent variable) by testing, for each passenger in the dataset, whether the prediction matched the actual outcome for that passenger. Because that test gives TRUE (1) or FALSE (0) for every row of the dataset, calculating the mean of the test results gives us the proportion of correct predictions, which in our case is about 80.5%.

mean(data$Survived == data$Prediction)
[1] 0.8047138
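
As a quick check, the same accuracy can be obtained from the confusion matrix above by dividing the number of correct predictions (the diagonal of the table) by the total number of passengers:

confusion <- table(Survived = data$Survived, Predicted = data$Prediction)
sum(diag(confusion)) / sum(confusion) # (476 + 241) / 891 = 0.8047138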

9.5 Summary

In this chapter we introduced the logistic regression model as a useful tool to predict a dichotomous categorical variable based on one or multiple predictors that can be either categorical or continuous. The R scripts provided will work with any dataset, provided that it is in a tidy format. You now have everything you need to try to predict categorical outcomes in all kinds of scenarios and with all kinds of data.

9.6 Lab

For this week’s lab, your task is to build a logistic regression model on a dataset of your choice. I encourage you to use the dataset that you will be using for your individual project, if it contains a dichotomous variable that you can try to predict using other variables in your dataset.