9  Measuring the relationship between two variables

9.1 Two categorical variables

When we want to combine multiple categorical variable to analyze them jointly, we can use a contingency table (aka cross-tabulation or crosstab), which displays the frequency (or relative frequency) distribution two variables. Here’s an example from the Titanic dataset, where we combined the Embark and the Sex columns

Table 9.1. Frequency of male and female Titanic passengers by port of embarkation
female male Sum
Cherbourg 113 157 270
Queenstown 60 63 123
Southampton 291 623 914
Sum 464 843 1307

In this table:

  • The column represent the categories of one variable (sex: male and female).
  • The rows represent the categories of another variable (Embark: Cherbourg, Queenstown, Southampton).
  • The cells show the frequency counts of the combinations of these categories.

9.1.1 Demo

9.2 A categorical and a numerical variable

As we saw in Chapter 8, numerical data can be summarized using a descriptive statistics summary that includes measures of central tendency and measures of dispersion. To summarize the relationship between a categorical variable and a numerical, we simply have to calculate these measures for each group of the categorical variable. For example, the table belows shows the descriptive statistics summary for the fare variable for each passenger class.

Table 9.2. Descriptive statistics summary of fare for each passenger class
pclass N Mean SD VAR Q1 Median Q3 Min Max
1 323 87.509 80.447 6471.748 30.696 60.000 107.662 0 512.329
2 277 21.179 13.607 185.154 13.000 15.046 26.000 0 73.500
3 708 13.303 11.494 132.120 7.750 8.050 15.246 0 69.550

9.2.1 Demo

9.3 Two numerical variables

The relationship between two numerical variable can be captured by a single number called the correlation coefficient. It is calculated using the CORREL(x:x,y:y) function in Excel, where x:x is a column of numerical data, and y:y is another column of numerical data.

Suppose you have the following data:

A B
1 2
2 4
3 6
4 8
5 10

To find the correlation coefficient between columns A and B, you would type =CORREL(A1:A5, B1:B5) in a new cell. The result will be 1, which indicates a perfect positive correlation. 0 would mean that there is no correlation at all (no relationship between the two variables), and -1 would indicate a perfect negative correlation. The closer the value is to 1 or -1, the stronger the linear relationship between the two variables.

9.3.1 Demo