female | male | Sum | |
---|---|---|---|
Cherbourg | 113 | 157 | 270 |
Queenstown | 60 | 63 | 123 |
Southampton | 291 | 623 | 914 |
Sum | 464 | 843 | 1307 |
9 Measuring the relationship between two variables
9.1 Two categorical variables
When we want to combine multiple categorical variable to analyze them jointly, we can use a contingency table (aka cross-tabulation or crosstab), which displays the frequency (or relative frequency) distribution two variables. Here’s an example from the Titanic dataset, where we combined the Embark and the Sex columns
In this table:
- The column represent the categories of one variable (sex: male and female).
- The rows represent the categories of another variable (Embark: Cherbourg, Queenstown, Southampton).
- The cells show the frequency counts of the combinations of these categories.
9.1.1 Demo
9.2 A categorical and a numerical variable
As we saw in Chapter 8, numerical data can be summarized using a descriptive statistics summary that includes measures of central tendency and measures of dispersion. To summarize the relationship between a categorical variable and a numerical, we simply have to calculate these measures for each group of the categorical variable. For example, the table belows shows the descriptive statistics summary for the fare variable for each passenger class.
pclass | N | Mean | SD | VAR | Q1 | Median | Q3 | Min | Max |
---|---|---|---|---|---|---|---|---|---|
1 | 323 | 87.509 | 80.447 | 6471.748 | 30.696 | 60.000 | 107.662 | 0 | 512.329 |
2 | 277 | 21.179 | 13.607 | 185.154 | 13.000 | 15.046 | 26.000 | 0 | 73.500 |
3 | 708 | 13.303 | 11.494 | 132.120 | 7.750 | 8.050 | 15.246 | 0 | 69.550 |
9.2.1 Demo
9.3 Two numerical variables
The relationship between two numerical variable can be captured by a single number called the correlation coefficient. It is calculated using the CORREL(x:x,y:y)
function in Excel, where x:x
is a column of numerical data, and y:y
is another column of numerical data.
Suppose you have the following data:
A | B |
---|---|
1 | 2 |
2 | 4 |
3 | 6 |
4 | 8 |
5 | 10 |
To find the correlation coefficient between columns A and B, you would type =CORREL(A1:A5, B1:B5)
in a new cell. The result will be 1, which indicates a perfect positive correlation. 0 would mean that there is no correlation at all (no relationship between the two variables), and -1 would indicate a perfect negative correlation. The closer the value is to 1 or -1, the stronger the linear relationship between the two variables.