Working with data - 9 Measuring the relationship between two variables

9.1 Two categorical variables

When we want to combine multiple categorical variable to analyze them jointly, we can use a contingency table (aka cross-tabulation or crosstab), which displays the frequency (or relative frequency) distribution two variables. Here’s an example from the Titanic dataset, where we combined the Embark and the Sex columns

Table 9.1. Frequency of male and female Titanic passengers by port of embarkation
	female	male	Sum
Cherbourg	113	157	270
Queenstown	60	63	123
Southampton	291	623	914
Sum	464	843	1307

In this table:

The column represent the categories of one variable (sex: male and female).
The rows represent the categories of another variable (Embark: Cherbourg, Queenstown, Southampton).
The cells show the frequency counts of the combinations of these categories.

9.1.1 Demo

9.2 A categorical and a numerical variable

As we saw in Chapter 8, numerical data can be summarized using a descriptive statistics summary that includes measures of central tendency and measures of dispersion. To summarize the relationship between a categorical variable and a numerical, we simply have to calculate these measures for each group of the categorical variable. For example, the table belows shows the descriptive statistics summary for the fare variable for each passenger class.

Table 9.2. Descriptive statistics summary of fare for each passenger class
pclass	N	Mean	SD	VAR	Q1	Median	Q3	Max
1	323	87.509	80.447	6471.748	30.696	60.000	107.662	512.329
2	277	21.179	13.607	185.154	13.000	15.046	26.000	73.500
3	708	13.303	11.494	132.120	7.750	8.050	15.246	69.550

9.2.1 Demo

9.3 Two numerical variables

The relationship between two numerical variable can be captured by a single number called the correlation coefficient. It is calculated using the CORREL(x:x,y:y) function in Excel, where x:x is a column of numerical data, and y:y is another column of numerical data.

Suppose you have the following data:

A	B
1	2
2	4
3	6
4	8
5	10

To find the correlation coefficient between columns A and B, you would type =CORREL(A1:A5, B1:B5) in a new cell. The result will be 1, which indicates a perfect positive correlation. 0 would mean that there is no correlation at all (no relationship between the two variables), and -1 would indicate a perfect negative correlation. The closer the value is to 1 or -1, the stronger the linear relationship between the two variables.

9.1 Two categorical variables

9.1.1 Demo

9.2 A categorical and a numerical variable

9.2.1 Demo

9.3 Two numerical variables

9.3.1 Demo