MTH-161D | Spring 2025 | University of Portland
February 10, 2025
These slides are derived from Diez et al. (2012).
Data Matrix
Data collected on students in a statistics class on a variety of variables:
Example Data Matrix
Exploratory Analysis
It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.
Descriptive statistics
It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.
For Numerical Variables
For Categorical Variables
The Titanic Data Set is a popular dataset in data science. It contains information about the people aboard the Titanic and is often used for survival prediction.
The sinking of the Titanic illustration by Willy Stöwer
The Titanic Data Matrix
The data set contains \(2207\) observations, one for each person aboard (passengers and crew).
A table that summarizes data for two categorical variables is called a contingency table.
Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.
Survived | 1st | 2nd | 3rd | crew | Sum |
---|---|---|---|---|---|
no | 123 | 166 | 528 | 679 | 1496 |
yes | 201 | 118 | 181 | 211 | 711 |
Sum | 324 | 284 | 709 | 890 | 2207 |
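The Sum row and Sum column are the margins of the table. A minimal sketch of how they follow from the cell counts, using only the numbers in the table (the variable names are illustrative, not from the slides):

```python
# Cell counts copied from the contingency table above.
classes = ["1st", "2nd", "3rd", "crew"]
counts = {
    "no": [123, 166, 528, 679],
    "yes": [201, 118, 181, 211],
}

# Row totals: add across the classes for each survival outcome.
row_totals = {outcome: sum(cells) for outcome, cells in counts.items()}

# Column totals: add across the survival outcomes for each class.
col_totals = {
    cls: sum(cells[i] for cells in counts.values())
    for i, cls in enumerate(classes)
}

# Grand total: every person appears in exactly one cell.
grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```

The grand total should match the number of observations in the data set.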
For those who survived, what is the proportion of passenger class?
Contingency Table:
Survived | 1st | 2nd | 3rd | crew | Sum |
---|---|---|---|---|---|
no | 123 | 166 | 528 | 679 | 1496 |
yes | 201 | 118 | 181 | 211 | 711 |
Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the row proportions:
Among survivors, the split across groups is roughly comparable: about 28% were 1st class, 17% 2nd class, 25% 3rd class, and 30% crew. Note that this considers only the cases who survived.
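Row proportions divide each cell by its row total, so each row sums to 1. A short sketch with the counts from the contingency table (names are illustrative):

```python
# Cell counts copied from the contingency table.
classes = ["1st", "2nd", "3rd", "crew"]
counts = {
    "no": [123, 166, 528, 679],
    "yes": [201, 118, 181, 211],
}

# Row proportions: each cell divided by the total of its own row.
row_props = {}
for outcome, cells in counts.items():
    row_total = sum(cells)
    row_props[outcome] = {cls: n / row_total for cls, n in zip(classes, cells)}

# The "yes" row answers: among survivors, what share came from each class?
for cls, p in row_props["yes"].items():
    print(f"{cls}: {p:.3f}")
```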
Does there appear to be a relationship between class and survival for passengers on the Titanic?
Contingency Table:
Survived | 1st | 2nd | 3rd | crew | Sum |
---|---|---|---|---|---|
no | 123 | 166 | 528 | 679 | 1496 |
yes | 201 | 118 | 181 | 211 | 711 |
Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the column proportions:
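The survival rate falls from about 62% in 1st class to 42% in 2nd class, 26% in 3rd class, and 24% among crew. The disproportionate survival of 1st class passengers suggests a relationship between class and survival.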
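Column proportions divide each cell by its column total instead. A short sketch of the survival rate within each class, again using only the table counts (names are illustrative):

```python
# Cell counts copied from the contingency table.
classes = ["1st", "2nd", "3rd", "crew"]
died = [123, 166, 528, 679]
survived = [201, 118, 181, 211]

# Column proportions: survivors divided by everyone in that class.
survival_rate = {
    cls: yes / (yes + no)
    for cls, yes, no in zip(classes, survived, died)
}

for cls, rate in survival_rate.items():
    print(f"{cls}: {rate:.3f}")
```

Unlike row proportions, these rates do not sum to 1 across classes, because each one has a different denominator.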
\(\star\) Key Idea: Row and column proportions answer different questions, and both are important. Row proportions show a roughly even split of survivors across passenger classes, while column proportions reveal a much higher survival rate in 1st class, consistent with 1st class passengers being prioritized for the lifeboats.
A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
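The difference between the two kinds of bar plot is only the height of the bars. A rough text-only sketch for the class variable, using the column totals from the Titanic contingency table (the 50-character scale is an arbitrary choice for display):

```python
# Frequencies of the class variable (column totals of the contingency table).
freq = {"1st": 324, "2nd": 284, "3rd": 709, "crew": 890}
n = sum(freq.values())

# Relative frequencies: each count divided by the number of observations.
rel_freq = {cls: count / n for cls, count in freq.items()}

# Text bar plot: bar length is proportional to the relative frequency.
for cls, p in rel_freq.items():
    bar = "#" * round(p * 50)
    print(f"{cls:>4} | {bar} {p:.3f}")
```

The bars keep the same relative lengths in both versions; only the axis scale changes from counts to proportions.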
How are bar plots different from histograms?
Bar plots are graphical representations of categorical data using rectangular bars of varying heights.
Features:
Types of Bar Plots:
Stacked bar plot: Graphical display of contingency table information, for counts.
Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency table information, for proportions.
\(\star\) Key Idea: Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.
A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.
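One common way to build a mosaic plot for two variables is to set each column's width to the share of that class and to split each column by the survival proportions within the class. The sketch below assumes that construction and uses the Titanic table counts. The area of each tile then equals the joint proportion of that cell.

```python
# Cell counts copied from the Titanic contingency table.
classes = ["1st", "2nd", "3rd", "crew"]
counts = {
    "no": [123, 166, 528, 679],
    "yes": [201, 118, 181, 211],
}
grand_total = sum(sum(cells) for cells in counts.values())

tiles = {}
for i, cls in enumerate(classes):
    class_total = counts["no"][i] + counts["yes"][i]
    width = class_total / grand_total            # marginal share of the class
    for outcome in ("no", "yes"):
        height = counts[outcome][i] / class_total  # share within the class
        tiles[(cls, outcome)] = {
            "width": width,
            "height": height,
            "area": width * height,              # joint proportion of the cell
        }

for key, tile in tiles.items():
    print(key, f"area={tile['area']:.3f}")
```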
Features:
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.
Features:
Example: Age distribution by passenger class and survival status.
\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.
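The quantities a boxplot draws can be computed directly. The sketch below uses a small set of made-up ages (hypothetical values for illustration, not taken from the Titanic data) and assumes the common 1.5 × IQR rule for flagging outliers. Quartile conventions differ between software packages, so other tools may report slightly different values.

```python
import statistics

# Hypothetical ages for illustration only; NOT from the Titanic data set.
ages = sorted([2, 19, 22, 24, 27, 29, 30, 33, 36, 40, 71])

# Quartiles using the "inclusive" method from the statistics module.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# Whiskers extend to the most extreme points that are not outliers.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, q2, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```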