Considering Categorical Data

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

February 10, 2025

Objectives

These slides are derived from Diez et al. (2012).

Previously… (1/3)

Data Matrix

Data collected from students in a statistics class, covering a variety of variables:

Example Data Matrix

Previously… (2/3)

Exploratory Analysis

It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.

Previously… (3/3)

Descriptive statistics

It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

For Categorical Variables

Case Study 1: Titanic Data Set

The Titanic Data Set is a popular dataset in data science. It contains information about the people aboard the Titanic and is often used for survival prediction.

The sinking of the Titanic illustration by Willy Stöwer

The Titanic Data Matrix

The data set contains \(2207\) observations, one for each person aboard (passengers and crew).

Contingency Tables

A table that summarizes data for two categorical variables is called a contingency table.

Example: The contingency table below shows counts of survival (rows) by passenger class, including crew (columns), for the people aboard the Titanic.

Survived   1st   2nd   3rd  crew   Sum
no         123   166   528   679  1496
yes        201   118   181   211   711
Sum        324   284   709   890  2207
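
As a minimal sketch of how such a table arises from a data matrix, the Python snippet below expands the counts above into one record per person and then tabulates them again, including the row and column sums. The variable names (`counts`, `records`, `cell`) are illustrative assumptions, not part of the original slides, and the record order is arbitrary.

```python
from collections import Counter

# Cell counts copied from the contingency table above: (survived, class) -> count.
counts = {
    ("no", "1st"): 123, ("no", "2nd"): 166, ("no", "3rd"): 528, ("no", "crew"): 679,
    ("yes", "1st"): 201, ("yes", "2nd"): 118, ("yes", "3rd"): 181, ("yes", "crew"): 211,
}

# Expand the counts into one (survived, class) record per person,
# mimicking two columns of a data matrix.
records = [key for key, n in counts.items() for _ in range(n)]

# Tabulate: count each (survived, class) pair, then each margin.
cell = Counter(records)
row_sum = Counter(survived for survived, _ in records)
col_sum = Counter(cls for _, cls in records)

# Print the table with its margins.
classes = ["1st", "2nd", "3rd", "crew"]
print(f"{'':>4}" + "".join(f"{c:>6}" for c in classes) + f"{'Sum':>6}")
for s in ["no", "yes"]:
    print(f"{s:>4}" + "".join(f"{cell[(s, c)]:>6}" for c in classes) + f"{row_sum[s]:>6}")
print(f"{'Sum':>4}" + "".join(f"{col_sum[c]:>6}" for c in classes) + f"{len(records):>6}")
```

Each interior cell counts the people who share one combination of the two categorical variables; the margins count one variable while ignoring the other.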

Row Proportions

Among those who survived, what proportion belonged to each passenger class?

Contingency Table:

Survived   1st   2nd   3rd  crew   Sum
no         123   166   528   679  1496
yes        201   118   181   211   711
Sum        324   284   709   890  2207

To answer this question we examine the row proportions:

Among the 711 survivors, about 28% were 1st class, 17% were 2nd class, 25% were 3rd class, and 30% were crew, so the classes are represented in roughly similar proportions. Note that this considers only the people who survived.
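
The row proportions can be reproduced with a short Python sketch: each cell is divided by its row total, so every row sums to 1. The `table` dictionary and the helper name `row_proportions` are assumptions made for this illustration.

```python
# Counts from the contingency table: survived -> class -> count.
table = {
    "no":  {"1st": 123, "2nd": 166, "3rd": 528, "crew": 679},
    "yes": {"1st": 201, "2nd": 118, "3rd": 181, "crew": 211},
}

def row_proportions(table):
    """Divide each cell by its row total, so every row sums to 1."""
    result = {}
    for row, cells in table.items():
        total = sum(cells.values())
        result[row] = {col: n / total for col, n in cells.items()}
    return result

row_props = row_proportions(table)

# Class make-up of the survivors ("yes" row).
for cls, p in row_props["yes"].items():
    print(f"{cls:>4}: {p:.3f}")
```

For the "yes" row this gives 201/711, 118/711, 181/711, and 211/711, which is the class make-up of the survivors only.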

Column Proportions

Does there appear to be a relationship between class and survival for passengers on the Titanic?

Contingency Table:

Survived   1st   2nd   3rd  crew   Sum
no         123   166   528   679  1496
yes        201   118   181   211   711
Sum        324   284   709   890  2207

To answer this question we examine the column proportions:

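About 62% of 1st class passengers survived, compared with about 42% of 2nd class, 26% of 3rd class, and 24% of crew. The disproportionate survival of 1st class passengers suggests a relationship between class and survival.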
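
A matching sketch for the column proportions divides each cell by its column total instead, so every column sums to 1. As before, `table` and `column_proportions` are hypothetical names chosen for this example.

```python
# Counts from the contingency table: survived -> class -> count.
table = {
    "no":  {"1st": 123, "2nd": 166, "3rd": 528, "crew": 679},
    "yes": {"1st": 201, "2nd": 118, "3rd": 181, "crew": 211},
}

def column_proportions(table):
    """Divide each cell by its column total, so every column sums to 1."""
    col_totals = {}
    for cells in table.values():
        for col, n in cells.items():
            col_totals[col] = col_totals.get(col, 0) + n
    return {
        row: {col: n / col_totals[col] for col, n in cells.items()}
        for row, cells in table.items()
    }

col_props = column_proportions(table)

# Survival rate within each class ("yes" row of the column proportions).
for cls, p in col_props["yes"].items():
    print(f"{cls:>4}: {p:.3f}")
```

Here the "yes" row holds the survival rate within each class (201/324, 118/284, 181/709, 211/890), which is the quantity to compare when asking whether class and survival are related.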

\(\star\) Key Idea: Row and column proportions answer different questions, and both are important. Row proportions show a roughly similar split of passenger classes among survivors, while column proportions reveal a higher survival rate for 1st class, consistent with 1st class passengers being prioritized for the lifeboats.

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
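
To illustrate the difference between the two kinds of bar plot, the sketch below draws a plain-text bar plot of passenger class twice: once from frequencies and once from relative frequencies. The class totals come from the Titanic contingency table; the helper `text_bar_plot` and its `width` parameter are assumptions for this example, standing in for a real plotting library.

```python
# Class totals taken from the "Sum" row of the Titanic contingency table.
class_counts = {"1st": 324, "2nd": 284, "3rd": 709, "crew": 890}

# Relative frequency = count / total number of observations.
total = sum(class_counts.values())
rel_freq = {cls: n / total for cls, n in class_counts.items()}

def text_bar_plot(values, width=40):
    """Return one text line per category; bar length is scaled to the largest value."""
    largest = max(values.values())
    lines = []
    for label, v in values.items():
        bar = "#" * round(width * v / largest)
        lines.append(f"{label:>4} | {bar} {v:.3g}")
    return lines

print("Frequency bar plot")
print("\n".join(text_bar_plot(class_counts)))
print()
print("Relative frequency bar plot")
print("\n".join(text_bar_plot(rel_freq)))
```

The bars have the same shape in both plots; only the scale of the axis changes, from counts to proportions that sum to 1.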

Bar Plots vs Histograms

How are bar plots different from histograms?

Bar Plots with Two Variables

Bar plots are graphical representations of categorical data using rectangular bars of varying heights.

Features:

Types of Bar Plots:

Examples of Bar Plots

\(\star\) Key Idea: Each visualization provides a different perspective. A stacked bar plot helps compare total passengers across classes; a side-by-side bar plot helps compare survival counts directly; a standardized bar plot helps compare survival rates across classes.

Mosaic Plots

A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.
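
The geometry of a mosaic plot can be sketched in a few lines of Python using the Titanic counts. This example assumes one common layout: class along the horizontal axis, with each column's width proportional to the class total, and each column split vertically by survival. The function name `mosaic_tiles` and the tuple layout are hypothetical.

```python
# Counts from the Titanic contingency table: survived -> class -> count.
table = {
    "no":  {"1st": 123, "2nd": 166, "3rd": 528, "crew": 679},
    "yes": {"1st": 201, "2nd": 118, "3rd": 181, "crew": 211},
}

def mosaic_tiles(table):
    """Return (class, survived, x0, y0, width, height) tiles inside the unit square."""
    classes = list(next(iter(table.values())))
    grand_total = sum(sum(cells.values()) for cells in table.values())
    tiles = []
    x0 = 0.0
    for cls in classes:
        col_total = sum(table[row][cls] for row in table)
        width = col_total / grand_total           # share of all people in this class
        y0 = 0.0
        for row in table:
            height = table[row][cls] / col_total  # column proportion within the class
            tiles.append((cls, row, x0, y0, width, height))
            y0 += height
        x0 += width
    return tiles

tiles = mosaic_tiles(table)
for cls, survived, x0, y0, width, height in tiles:
    print(f"{cls:>4} {survived:>3}: width={width:.3f} height={height:.3f} area={width * height:.3f}")
```

Under this layout, each tile's area equals the share of all 2207 people who fall in that class and survival combination, so the tile areas sum to 1.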

Features:

Pie Charts

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.
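
As a small worked example, each slice of a pie chart spans an angle equal to its proportion of the whole times 360 degrees. The sketch below applies this to the Titanic class totals; `slice_angles` is a hypothetical helper name.

```python
# Class totals taken from the "Sum" row of the Titanic contingency table.
class_counts = {"1st": 324, "2nd": 284, "3rd": 709, "crew": 890}

def slice_angles(counts):
    """Convert counts to pie-slice angles in degrees (proportion of the whole x 360)."""
    total = sum(counts.values())
    return {label: 360 * n / total for label, n in counts.items()}

angles = slice_angles(class_counts)
for label, degrees in angles.items():
    print(f"{label:>4}: {degrees:6.1f} degrees")
```

The angles always sum to 360 degrees, which is why a pie chart can only show parts of a single whole.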

Features:

Comparing Categorical Data to Numerical Data

Example: Age distribution by passenger class and survival status.

\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.

Activity: Interpret Row and Column Proportions

  1. Make sure you have a copy of the M 2/10 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
  2. Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
  3. Get together with another student.
  4. Discuss your results.
  5. Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/