Exploring Categorical Data

Applied Statistics

MTH-361A | Spring 2026 | University of Portland

Objectives

Basics of Exploring Categorical Data

Variables

Visualization Techniques

Titanic Survivors

The Titanic data set is a popular data set in data science. It contains information about the people aboard the Titanic and is often used for survival prediction.

The sinking of the Titanic illustration by Willy Stöwer

The Titanic Data Matrix

The data set contains \(2207\) observations, one for each person aboard (passengers and crew).

Contingency Tables

A table that summarizes data for two categorical variables is called a contingency table.

Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

Contingency Tables: Using R

Load Packages

library(tidyverse)
library(knitr)

Producing and presenting Contingency Tables in R

# create a new dataframe to store frequencies
titanic_surv_class <- titanic %>% 
  # select variables
  select(survived,class) %>% 
  # group observations by categories
  group_by(survived,class) %>% 
  # count number of observations by categories
  summarise(total = n(),
            .groups = 'drop')

# visualize contingency table
kable(addmargins(xtabs(total ~ survived + class, titanic_surv_class)))

Row Proportions

Among those who survived, what proportion belonged to each passenger class?

Contingency Table:

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

To answer this question we examine the row proportions:

Among survivors, no single class dominates: about 28% were 1st class, 17% were 2nd class, 25% were 3rd class, and 30% were crew. Note that these proportions consider only the people who survived.
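The row proportions can be reproduced directly from the counts in the contingency table. The sketch below is a minimal illustration, assuming the counts are typed in by hand as a matrix; the object names `counts` and `row_props` are hypothetical and not part of the original analysis. `prop.table()` with `margin = 1` divides each cell by its row total.

```r
# counts copied from the contingency table (rows: survived, columns: class)
counts <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# row proportions: each cell divided by its row total
row_props <- prop.table(counts, margin = 1)

# display the proportions rounded to three decimals
round(row_props, 3)
```

Each row of `row_props` sums to 1, so the "yes" row reads as the class breakdown of the survivors only.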

Column Proportions

Does there appear to be a relationship between class and survival for passengers on the Titanic?

Contingency Table:

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

To answer this question we examine the column proportions:

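About 62% of 1st class passengers survived, compared with 42% of 2nd class, 26% of 3rd class, and 24% of the crew. The disproportionate survival of 1st class passengers suggests a relationship between class and survival.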
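The column proportions can be computed from the counts in the contingency table. The sketch below is a minimal illustration, assuming the counts are entered by hand as a matrix; the object names `counts` and `col_props` are hypothetical. `prop.table()` with `margin = 2` divides each cell by its column total, so the "yes" row gives the survival rate within each class.

```r
# counts copied from the contingency table (rows: survived, columns: class)
counts <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# column proportions: each cell divided by its column total
col_props <- prop.table(counts, margin = 2)

# survival rate within each class, rounded to three decimals
round(col_props["yes", ], 3)
```

Each column of `col_props` sums to 1, which is why these values can be compared across classes even though the classes differ greatly in size.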

\(\star\) Row and column proportions answer different questions, and both are important. Row proportions show that survivors were spread across all classes (no class accounts for more than about 30% of survivors), while column proportions give the survival rate within each class. The column proportions reveal a much higher survival rate in 1st class, consistent with 1st class passengers being prioritized for the lifeboats.

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

Bar Plots: Using R

Load Packages

library(tidyverse)
library(gridExtra)

Plotting Bar Plots (raw counts and proportions)

# create barplot 1 with raw counts
p1 <- ggplot(
  # dataframe
  titanic,
  # aesthetics
  aes(
    # x-axis using a categorical variable
    x = survived
    )
  ) + 
  # draw barplot
  geom_bar() + 
  # relabel title
  ggtitle("Frequencies")

# create barplot 2 with relative frequencies
p2 <- ggplot(
  # dataframe
  titanic,
  # aesthetics
  aes(
    # x-axis using a categorical variable
    x = survived
    )
  ) + 
  # draw barplot
  geom_bar(
    # aesthetics
    aes(
      # y-axis with proportions computed from the counts
      # (after_stat() replaces the deprecated ..count.. notation)
      y = after_stat(count / sum(count)))
    ) + 
  # relabel y-axis
  ylab("proportion") + 
  # relabel title
  ggtitle("Relative Frequencies")

# visualize barplots as subplots
grid.arrange(p1, p2, ncol=2)

Bar Plots vs Histograms

How are bar plots different from histograms?

Bar Plots with Two Variables

Bar plots are graphical representations of categorical data using rectangular bars of varying heights.

Features:

Types of Bar Plots:

Examples of Bar Plots

The following bar plots still use the geom_bar() layer, but with its position parameter set explicitly.

\(\star\) Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.

Bar Plots with Categories: Using R

Load Packages

library(tidyverse)
library(gridExtra)

Plotting Bar Plots (with categories)

# create a side-by-side barplot of class, filled by survival
p6 <- ggplot(
  # dataframe
  titanic,
  # aesthetics
  aes(
    # x-axis using a categorical variable
    x = class,
    # fill the bars using a categorical variable
    fill = survived
    )
  ) + 
  # draw barplot
  geom_bar(
    # set barplot type 
    position="dodge"
    ) + 
  # relabel title
  ggtitle("Side-by-Side (position='dodge')")

# visualize barplot
p6
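The side-by-side example uses position="dodge"; the other two types named earlier (stacked and standardized) differ only in the position value. The sketch below is a hedged illustration, assuming a small stand-in data frame of counts, `titanic_counts` (a hypothetical name), typed in from the contingency table so that it runs without the `titanic` data frame. Because the counts are already tabulated, it uses geom_col() instead of geom_bar(); the position values work the same way in both layers.

```r
library(ggplot2)

# stand-in data: counts from the contingency table, one row per cell
titanic_counts <- data.frame(
  class    = rep(c("1st", "2nd", "3rd", "crew"), times = 2),
  survived = rep(c("no", "yes"), each = 4),
  n        = c(123, 166, 528, 679, 201, 118, 181, 211)
)

# stacked: bar height is the class total, split by survival
p_stack <- ggplot(titanic_counts, aes(x = class, y = n, fill = survived)) +
  geom_col(position = "stack") +
  ggtitle("Stacked (position='stack')")

# standardized: every bar is rescaled to height 1, showing proportions
p_fill <- ggplot(titanic_counts, aes(x = class, y = n, fill = survived)) +
  geom_col(position = "fill") +
  ylab("proportion") +
  ggtitle("Standardized (position='fill')")

# visualize barplots
p_stack
p_fill
```

The standardized version is the graphical counterpart of the column proportions: within each class bar, the "yes" segment is that class's survival rate.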

Mosaic Plots

A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.

Features:

Mosaic Plots: Using R

Load Packages

library(tidyverse)
library(ggmosaic)

Visualizing Categorical Variables using a Mosaic Plot

# create a mosaic plot
p4 <- ggplot(titanic) + 
  # draw mosaic plot
  geom_mosaic(aes(x=product(class,survived), fill = survived)) + 
  # use the mosaic theme layer
  theme_mosaic() + 
  # remove the legend
  theme(legend.position = "none")

# visualize mosaic plot
p4
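As a cross-check that does not depend on ggmosaic, base R's mosaicplot() can draw a mosaic directly from a table of counts. This is a different tool from the geom_mosaic() layer used above, shown only as a sketch; the matrix `counts` is a hypothetical stand-in typed in from the contingency table, with class as the first dimension.

```r
# counts from the contingency table (rows: class, columns: survived)
counts <- matrix(
  c(123, 201,
    166, 118,
    528, 181,
    679, 211),
  nrow = 4, byrow = TRUE,
  dimnames = list(class = c("1st", "2nd", "3rd", "crew"),
                  survived = c("no", "yes"))
)

# column widths follow class sizes; tile heights follow the survival split
mosaicplot(counts, main = "Survival by class", color = TRUE)
```

Because tile areas are proportional to the cell counts, the plot shows both how large each class is and how survival splits within it.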

Pie Charts

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.

Features:

Pie Charts: Using R

Load Packages

library(tidyverse)

Visualizing Categorical Variables using Pie Charts

# subset the data
titanic_1st <- titanic %>% 
  # choose observations that belong in the 1st class
  filter(class=="1st")

# create a pie chart
p5 <- ggplot(
  # dataframe
  titanic_1st, 
  # aesthetics
  aes(
    # let x-axis be a blank character so all observations share one bar
    x="", 
    # fill the geometry using a categorical variable
    fill=survived
    )
  ) +
  # draw a stacked barplot of counts (geom_bar counts observations by default)
  geom_bar(
    # set bar width
    width=1
    ) +
  # convert barplot to polar coordinates
  coord_polar(
    # use the y-axis
    "y",
    # starting radians for rotation
    start=0
    ) + 
  # relabel title
  ggtitle("1st Class") + 
  # use the void theme
  theme_void()

# visualize pie chart
p5

Comparing Categorical Data to Numerical Data

Example: Age distribution by passenger class and survival status.

\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.

Comparing Categorical Data to Numerical Data: Using R

Load Packages

library(tidyverse)

Visualizing Multiple Variables

p6 <- ggplot(
  # dataframe
  titanic,
  # aesthetics
  aes(
    # x-axis using a numerical variable
    x = age,
    # y-axis using a categorical variable
    y = class,
    # fill the boxplot using a categorical variable
    fill = survived
    )
  ) + 
  # draw boxplot
  geom_boxplot()

# visualize boxplot
p6