MTH-361A | Spring 2026 | University of Portland
tidyverse verbs and ggplot2 geometries to visualize numerical and categorical data effectively
Variables
Visualization Techniques
The Titanic data set is a popular data set in data science. It contains information about the people aboard the Titanic and is often used for survival prediction.
The sinking of the Titanic illustration by Willy Stöwer
The Titanic Data Matrix
The number of observations in the data set is \(2207\) passengers.
A table that summarizes data for two categorical variables is called a contingency table.
Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
Load Packages
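A minimal setup sketch. The package list is inferred from the functions called in the code on these slides (`%>%`, `select()`, `ggplot()`, `kable()`, `grid.arrange()`), so treat it as an assumption rather than the original setup chunk. The source of the `titanic` data frame is not stated here, so it is left unloaded.

```r
# Packages inferred from the functions used in these slides (an assumption)
library(tidyverse)   # dplyr verbs (select, group_by, summarise), %>%, ggplot2
library(knitr)       # kable() for rendering tables
library(gridExtra)   # grid.arrange() for arranging plots side by side

# The `titanic` data frame is assumed to be available already;
# its source package is not stated in these slides.
```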
Producing and presenting Contingency Tables in R
# create a new dataframe to store frequencies
titanic_surv_class <- titanic %>%
# select variables
select(survived,class) %>%
# group observations by categories
group_by(survived,class) %>%
# count number of observations by categories
summarise(total = n(),
.groups = 'drop')
# visualize contingency table
kable(addmargins(xtabs(total ~ survived + class, titanic_surv_class)))
Among those who survived, what proportion belonged to each passenger class?
Contingency Table:
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the row proportions:
Among survivors, the classes are represented in fairly similar proportions: about 28% were 1st class, 17% 2nd class, 25% 3rd class, and 30% crew. Note that this considers only the people who survived.
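A minimal sketch of how the row proportions could be computed in base R. The counts are entered by hand from the contingency table, and the names `titanic_tab` and `row_props` are illustrative, not taken from the original slides.

```r
# Counts entered by hand from the contingency table (rows: survived, columns: class)
titanic_tab <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# Row proportions: each cell is divided by its row total (margin = 1)
row_props <- prop.table(titanic_tab, margin = 1)

# Round for display; every row sums to 1
round(row_props, 3)
```

The "yes" row gives the class split among survivors, for example 201/711 for 1st class.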
Does there appear to be a relationship between class and survival for passengers on the Titanic?
Contingency Table:
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the column proportions:
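About 62% of 1st class passengers survived, compared with roughly 42% of 2nd class, 26% of 3rd class, and 24% of crew. The disproportionate survival of 1st class passengers suggests a relationship between class and survival.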
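A minimal sketch of the column proportions in base R, again with the counts entered by hand from the contingency table. The names `titanic_tab` and `col_props` are illustrative.

```r
# Counts entered by hand from the contingency table (rows: survived, columns: class)
titanic_tab <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# Column proportions: each cell is divided by its column total (margin = 2)
col_props <- prop.table(titanic_tab, margin = 2)

# Round for display; every column sums to 1
round(col_props, 3)
```

The "yes" row now reads as a survival rate within each class, for example 201/324 for 1st class.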
\(\star\) Row and column proportions answer different questions, and both are important. Row proportions show a fairly even split of passenger classes among survivors, while column proportions reveal a markedly higher survival rate for 1st class, consistent with 1st class passengers being prioritized for the lifeboats.
A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
Load Packages
Plotting Bar Plots (raw counts and proportions)
# create barplot 1 with raw counts
p1 <- ggplot(
# dataframe
titanic,
# aesthetics
aes(
# x-axis using a categorical variable
x = survived
)
) +
# draw barplot
geom_bar() +
# relabel title
ggtitle("Frequencies")
# create barplot 2 with relative frequencies
p2 <- ggplot(
# dataframe
titanic,
# aesthetics
aes(
# x-axis using a categorical variable
x = survived
)
) +
# draw barplot
geom_bar(
# aesthetics
aes(
# y-axis with computed proportions (after_stat() replaces the deprecated ..count.. notation)
y = after_stat(count / sum(count)))
) +
# relabel y-axis
ylab("proportion") +
# relabel title
ggtitle("Relative Frequencies")
# visualize barplots as subplots
grid.arrange(p1, p2, ncol=2)
How are bar plots different from histograms?
Bar plots are graphical representations of categorical data using rectangular bars of varying heights.
Features:
Types of Bar Plots:
Stacked bar plot: Graphical display of contingency table information, for counts.
Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency table information, for proportions.
The following bar plots still use the geom_bar() layer,
but with the position parameter set explicitly.
\(\star\) Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.
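A minimal sketch of the stacked and standardized variants. To keep it self-contained, the counts are entered by hand from the contingency table and drawn with `geom_col()`, which takes precomputed heights; with the raw `titanic` data frame, `geom_bar()` accepts the same `position` values. The names `titanic_counts`, `p_stack`, and `p_fill` are illustrative.

```r
library(ggplot2)

# Counts entered by hand from the contingency table in these slides
titanic_counts <- data.frame(
  class = rep(c("1st", "2nd", "3rd", "crew"), times = 2),
  survived = rep(c("no", "yes"), each = 4),
  total = c(123, 166, 528, 679, 201, 118, 181, 211)
)

# Stacked bar plot: bar heights are counts, segments stacked by survival
p_stack <- ggplot(titanic_counts, aes(x = class, y = total, fill = survived)) +
  geom_col(position = "stack") +
  ggtitle("Stacked (position='stack')")

# Standardized stacked bar plot: each bar is rescaled to sum to 1
p_fill <- ggplot(titanic_counts, aes(x = class, y = total, fill = survived)) +
  geom_col(position = "fill") +
  ylab("proportion") +
  ggtitle("Standardized (position='fill')")
```

Printing `p_stack` or `p_fill` draws the corresponding plot.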
Load Packages
Plotting Bar Plots (with categories)
# create side-by-side barplot with raw counts
p6 <- ggplot(
# dataframe
titanic,
# aesthetics
aes(
# x-axis using a categorical variable
x = class,
# fill the bars using a categorical variable
fill = survived
)
) +
# draw barplot
geom_bar(
# set barplot type
position="dodge"
) +
# relabel title
ggtitle("Side-by-Side (position='dodge')")
# visualize barplot
p6
A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.
Features:
Load Packages
Visualizing Categorical Variables using a Mosaic Plot
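A minimal sketch of a mosaic plot using base R's `mosaicplot()`. This is a stand-in chosen for self-containment; the original slides may have used a different package. The counts are entered by hand from the contingency table, and the name `class_by_survival` is illustrative.

```r
# Counts entered by hand from the contingency table (rows: class, columns: survived)
class_by_survival <- as.table(matrix(
  c(123, 201,
    166, 118,
    528, 181,
    679, 211),
  nrow = 4, byrow = TRUE,
  dimnames = list(class = c("1st", "2nd", "3rd", "crew"),
                  survived = c("no", "yes"))
))

# Tile widths reflect class sizes; the split within each tile reflects survival
mosaicplot(class_by_survival, main = "Survival by class", color = TRUE)
```

Putting class in the first dimension makes tile width show how many people were in each class, so survival rates can be compared by the vertical split within each column.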
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.
Features:
Load Packages
Visualizing Categorical Variables using Pie Charts
# subset the data
titanic_1st <- titanic %>%
# choose observations that belong in the 1st class
filter(class=="1st")
# create a pie chart
p5 <- ggplot(
# dataframe
titanic_1st,
# aesthetics
aes(
# let x-axis be a blank character so all observations share one bar
x="",
# fill the geometry using a categorical variable
fill=survived
)
) +
# draw a single stacked bar of counts (default stat="count" tallies each category)
geom_bar(
# set bar width
width=1
) +
# convert barplot to polar coordinates
coord_polar(
# use the y-axis
"y",
# starting radians for rotation
start=0
) +
# relabel title
ggtitle("1st Class") +
# use the void theme
theme_void()
# visualize pie chart
p5
Example: Age distribution by passenger class and survival.
\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.
Load Packages
Visualizing Multiple Variables