MTH-361A | Spring 2025 | University of Portland
February 26, 2025
tidyverse verbs and ggplot2 geometries to visualize numerical and categorical data effectively
Chaining dplyr Verbs Using %>%
Load Packages
Define Data Frame as a Tibble
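A minimal sketch of these two steps, assuming the tidyverse is installed. The name `iris_tibble` matches the object used in the plotting example later in these notes; how the original slides defined it is an assumption.

```r
# Load the tidyverse (attaches dplyr, tibble, ggplot2, and others)
library(tidyverse)

# Convert the built-in iris data frame to a tibble
# (assumed definition of the iris_tibble object used later)
iris_tibble <- as_tibble(iris)

# Printing a tibble shows only the first rows plus column types
iris_tibble
```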
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
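One way this transformation could be chained with `%>%` is sketched below. The column name `Petal.Sepal.Ratio` and the object name `setosa_ratio` are hypothetical, chosen only for illustration.

```r
library(dplyr)

setosa_ratio <- iris %>%
  as_tibble() %>%
  # keep only the setosa observations
  filter(Species == "setosa") %>%
  # compute the petal-to-sepal length ratio (hypothetical column name)
  mutate(Petal.Sepal.Ratio = Petal.Length / Sepal.Length) %>%
  # keep the columns relevant to the ratio
  select(Species, Petal.Length, Sepal.Length, Petal.Sepal.Ratio)

setosa_ratio
```

Each verb receives the result of the previous one, so the pipeline reads top to bottom in the order the steps are applied.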
Visualizing Numerical Data using ggplot2
Example: Plotting lengths by species in the iris data set.
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
Descriptive statistics
Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.
For Numerical Variables
For Categorical Variables
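As a small illustration of the two cases, the sketch below summarizes one numerical and one categorical variable from the built-in iris data using base R. The choice of variables and the object names are assumptions made for the example.

```r
# Numerical variable: five-number summary, mean, and standard deviation
summary(iris$Sepal.Length)
sd(iris$Sepal.Length)

# Categorical variable: frequencies and relative frequencies
species_counts <- table(iris$Species)
species_props <- prop.table(species_counts)

species_counts
species_props
```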
The Titanic data set is a popular dataset in data science. It contains information about the passengers aboard the Titanic and is often used for survival prediction.
The sinking of the Titanic illustration by Willy Stöwer
The Titanic Data Matrix
The number of observations in the data set is \(2207\) passengers.
A table that summarizes data for two categorical variables is called a contingency table.
Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
Load Packages
Producing and presenting Contingency Tables in R
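The sketch below rebuilds the contingency table from the counts shown in these notes, using base R only. With a passenger-level data frame, the same table would typically come from `table()` applied to the `survived` and `class` columns; the object names here are hypothetical.

```r
# Counts copied from the contingency table in these notes
titanic_counts <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# Add row and column totals (the "Sum" margins)
titanic_table <- addmargins(as.table(titanic_counts))
titanic_table
```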
For those who survived, what is the proportion of passenger class?
Contingency Table:
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the row proportions:
Among the 711 survivors, about 28% were 1st class, 17% were 2nd class, 25% were 3rd class, and 30% were crew, so the classes are represented in broadly similar proportions. Note that this considers only those who survived.
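Row proportions divide each count by its row total. A minimal base R sketch, using the counts from the contingency table (object names are hypothetical):

```r
# Counts copied from the contingency table (without margins)
titanic_counts <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# margin = 1 divides each cell by its row total
row_props <- prop.table(titanic_counts, margin = 1)

# Class composition among survivors
round(row_props["yes", ], 3)
```

Each row of `row_props` sums to 1, so the "yes" row describes how survivors are split across classes.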
Does there appear to be a relationship between class and survival for passengers on the Titanic?
Contingency Table:
| | 1st | 2nd | 3rd | crew | Sum |
|---|---|---|---|---|---|
| no | 123 | 166 | 528 | 679 | 1496 |
| yes | 201 | 118 | 181 | 211 | 711 |
| Sum | 324 | 284 | 709 | 890 | 2207 |
To answer this question we examine the column proportions:
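About 62% of 1st class passengers survived, compared with roughly 42% of 2nd class, 26% of 3rd class, and 24% of crew. The disproportionate survival of 1st class passengers suggests a relationship between class and survival.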
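Column proportions divide each count by its column total, which gives the survival rate within each class. A minimal base R sketch, using the counts from the contingency table (object names are hypothetical):

```r
# Counts copied from the contingency table (without margins)
titanic_counts <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

# margin = 2 divides each cell by its column total
col_props <- prop.table(titanic_counts, margin = 2)

# Survival rate within each class
round(col_props["yes", ], 3)
```

Here each column of `col_props` sums to 1, so the "yes" row can be compared directly across classes.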
\(\star\) Key Idea: Row and column proportions answer different questions, and both are important. Row proportions show a near-equal passenger class split among survivors, while column proportions reveal a higher 1st class survival rate, consistent with 1st class passengers being prioritized for the lifeboats.
A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
Load Packages
Plotting Bar Plots (Raw Counts and Proportions)
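# Assumes ggplot2 and gridExtra (for grid.arrange) are loaded, and that
# COL is the color matrix from the openintro package
# bar plot of raw counts
p1 <- ggplot(titanic, aes(x = survived)) +
  geom_bar(fill = COL[1, 1]) +
  ggtitle("Frequencies") +
  theme_minimal()
# bar plot of proportions; after_stat() replaces the deprecated ..count..
p2 <- ggplot(titanic, aes(x = survived)) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = COL[1, 1]) +
  ylab("proportion") +
  ggtitle("Relative Frequencies") +
  theme_minimal()
# place the two plots side by side
grid.arrange(p1, p2, ncol = 2)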
How are bar plots different than histograms?
Bar plots are graphical representations of categorical data using rectangular bars of varying heights.
Features:
Types of Bar Plots:
Stacked bar plot: Graphical display of contingency table information, for counts.
Side-by-side bar plot: Displays the same information by placing bars next to, instead of on top of, each other.
Standardized stacked bar plot: Graphical display of contingency table information, for proportions.
The following bar plots still use the geom_bar() layer, but with the position parameter set.
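The three bar plot types correspond to three values of `position` in `geom_bar()`. The sketch below is self-contained: it expands the contingency table counts into one row per passenger in a hypothetical data frame `titanic_demo`, which stands in for the `titanic` data used in these notes.

```r
library(ggplot2)

# Counts copied from the contingency table in these notes
titanic_counts <- data.frame(
  class = rep(c("1st", "2nd", "3rd", "crew"), times = 2),
  survived = rep(c("no", "yes"), each = 4),
  n = c(123, 166, 528, 679, 201, 118, 181, 211)
)

# Expand to one row per passenger (hypothetical stand-in for titanic)
titanic_demo <- titanic_counts[
  rep(seq_len(nrow(titanic_counts)), titanic_counts$n),
  c("class", "survived")
]

base_plot <- ggplot(titanic_demo, aes(x = class, fill = survived))

# Stacked: counts stacked on top of each other
p_stacked <- base_plot + geom_bar(position = "stack")
# Side-by-side: counts placed next to each other
p_side <- base_plot + geom_bar(position = "dodge")
# Standardized stacked: each bar rescaled to proportions
p_standardized <- base_plot + geom_bar(position = "fill") + ylab("proportion")
```

Printing any of the three objects draws the corresponding plot; `position = "stack"` is the default for `geom_bar()`.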
\(\star\) Key Idea: Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.
A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.
Features:
Load Packages
Visualizing Categorical Variables using a Mosaic Plot
# geom_mosaic(), product(), and theme_mosaic() come from the ggmosaic package
ggplot(titanic) +
  geom_mosaic(aes(x = product(class, survived), fill = survived)) +
  scale_fill_brewer(palette = 1) +
  theme_mosaic() +
  theme(legend.position = "none")
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.
Features:
Load Packages
Visualizing Categorical Variables using Pie Charts
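ggplot2 has no dedicated pie geometry; a common approach is to draw a single stacked bar and switch to polar coordinates. The sketch below applies that idea to the survival totals from the contingency table. It is an illustration under that assumption, not necessarily the chart shown in the original slides, and the object names are hypothetical.

```r
library(ggplot2)

# Survival totals copied from the contingency table in these notes
survival_counts <- data.frame(
  survived = c("no", "yes"),
  n = c(1496, 711)
)
survival_counts$prop <- survival_counts$n / sum(survival_counts$n)

# A single stacked bar wrapped around a circle becomes a pie chart
p_pie <- ggplot(survival_counts, aes(x = "", y = prop, fill = survived)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

p_pie
```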
Example: Age distribution by passenger class and survival status.
\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.
Load Packages
Visualizing Multiple Variables
.Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.