MTH-391A | Spring 2025 | University of Portland
February 12, 2025
tidyverse verbs and ggplot2 geometries to visualize numerical and categorical data effectively
Chaining dplyr Verbs Using |>
Load Packages
Define Data Frame as a Tibble
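The code for these two steps did not survive in these notes. A minimal sketch, assuming the tidyverse meta-package is installed and that the tibble is called `iris_tibble` (the name used in the plotting example later in these notes), might look like this:

```r
# Load the tidyverse (attaches dplyr, ggplot2, tibble, readr, among others)
library(tidyverse)

# Convert the built-in iris data frame to a tibble
iris_tibble <- as_tibble(iris)

# Printing a tibble shows only the first rows plus each column's type
iris_tibble
```

A tibble behaves like a data frame in dplyr and ggplot2 calls, so the rest of the examples work the same way on either.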
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
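One way to write this pipeline is sketched below. The column name `petal_sepal_ratio` and the result name `setosa_ratio` are hypothetical, chosen only for illustration; the original slide may have used different names or a different order of verbs.

```r
library(dplyr)

# Built-in iris data as a tibble
iris_tibble <- tibble::as_tibble(iris)

setosa_ratio <- iris_tibble |>
  # keep only the setosa observations
  filter(Species == "setosa") |>
  # add the ratio of petal length to sepal length (hypothetical column name)
  mutate(petal_sepal_ratio = Petal.Length / Sepal.Length) |>
  # keep the columns relevant to the ratio
  select(Species, Sepal.Length, Petal.Length, petal_sepal_ratio)

setosa_ratio
```

Each verb receives the output of the previous one through `|>`, so the pipeline reads top to bottom in the order the steps are applied.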
Visualizing Numerical Data using ggplot2
Example: Plotting lengths by species in the iris data set.
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
The Titanic data set is a popular data set in data science. It contains information about the passengers and crew aboard the Titanic and is often used for survival prediction.
# The titanic.csv data is from https://www.kaggle.com/datasets/aliaamiri/titanic-passengers-and-crew-complete-list
titanic <- read_csv("titanic.csv") |>
  mutate(age = round(age, 0),
         # collapse the four crew categories of `class` into a single "crew" level using `case_when`
         class = case_when(
           class == "deck crew" ~ "crew",
           class == "engineering crew" ~ "crew",
           class == "restaurant staff" ~ "crew",
           class == "victualling crew" ~ "crew",
           TRUE ~ class
         ))
glimpse(titanic)
## Rows: 2,207
## Columns: 11
## $ name <chr> "Abbing, Mr. Anthony", "Abbott, Mr. Eugene Joseph", "Abbott, …
## $ gender <chr> "male", "male", "male", "female", "female", "male", "male", "…
## $ age <dbl> 42, 13, 16, 39, 16, 25, 30, 28, 27, 20, 30, 27, 40, 1, 18, 35…
## $ class <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "2nd", "2nd", "3rd"…
## $ embarked <chr> "S", "S", "S", "S", "S", "S", "C", "C", "C", "S", "S", "S", "…
## $ country <chr> "United States", "United States", "United States", "England",…
## $ ticketno <dbl> 5547, 2673, 2673, 2673, 348125, 348122, 3381, 3381, 2699, 310…
## $ fare <dbl> 7.1100, 20.0500, 20.0500, 20.0500, 7.1300, 7.1300, 24.0000, 2…
## $ sibsp <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ parch <dbl> 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ survived <chr> "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "y…
To present tables effectively, we can use kable from the knitr package.
Load Package
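The code for this step is missing from the notes. A minimal sketch, assuming knitr is installed, loads the package and renders a small table from the built-in iris data; the caption text is only an illustration.

```r
library(knitr)

# kable() turns a data frame (or matrix/table) into a formatted table
tbl <- kable(head(iris, 3), caption = "First three rows of iris")
tbl
```

In an R Markdown document the table is rendered in the output format of the document (for example HTML or PDF) rather than as console text.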
Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.
titanic_surv_class <- titanic |>
  select(survived, class) |>
  group_by(survived, class) |>
  summarise(total = n(),
            .groups = "drop")
kable(xtabs(total ~ survived + class, titanic_surv_class))
survived | 1st | 2nd | 3rd | crew |
---|---|---|---|---|
no | 123 | 166 | 528 | 679 |
yes | 201 | 118 | 181 | 211 |
A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.
p1 <- ggplot(titanic,aes(x=survived)) +
geom_bar() +
ggtitle("Frequencies") +
theme_minimal()
p2 <- ggplot(titanic,aes(x=survived)) +
  # after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.4.0)
  geom_bar(aes(y = after_stat(count / sum(count)))) +
ylab("proportion") +
ggtitle("Relative Frequencies") +
theme_minimal()
# grid.arrange() comes from the gridExtra package, which must be loaded first
grid.arrange(p1, p2, ncol=2)
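The heights in the relative frequency bar plot can be checked by hand. The sketch below takes the survival counts from the contingency table shown earlier (summing each row across classes) and divides by the total; the object names are hypothetical.

```r
# Survival counts, obtained by summing each row of the contingency table
counts <- c(no  = 123 + 166 + 528 + 679,
            yes = 201 + 118 + 181 + 211)

# Relative frequency = count / total number of observations
rel_freq <- counts / sum(counts)

# Roughly 0.678 for "no" and 0.322 for "yes"
round(rel_freq, 3)
```

The total, 2,207, matches the number of rows reported by `glimpse(titanic)`, and the two proportions sum to 1.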
p1 <- ggplot(titanic,aes(x=class,fill=survived)) +
geom_bar() +
scale_fill_manual(values=c("yes" = "blue", "no" = "red")) +
ggtitle("Stacked") +
theme_minimal()
p2 <- ggplot(titanic,aes(x=class,fill=survived)) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("yes" = "blue", "no" = "red")) +
ggtitle("Side-by-Side") +
theme_minimal()
p3 <- ggplot(titanic,aes(x=class,fill=survived)) +
geom_bar(position="fill") +
scale_fill_manual(values=c("yes" = "blue", "no" = "red")) +
ggtitle("Standardized") +
theme_minimal()
grid.arrange(p1, p2, p3, ncol=3)
\(\star\) Key Idea: Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.
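The survival rates that the standardized plot displays can also be computed directly. This is a sketch using base R's `prop.table()` on the counts from the contingency table shown earlier; the matrix name `surv_class` is hypothetical.

```r
# Counts copied from the contingency table (rows: survived, columns: class)
surv_class <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class    = c("1st", "2nd", "3rd", "crew"))
)

# margin = 2 divides each column by its column total, which mirrors
# what position = "fill" does within each bar
rates <- prop.table(surv_class, margin = 2)

# Survival rate per class: about 0.620, 0.415, 0.255, 0.237
round(rates["yes", ], 3)
```

Read this way, the standardized plot shows that first-class passengers survived at a much higher rate than third-class passengers or crew, even though the crew bar is the tallest in the stacked plot.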
A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.
ggplot(titanic) +
geom_mosaic(aes(x=product(class,survived), fill = survived)) +
scale_fill_brewer(palette = 1) +
theme_mosaic() +
theme(legend.position = "none")
Features:
A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.
Features:
# geom_bar() counts the rows in each survival group, so slice sizes reflect the
# number of people; coord_polar("y") turns the single stacked bar into a pie
p1 <- ggplot(titanic |> filter(class=="1st"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) +
  ggtitle("1st Class") +
  theme_void()
p2 <- ggplot(titanic |> filter(class=="2nd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) +
  ggtitle("2nd Class") +
  theme_void()
p3 <- ggplot(titanic |> filter(class=="3rd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) +
  ggtitle("3rd Class") +
  theme_void()
p4 <- ggplot(titanic |> filter(class=="crew"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) +
  ggtitle("Crew") +
  theme_void()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
Example: Age distribution among class passengers and survival.
ggplot(titanic,aes(y=class,x=age, fill=survived)) +
geom_boxplot() +
scale_fill_brewer(palette = "Set2") +
theme_minimal()
Example: Fare versus age, by survival and gender.
ggplot(titanic, aes(y=fare, x=age, color=survived, shape=gender)) +
  geom_point() +
  # the points use the color aesthetic, so the palette must be set on the color scale
  scale_color_brewer(palette = "Set2") +
  theme_minimal()
.Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.