Previously… (1/2)

Chaining dplyr Verbs Using |>

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- as_tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble |>  
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |>  
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) |>  
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

Previously… (2/2)

Visualizing Numerical Data using ggplot2

Example: Plotting lengths by species in the iris data set.

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Case Study 1: Titanic Data Set

The Titanic data set is a popular dataset in data science. It contains information about the passengers and crew aboard the Titanic and is often used for survival prediction.

# The titanic.csv data is from https://www.kaggle.com/datasets/aliaamiri/titanic-passengers-and-crew-complete-list
titanic <- read_csv("titanic.csv") |> 
  mutate(age = round(age,0),
         # collapse the four crew categories of `class` into a single "crew" level with `case_when`
         class = case_when(
           class == "deck crew" ~ "crew",
           class == "engineering crew" ~ "crew",
           class == "restaurant staff" ~ "crew",
           class == "victualling crew" ~ "crew",
           TRUE ~ class
         ))
glimpse(titanic)

## Rows: 2,207
## Columns: 11
## $ name     <chr> "Abbing, Mr. Anthony", "Abbott, Mr. Eugene Joseph", "Abbott, …
## $ gender   <chr> "male", "male", "male", "female", "female", "male", "male", "…
## $ age      <dbl> 42, 13, 16, 39, 16, 25, 30, 28, 27, 20, 30, 27, 40, 1, 18, 35…
## $ class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "2nd", "2nd", "3rd"…
## $ embarked <chr> "S", "S", "S", "S", "S", "S", "C", "C", "C", "S", "S", "S", "…
## $ country  <chr> "United States", "United States", "United States", "England",…
## $ ticketno <dbl> 5547, 2673, 2673, 2673, 348125, 348122, 3381, 3381, 2699, 310…
## $ fare     <dbl> 7.1100, 20.0500, 20.0500, 20.0500, 7.1300, 7.1300, 24.0000, 2…
## $ sibsp    <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ parch    <dbl> 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ survived <chr> "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "y…

Presenting Tables

To present tables effectively, we can use kable from the knitr package.

Load Package

library(knitr)

Example: The contingency table below shows how survival is distributed across the passenger classes and crew on the Titanic.

titanic_surv_class <- titanic |> select(survived,class) |> 
  group_by(survived,class) |> 
  summarise(total = n(),
            .groups = 'drop')
kable(xtabs(total ~ survived + class, titanic_surv_class))

	1st	2nd	3rd	crew
no	123	166	528	679
yes	201	118	181	211
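Raw counts are hard to compare because the classes differ in size, so it can help to convert the table to proportions. The sketch below is one way to do that in base R. It re-enters the counts printed above by hand (an assumption made so the snippet runs without titanic.csv); with the real data, the same calls should work directly on the result of xtabs().

```r
# Counts copied from the contingency table above
surv_class <- as.table(matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class    = c("1st", "2nd", "3rd", "crew"))
))

# Column proportions: within each class, the share that did / did not survive
col_props <- prop.table(surv_class, margin = 2)
round(col_props, 3)

# Raw counts with row and column totals appended
addmargins(surv_class)
```

With these counts, about 62% of first-class passengers survived (201 of 324), compared with about 24% of the crew (211 of 890).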

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plot that shows proportions instead of frequencies is called a relative frequency bar plot.

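library(gridExtra)  # provides grid.arrange()

# bar heights are the number of rows in each category
p1 <- ggplot(titanic,aes(x=survived)) + 
  geom_bar() + 
  ggtitle("Frequencies") + 
  theme_minimal()
# after_stat() replaces the deprecated ..count.. notation
p2 <- ggplot(titanic,aes(x=survived)) + 
  geom_bar(aes(y=after_stat(count/sum(count)))) + 
  ylab("proportion") + 
  ggtitle("Relative Frequencies") + 
  theme_minimal()
grid.arrange(p1, p2, ncol=2)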
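An alternative to computing proportions inside geom_bar() is to compute them first and then draw them with geom_col(). The sketch below assumes the outcome totals obtained by summing the rows of the survival-by-class table shown earlier; with the real data, titanic |> count(survived) should produce the same starting point.

```r
library(dplyr)
library(ggplot2)

# Totals per outcome, summed from the survival-by-class table shown earlier
surv_counts <- tibble(
  survived = c("no", "yes"),
  n        = c(123 + 166 + 528 + 679, 201 + 118 + 181 + 211)
)

# Relative frequency = count divided by the total number of observations
surv_props <- surv_counts |>
  mutate(prop = n / sum(n))

# geom_col() draws the precomputed proportions as bar heights
ggplot(surv_props, aes(x = survived, y = prop)) +
  geom_col() +
  ylab("proportion") +
  ggtitle("Relative Frequencies") +
  theme_minimal()
```

Precomputing keeps the proportions available as a data frame, which is convenient if they also need to appear in a table.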

Bar Plots with Two Variables

p1 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar() + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Stacked") + 
  theme_minimal()
p2 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar(position="dodge") + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Side-by-Side") + 
  theme_minimal()
p3 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar(position="fill") + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Standardized") + 
  theme_minimal()
grid.arrange(p1, p2, p3, ncol=3)

★ Key Idea: Each visualization provides a different perspective:

Stacked helps compare total passengers across classes.
Side-by-Side helps compare survival counts directly.
Standardized helps compare survival rates across classes.

Mosaic Plots

A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.

library(ggmosaic)  # provides geom_mosaic() and theme_mosaic()

ggplot(titanic) + 
  geom_mosaic(aes(x=product(class,survived), fill = survived)) + 
  scale_fill_brewer(palette = 1) +
  theme_mosaic() + 
  theme(legend.position = "none")

Features:

Each tile represents a combination of categories from two or more categorical variables.
The area of each tile is proportional to the frequency or probability of that combination.
Provides an intuitive way to analyze relationships, such as independence or associations between categorical variables.
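A numerical counterpart to reading independence off a mosaic plot is to compare the observed counts with the counts expected if survival and class were independent. The sketch below is not part of the original material: it borrows chisq.test() from base R only to obtain the expected counts, and it re-enters the survival-by-class counts shown earlier by hand.

```r
# Counts copied from the survival-by-class table shown earlier
surv_class <- as.table(matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class    = c("1st", "2nd", "3rd", "crew"))
))

# Expected count for a cell under independence:
# (row total * column total) / grand total
expected <- chisq.test(surv_class)$expected
round(expected, 1)

# Positive values mark cells with more observations than independence predicts
round(surv_class - expected, 1)
```

For example, first class has 201 survivors against roughly 104 expected under independence, which corresponds to the tile that looks larger than its neighbors in the mosaic plot.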

Pie Charts

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.

Features:

Each slice represents a category’s proportion relative to the whole dataset.
The total sum of all slices equals 100%.
Useful for showing part-to-whole relationships.

# geom_bar() counts the rows per survival status, so each slice is proportional to that count;
# coord_polar("y") then wraps the single stacked bar into a circle
p1 <- ggplot(titanic |> filter(class=="1st"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("1st Class") + 
  theme_void()
p2 <- ggplot(titanic |> filter(class=="2nd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("2nd Class") + 
  theme_void()
p3 <- ggplot(titanic |> filter(class=="3rd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("3rd Class") + 
  theme_void()
p4 <- ggplot(titanic |> filter(class=="crew"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("Crew") + 
  theme_void()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
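The four near-identical plots can also be produced in one call by computing the slice percentages per class and faceting. This is a sketch under the assumption that the counts from the survival-by-class table shown earlier are entered by hand; the names pie_data and pct are illustrative.

```r
library(dplyr)
library(ggplot2)

# Counts copied from the survival-by-class table shown earlier
surv_class <- tibble(
  class    = rep(c("1st", "2nd", "3rd", "crew"), each = 2),
  survived = rep(c("no", "yes"), times = 4),
  total    = c(123, 201, 166, 118, 528, 181, 679, 211)
)

# Percentage of each class that falls in each slice; sums to 100 within a class
pie_data <- surv_class |>
  group_by(class) |>
  mutate(pct = 100 * total / sum(total)) |>
  ungroup()

# One stacked bar per class, wrapped into a circle, one panel per class
pies <- ggplot(pie_data, aes(x = "", y = pct, fill = survived)) +
  geom_col(width = 1) +
  coord_polar("y", start = 0) +
  facet_wrap(~ class, nrow = 2) +
  scale_fill_brewer(palette = 1) +
  theme_void()
pies
```

Because every class is rescaled to 100%, all four pies are full circles and share a single legend.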

Comparing Categorical Data to Numerical Data

Example: Age distribution by passenger class and survival status.

ggplot(titanic,aes(y=class,x=age, fill=survived)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
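The box in a boxplot spans the first and third quartiles, with a line at the median, and these values can also be computed directly. The sketch below uses the built-in iris data so that it runs without titanic.csv; for the Titanic plot, the analogous call would presumably group by class and survived and summarise age, possibly with na.rm = TRUE if ages are missing (an assumption, since the data are not inspected here).

```r
library(dplyr)

# Quartiles and median of petal length for each species
box_stats <- iris |>
  as_tibble() |>
  group_by(Species) |>
  summarise(
    q1     = quantile(Petal.Length, 0.25),
    median = median(Petal.Length),
    q3     = quantile(Petal.Length, 0.75),
    .groups = "drop"
  )
box_stats
```

Each row of box_stats corresponds to one box: q1 and q3 are its edges and median is the line inside it.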

Adding More Variables