Advanced Visualizations &
Exploring Categorical Data

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

February 26, 2025

Objectives

Previously… (1/3)

Chaining dplyr Verbs Using %>%

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble %>%  
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") %>%  
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) %>%   
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

Previously… (2/3)

Visualizing Numerical Data using ggplot2

Example: Plotting lengths by species in the iris data set.

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Previously… (3/3)

Descriptive statistics

Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

For Categorical Variables

Case Study 1: Titanic Data Set

The Titanic Data Set is a popular dataset in data science. It contains information about the people aboard the Titanic and is often used for survival prediction.

The sinking of the Titanic illustration by Willy Stöwer

The Titanic Data Matrix

The data set has \(2207\) observations, covering both passengers and crew.

Contingency Tables

A table that summarizes data for two categorical variables is called a contingency table.

Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

Contingency Tables: Using R

Load Packages

library(tidyverse)
library(knitr)

Producing and presenting Contingency Tables in R

titanic_surv_class <- titanic %>% 
  select(survived,class) %>% 
  group_by(survived,class) %>% 
  summarise(total = n(),
            .groups = 'drop')
kable(addmargins(xtabs(total ~ survived + class, titanic_surv_class)))
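The course's titanic data frame is not reproduced in these slides, so the following is only a minimal sketch that rebuilds the same table from the counts printed in the example. The name titanic_counts is hypothetical, and typing the counts in by hand is an assumption made so the snippet runs on its own.

```r
library(tidyverse)

# Hypothetical stand-in for the summarised data: counts typed in from the
# contingency table in the example (not read from the course data file).
titanic_counts <- tibble(
  survived = rep(c("no", "yes"), each = 4),
  class    = rep(c("1st", "2nd", "3rd", "crew"), times = 2),
  total    = c(123, 166, 528, 679,
               201, 118, 181, 211)
)

# Cross-tabulate the counts, then append the row and column sums.
surv_by_class <- xtabs(total ~ survived + class, data = titanic_counts)
surv_with_margins <- addmargins(surv_by_class)
surv_with_margins
```

Wrapping the last object in kable() from knitr, as in the slide code, should only change how the table is rendered, not its values.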

Row Proportions

Among those who survived, what proportion belonged to each passenger class?

Contingency Table:

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

To answer this question we examine the row proportions:

Each passenger class appears to make up an almost equal proportion of the survivors. Note that this considers only the cases who survived.
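One way to obtain row proportions in R is base R's prop.table() with margin = 1. This is a minimal sketch, not necessarily how the slide's figures were produced. The matrix surv_by_class is a hypothetical name, filled in by hand from the contingency table above.

```r
# Counts copied from the contingency table above (margins left out).
surv_by_class <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class    = c("1st", "2nd", "3rd", "crew"))
)

# margin = 1 divides each cell by its row total, so every row sums to 1.
row_props <- prop.table(surv_by_class, margin = 1)
round(row_props, 3)
```

In the "yes" row the shares are 201/711, 118/711, 181/711, and 211/711, or about 0.283, 0.166, 0.255, and 0.297.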

Column Proportions

Does there appear to be a relationship between class and survival for passengers on the Titanic?

Contingency Table:

survived   1st   2nd   3rd   crew    Sum
no         123   166   528    679   1496
yes        201   118   181    211    711
Sum        324   284   709    890   2207

To answer this question we examine the column proportions:

The disproportionate survival of 1st class passengers suggests a relationship between class and survival.
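Column proportions can be sketched the same way with prop.table() and margin = 2. As before, surv_by_class is a hypothetical name and the counts are typed in from the contingency table above, so this illustrates the calculation rather than reproducing the slide's original output.

```r
# Counts copied from the contingency table above (margins left out).
surv_by_class <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class    = c("1st", "2nd", "3rd", "crew"))
)

# margin = 2 divides each cell by its column total, so every column sums to 1.
col_props <- prop.table(surv_by_class, margin = 2)
round(col_props, 3)

# The "yes" row is the survival rate within each class.
survival_rate <- col_props["yes", ]
```

Here the survival rates are 201/324, 118/284, 181/709, and 211/890, or about 0.620, 0.415, 0.255, and 0.237 for 1st class, 2nd class, 3rd class, and crew.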

\(\star\) Key Idea: Row and column proportions answer different questions, and both are important. Row proportions show a near-equal split of passenger classes among survivors, while column proportions reveal a higher survival rate for 1st class, consistent with 1st class passengers being prioritized for the lifeboats.

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

Bar Plots: Using R

Load Packages

library(tidyverse)
library(gridExtra)
library(openintro) # provides the COL color matrix used below

Plotting Bar Plots (Raw Counts and Proportions)

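# raw counts; COL is the color matrix from the openintro package
p1 <- ggplot(titanic,aes(x=survived)) + 
  geom_bar(fill=COL[1,1]) + 
  ggtitle("Frequencies") + 
  theme_minimal()
# proportions: divide each bar's count by the total count
# (after_stat() replaces the deprecated ..count.. notation)
p2 <- ggplot(titanic,aes(x=survived)) + 
  geom_bar(aes(y=after_stat(count/sum(count))),fill=COL[1,1]) + 
  ylab("proportion") + 
  ggtitle("Relative Frequencies") + 
  theme_minimal()
# place the two plots side by side
grid.arrange(p1, p2, ncol=2)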

Bar Plots vs Histograms

How are bar plots different from histograms?

Bar Plots with Two Variables

Bar plots are graphical representations of categorical data using rectangular bars of varying heights.

Features:

Types of Bar Plots:

Examples of Bar Plots

The following bar plots still use the geom_bar() layer, but with the position argument set explicitly.

\(\star\) Key Idea: Each visualization provides a different perspective: Stacked helps compare total passengers across classes. Side-by-Side helps compare survival counts directly. Standardized helps compare survival rates across classes.
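The three views can be sketched with the position argument of geom_bar(). The slides' own plotting code is not shown, so this is an illustration under assumptions: titanic_sketch is a hypothetical stand-in for the course's titanic data frame, rebuilt from the contingency table counts, and the plot names are made up for the example.

```r
library(tidyverse)
library(gridExtra)

# Hypothetical stand-in for the course's `titanic` data frame: one row per
# person, rebuilt from the contingency table counts shown earlier.
titanic_sketch <- tibble(
  survived = rep(c("no", "yes"), each = 4),
  class    = rep(c("1st", "2nd", "3rd", "crew"), times = 2),
  total    = c(123, 166, 528, 679, 201, 118, 181, 211)
) %>%
  uncount(total)  # repeat each row `total` times

base_plot <- ggplot(titanic_sketch, aes(x = class, fill = survived)) +
  theme_minimal()

# Stacked: bar height is the class total, split by survival.
p_stacked <- base_plot + geom_bar(position = "stack") + ggtitle("Stacked")
# Side-by-side: one bar per survival status within each class.
p_dodged <- base_plot + geom_bar(position = "dodge") + ggtitle("Side-by-Side")
# Standardized: every bar is rescaled to height 1, showing proportions.
p_filled <- base_plot + geom_bar(position = "fill") +
  ylab("proportion") + ggtitle("Standardized")

grid.arrange(p_stacked, p_dodged, p_filled, ncol = 3)
```

Only the position value changes between the three plots, since "stack" is already the default for geom_bar().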

Mosaic Plots

A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.

Features:

Mosaic Plots: Using R

Load Packages

library(tidyverse)
library(ggmosaic)

Visualizing Categorical Variables using a Mosaic Plot

ggplot(titanic) + 
  geom_mosaic(aes(x=product(class,survived), fill = survived)) + 
  scale_fill_brewer(palette = 1) +
  theme_mosaic() + 
  theme(legend.position = "none")

Pie Charts

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.

Features:

Pie Charts: Using R

Load Packages

library(tidyverse)

Visualizing Categorical Variables using Pie Charts

ggplot(titanic %>% filter(class=="1st"), aes(x="", fill=survived)) +
  # count 1st class passengers per survival status in one stacked bar
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  # wrap the stacked bar around a circle so counts become angles
  coord_polar("y", start=0) + 
  ggtitle("1st Class") + 
  theme_void()

Comparing Categorical Data to Numerical Data

Example: Age distribution by passenger class and survival status.

\(\star\) The boxplot visually compares the spread, median, and possible outliers of age distributions among different classes while distinguishing survival status using color. This helps in analyzing whether age and class had an impact on survival rates on the Titanic.

Comparing Categorical Data to Numerical Data: Using R

Load Packages

library(tidyverse)

Visualizing Multiple Variables

ggplot(titanic,aes(y=class,x=age, fill=survived)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()

Activity: Visualize Multiple Variables

  1. Log in to Posit Cloud and open the RStudio assignment W 2/26 - Visualize Multiple Variables Techniques.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/
Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman & Hall/CRC. https://probstatsdata.com/