Transforming Tables for Visualizations

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

February 12, 2025

Objectives

Previously… (1/2)

Chaining dplyr Verbs Using |>

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble |>  
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |>  
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) |>  
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

Previously… (2/2)

Visualizing Numerical Data using ggplot2

Example: Plotting lengths by species in the iris data set.

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Case Study 1: Titanic Data Set

The Titanic Data Set is a popular dataset in data science. It contains information about the passengers and crew aboard the Titanic and is often used for survival prediction.

# The titanic.csv data is from https://www.kaggle.com/datasets/aliaamiri/titanic-passengers-and-crew-complete-list
titanic <- read_csv("titanic.csv") |> 
  mutate(age = round(age,0),
         # recode the "crew" class of the class categorical variable using the `case_when` function
         class = case_when(
           class == "deck crew" ~ "crew",
           class == "engineering crew" ~ "crew",
           class == "restaurant staff" ~ "crew",
           class == "victualling crew" ~ "crew",
           TRUE ~ class
         ))
glimpse(titanic)
## Rows: 2,207
## Columns: 11
## $ name     <chr> "Abbing, Mr. Anthony", "Abbott, Mr. Eugene Joseph", "Abbott, …
## $ gender   <chr> "male", "male", "male", "female", "female", "male", "male", "…
## $ age      <dbl> 42, 13, 16, 39, 16, 25, 30, 28, 27, 20, 30, 27, 40, 1, 18, 35…
## $ class    <chr> "3rd", "3rd", "3rd", "3rd", "3rd", "3rd", "2nd", "2nd", "3rd"…
## $ embarked <chr> "S", "S", "S", "S", "S", "S", "C", "C", "C", "S", "S", "S", "…
## $ country  <chr> "United States", "United States", "United States", "England",…
## $ ticketno <dbl> 5547, 2673, 2673, 2673, 348125, 348122, 3381, 3381, 2699, 310…
## $ fare     <dbl> 7.1100, 20.0500, 20.0500, 20.0500, 7.1300, 7.1300, 24.0000, 2…
## $ sibsp    <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ parch    <dbl> 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0…
## $ survived <chr> "no", "no", "no", "yes", "yes", "yes", "no", "yes", "yes", "y…
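The four repeated `case_when` branches above can also be collapsed into a single membership test. The sketch below is one possible alternative, not the method used in the slides; the `toy` tibble and the `crew_labels` vector are made up for illustration and only mirror the class labels recoded above.

```r
library(dplyr)

# Toy data (hypothetical): class labels mirroring those recoded in the slide
toy <- tibble(
  class = c("1st", "deck crew", "engineering crew",
            "restaurant staff", "victualling crew", "3rd")
)

# Labels to collapse into a single "crew" level
crew_labels <- c("deck crew", "engineering crew",
                 "restaurant staff", "victualling crew")

# Replace any crew label with "crew"; leave every other class unchanged
toy_recoded <- toy |>
  mutate(class = if_else(class %in% crew_labels, "crew", class))

toy_recoded
```

Keeping the labels in one vector means a new crew category only has to be added in one place.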

Presenting Tables

To present tables effectively, we can use kable from the knitr package.

Load Package

library(knitr)

Example: The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.

titanic_surv_class <- titanic |> select(survived,class) |> 
  group_by(survived,class) |> 
  summarise(total = n(),
            .groups = 'drop')
kable(xtabs(total ~ survived + class, titanic_surv_class))
survived   1st   2nd   3rd   crew
no         123   166   528   679
yes        201   118   181   211
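A similar two-way table can be built with tidyverse verbs alone. Since `titanic.csv` may not be at hand, the sketch below assumes base R's built-in `Titanic` table as a stand-in, expanded to one row per person with `uncount()`. It comes from a different source, so its column names (`Class`, `Survived`) and counts will not match the table above exactly.

```r
library(dplyr)
library(tidyr)

# Stand-in data (assumption): base R's Titanic table, one row per person
people <- as.data.frame(Titanic) |>
  uncount(Freq)

# Count each survival/class combination, then spread the classes into columns
surv_by_class <- people |>
  count(Survived, Class) |>
  pivot_wider(names_from = Class, values_from = n)

surv_by_class
```

The resulting tibble can be passed to `kable()` in the same way as the `xtabs()` result.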

Bar Plots

A bar plot is a common way to display a single categorical variable. A bar plot where proportions instead of frequencies are shown is called a relative frequency bar plot.

library(gridExtra) # provides grid.arrange()
p1 <- ggplot(titanic,aes(x=survived)) + 
  geom_bar() + 
  ggtitle("Frequencies") + 
  theme_minimal()
p2 <- ggplot(titanic,aes(x=survived)) + 
  # after_stat() replaces the deprecated ..count.. notation
  geom_bar(aes(y=after_stat(count/sum(count)))) + 
  ylab("proportion") + 
  ggtitle("Relative Frequencies") + 
  theme_minimal()
grid.arrange(p1, p2, ncol=2)
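The proportions in a relative frequency bar plot can also be computed explicitly before plotting, which makes the arithmetic visible. This is a minimal sketch that assumes base R's built-in `Titanic` table as a stand-in for `titanic.csv`; the names `people`, `surv_props`, and `p_rel` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in data (assumption): base R's Titanic table, one row per person
people <- as.data.frame(Titanic) |>
  uncount(Freq)

# Frequency of each survival outcome and its share of the total
surv_props <- people |>
  count(Survived) |>
  mutate(prop = n / sum(n))

# geom_col() draws bars at the heights already stored in the data
p_rel <- ggplot(surv_props, aes(x = Survived, y = prop)) +
  geom_col() +
  ylab("proportion") +
  theme_minimal()

p_rel
```

Computing the table first is convenient when the same proportions are also reported in text or in a `kable()` table.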

Bar Plots with Two Variables

p1 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar() + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Stacked") + 
  theme_minimal()
p2 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar(position="dodge") + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Side-by-Side") + 
  theme_minimal()
p3 <- ggplot(titanic,aes(x=class,fill=survived)) + 
  geom_bar(position="fill") + 
  scale_fill_manual(values=c("yes" = "blue", "no" = "red")) + 
  ggtitle("Standardized") + 
  theme_minimal()
grid.arrange(p1, p2, p3, ncol=3)

\(\star\) Key Idea: Each visualization offers a different perspective. The stacked plot compares total passengers across classes, the side-by-side plot compares survival counts directly, and the standardized plot compares survival rates across classes.
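The survival rates that the standardized plot displays can be computed directly as a check. The sketch below again assumes base R's built-in `Titanic` table as a stand-in for `titanic.csv`, so the exact counts differ slightly from the course data.

```r
library(dplyr)
library(tidyr)

# Stand-in data (assumption): base R's Titanic table, one row per person
people <- as.data.frame(Titanic) |>
  uncount(Freq)

# Within each class: group size, number of survivors, and survival rate
surv_rates <- people |>
  group_by(Class) |>
  summarise(
    total    = n(),
    survived = sum(Survived == "Yes"),
    rate     = survived / total
  )

surv_rates
```

The `total` column corresponds to the bar heights in the stacked plot, and `rate` to the "yes" share of each bar in the standardized plot.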

Mosaic Plots

A mosaic plot is a graphical representation of categorical data, displaying proportions and relationships between multiple categorical variables using a tiled area chart.

library(ggmosaic) # provides geom_mosaic(), product(), and theme_mosaic()
ggplot(titanic) + 
  geom_mosaic(aes(x=product(class,survived), fill = survived)) + 
  scale_fill_brewer(palette = 1) +
  theme_mosaic() + 
  theme(legend.position = "none")
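Each tile's area in a mosaic plot is proportional to the joint proportion of its cell in the contingency table. As a rough illustration that needs no extra package, the sketch below computes those joint proportions and draws a mosaic plot with base R's `mosaicplot()`. It assumes the built-in `Titanic` table as a stand-in for `titanic.csv`.

```r
# Class-by-Survived counts from base R's built-in Titanic table (assumption:
# used here as a stand-in for the course data); dimension 1 is Class and
# dimension 4 is Survived
class_surv <- margin.table(Titanic, c(1, 4))

# Joint proportions: each cell divided by the grand total
joint <- prop.table(class_surv)
round(joint, 3)

# Base R mosaic plot of the same two variables
mosaicplot(~ Class + Survived, data = Titanic, color = TRUE,
           main = "Survival by Class")
```

The widest column in the plot belongs to the class with the most people, and the split within each column shows that class's survival proportions.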

Features:

Pie Charts

A pie chart is a circular statistical graphic divided into slices to illustrate numerical proportions of a whole.

Features:

# geom_bar() counts the rows in each survival group, so no y aesthetic is needed;
# coord_polar("y") then turns each stacked bar into a pie
p1 <- ggplot(titanic |> filter(class=="1st"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("1st Class") + 
  theme_void()
p2 <- ggplot(titanic |> filter(class=="2nd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("2nd Class") + 
  theme_void()
p3 <- ggplot(titanic |> filter(class=="3rd"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("3rd Class") + 
  theme_void()
p4 <- ggplot(titanic |> filter(class=="crew"), aes(x="", fill=survived)) +
  geom_bar(width=1) +
  scale_fill_brewer(palette = 1) +
  coord_polar("y", start=0) + 
  ggtitle("Crew") + 
  theme_void()
grid.arrange(p1, p2, p3, p4, nrow=2, ncol=2)
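The four nearly identical plot definitions can be replaced by a single faceted plot. This is a sketch of one possible alternative, assuming base R's built-in `Titanic` table as a stand-in for `titanic.csv`; the names `people` and `p_pies` are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Stand-in data (assumption): base R's Titanic table, one row per person
people <- as.data.frame(Titanic) |>
  uncount(Freq)

# One pie per class: position = "fill" rescales each bar to proportions,
# so every pie is a full circle regardless of class size
p_pies <- ggplot(people, aes(x = "", fill = Survived)) +
  geom_bar(width = 1, position = "fill") +
  coord_polar("y", start = 0) +
  facet_wrap(~ Class) +
  scale_fill_brewer(palette = 1) +
  theme_void()

p_pies
```

Faceting keeps the fill scale and legend shared across the pies, so the colors are guaranteed to mean the same thing in every panel.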

Comparing Categorical Data to Numerical Data

Example: Age distribution by passenger class and survival status.

ggplot(titanic,aes(y=class,x=age, fill=survived)) + 
  geom_boxplot() + 
  scale_fill_brewer(palette = "Set2") +
  theme_minimal()
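A box plot summarizes each group by its median and quartiles, and those numbers can be computed directly with `group_by()` and `summarise()`. Because the built-in `Titanic` table has no numeric age column, this sketch assumes the `iris` data from earlier in the slides as a stand-in.

```r
library(dplyr)

# Quartiles, median, and interquartile range of Petal.Length for each species
# (iris is used as a stand-in for the Titanic ages)
box_stats <- iris |>
  group_by(Species) |>
  summarise(
    q1  = quantile(Petal.Length, 0.25),
    med = median(Petal.Length),
    q3  = quantile(Petal.Length, 0.75),
    iqr = IQR(Petal.Length)
  )

box_stats
```

The `q1`, `med`, and `q3` columns correspond to the lower edge, middle line, and upper edge of each box.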

Adding More Variables

Example: Fare versus age, with survival shown by color and gender by point shape.

ggplot(titanic,aes(y=fare,x=age, color=survived, shape = gender)) + 
  geom_point() + 
  # the color aesthetic is mapped, so use the color scale rather than the fill scale
  scale_color_brewer(palette = "Set2") +
  theme_minimal()

Activity: Visualize Multiple Variables

  1. Log in to Posit Cloud and open the RStudio assignment MA8: Visualize Multiple Variables.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.