Basics of Visualizations

Fundamentals of Data Science

MTH-391A | Spring 2025 | University of Portland

February 10, 2025

Objectives

Previously… (1/2)

Chaining dplyr Verbs Using |>

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble |>
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") |>
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length, Petal.Length) |>
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length / Sepal.Length)
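To see what the pipe buys us, here is a small sketch that writes the same three rules twice: once chained with |> and once as nested function calls. It loads only dplyr (one of the tidyverse packages) and uses as_tibble(), which is an assumption on my part; the names piped and nested are made up for this illustration.

```r
library(dplyr)  # part of the tidyverse; provides filter(), select(), mutate()

# assumption: as_tibble() is used here to convert the built-in iris data frame
iris_tibble <- as_tibble(iris)

# piped version: the rules read top to bottom
piped <- iris_tibble |>
  filter(Species == "setosa") |>
  select(Sepal.Length, Petal.Length) |>
  mutate(length_ratio = Petal.Length / Sepal.Length)

# nested version: the same rules, but read from the inside out
nested <- mutate(
  select(
    filter(iris_tibble, Species == "setosa"),
    Sepal.Length, Petal.Length
  ),
  length_ratio = Petal.Length / Sepal.Length
)

# compare the two results; all.equal() reports whether they match
all.equal(piped, nested)
```

Both versions apply the same verbs in the same order. The pipe simply passes the result on its left as the first argument of the function on its right, which keeps the steps in reading order.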

Previously… (2/2)

The guiding principle of data science is the data science life cycle.

The Data Science Life Cycle


tidyverse Core Packages for Data Visualizations

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data visualizations:

  • ggplot2 is a system for creating graphics, where you provide the data, specify how to map variables to plots, and it handles the details.

Data Visualization Using ggplot2

What is ggplot2?

Why use ggplot2?

The Grammar of Graphics

What is the Grammar of Graphics?

Key Components of ggplot2

Example: iris Data Set Scatter Plots

Plotting iris lengths

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Layered Approach

\(\dagger\) Try out the above example code sequence with variables Sepal.Width and Petal.Width.

\(\star\) Note that the + operator here adds a layer to the plot; it does not add numbers.
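One way to see the layered approach is to store the plot in an object and add one layer at a time. The sketch below assumes the built-in iris data frame is used directly, and the object name p is made up for this illustration.

```r
library(ggplot2)

# step 1: data and aesthetic mappings only; printing p now would draw empty axes
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# step 2: add a scatter plot layer to the stored plot
p <- p + geom_point()

# step 3: add the theme layer, then print the finished plot
p <- p + theme_grey()
p
```

Each + returns a new plot object, so the intermediate versions can be printed to check what a single layer contributes.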

Aesthetics

The aes() function maps data variables to visual properties like position, color, size, and shape.

Plotting iris lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Common Aesthetic Mappings

\(\star\) Note that the aes() function is called within the ggplot() function as the second argument.
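As a rough sketch of the difference between mapping and setting a visual property: inside aes(), a property is tied to a variable and changes with the data; outside aes(), it is a fixed value applied to every point. The values 3 and 0.6 and the object name p_aes are arbitrary choices for this illustration.

```r
library(ggplot2)

p_aes <- ggplot(iris,
                aes(x = Sepal.Length,
                    y = Petal.Length,
                    color = Species,     # mapped: one color per species
                    shape = Species)) +  # mapped: one point shape per species
  # set (not mapped): every point gets the same size and transparency
  geom_point(size = 3, alpha = 0.6) +
  theme_grey()
p_aes
```

Because color and shape are mapped to Species, ggplot2 builds the legend automatically; the fixed size and alpha values do not appear in it.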

Layering

The + operator adds layers to a plot. Each layer either customizes the plot or adds more information to it.

Plotting iris lengths by species with regression lines

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add regression lines
  geom_smooth(method = 'lm',
              formula = y~x) +
  # add theme layer
  theme_grey()

Layering to add more information

\(\star\) Key Idea: Every subsequent layer inherits the aesthetic mappings defined by aes() inside the ggplot() function, unless the layer supplies its own.
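A short sketch of what inheritance means in practice, assuming the built-in iris data frame; the names p_global and p_local are made up for this illustration. Where color = Species is placed decides which layers are split by species.

```r
library(ggplot2)

# color mapped in ggplot(): both the points and the regression lines
# inherit it, so there is one line per species
p_global <- ggplot(iris,
                   aes(x = Sepal.Length,
                       y = Petal.Length,
                       color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped only in geom_point(): the points are colored by species,
# but geom_smooth() sees no grouping and fits a single line to all the data
p_local <- ggplot(iris,
                  aes(x = Sepal.Length,
                      y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_global
p_local
```

Mappings placed in ggplot() are shared by all layers; mappings placed inside a geom apply to that layer only.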

Common Geometries for Numerical Variables

Geom Function | Description
geom_point() | Scatter plot for visualizing relationships between two numerical variables.
geom_line() | Line plot for trends over time or continuous sequences.
geom_histogram() | Histogram for visualizing the distribution of a single numerical variable.
geom_boxplot() | Box plot for showing distributions and detecting outliers.

\(\star\) Be careful when defining variables in aes(). For example, geom_histogram() only requires an x-axis variable, as it plots the distribution of a single numerical variable.
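Different geoms expect different aesthetics. As a minimal sketch, a box plot of sepal length for each species takes the grouping variable on x and the numerical variable on y. This assumes the built-in iris data frame, and the object name p_box is made up for this illustration.

```r
library(ggplot2)

# establish data and variables: one box per species
p_box <- ggplot(iris,
                aes(x = Species,
                    y = Sepal.Length)) +
  # draw box plots; points beyond the whiskers are drawn as outliers
  geom_boxplot() +
  # add theme layer
  theme_grey()
p_box
```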

Example: iris Data Set Histograms

Plotting the distribution of iris sepal lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           fill = Species)) +
  # draw histogram
  geom_histogram(bins = 10) + 
  # add theme layer
  theme_grey()

\(\star\) The geom_histogram() function lets you set the number of bins with the bins argument, which changes how coarse or fine the distribution appears. We will discuss this later.
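As a quick preview of the effect, the sketch below draws the same variable with two bin counts. The values 5 and 30 are arbitrary choices, and the names p_coarse and p_fine are made up for this illustration.

```r
library(ggplot2)

# few bins: a coarse summary of the distribution
p_coarse <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 5) +
  theme_grey()

# many bins: finer detail, but also more noise from bin to bin
p_fine <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 30) +
  theme_grey()

p_coarse
p_fine
```

Both plots show the same 150 observations; only the width of the intervals used for counting changes.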

Activity: Visualize Numerical Data

  1. Log in to Posit Cloud and open the RStudio assignment MA7: Visualize Numerical Data.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.