MTH-391A | Spring 2025 | University of Portland
February 10, 2025
ggplot2 package
Chaining dplyr Verbs Using |>
Load Packages
Define Data Frame as a Tibble
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
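The chain described in this example can be sketched as follows. This is a minimal sketch, not necessarily the code shown in lecture: it assumes iris_tibble is created with as_tibble(iris), and the column name petal_sepal_ratio is a hypothetical name chosen for illustration.

```r
library(dplyr)
library(tibble)

# Assumption: the tibble is built directly from the built-in iris data frame
iris_tibble <- as_tibble(iris)

setosa_ratio <- iris_tibble |>
  # keep only the observations belonging to the setosa species
  filter(Species == "setosa") |>
  # petal_sepal_ratio is a hypothetical column name for the computed ratio
  mutate(petal_sepal_ratio = Petal.Length / Sepal.Length)

setosa_ratio
```

Each verb receives the result of the previous step through |>, so the pipeline reads top to bottom in the order the operations are applied.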
The guiding principle of data science is the data science life cycle.
The Data Science Life Cycle
tidyverse Core Packages for Data Visualizations
tidyverse is a collection of packages suited for data processing and visualization.
Core packages specifically for data visualizations:
ggplot2 is a system for creating graphics, where you provide the data, specify how to map variables to plots, and it handles the details.
ggplot2
What is ggplot2?
ggplot2 is a powerful R package designed for data visualizations.
It is part of the tidyverse ecosystem.
Plots are built up in layers using the + operator.
Why use ggplot2?
It works alongside other tidyverse packages such as dplyr for data wrangling.
What is the Grammar of Graphics?
The ggplot2 package in R is based on this framework, allowing for a highly customizable and layered approach to data visualization.
ggplot2
Data: The dataset being visualized in tibble form.
Aesthetics (aes): The mapping of data variables to visual properties like position, color, size, and shape.
Geometries (geom): The type of plot (e.g., points, lines, bars) that represents the data.
Facets: Splitting data into multiple panels for comparison.
Statistics (stat): Computations applied to the data before plotting (e.g., smoothing, binning).
Coordinates (coord): The system defining how data is mapped onto the plot (e.g., Cartesian, polar).
Themes: Controls the overall appearance of the plot, such as background color, grid lines, and fonts.
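To make the seven components concrete, here is a hedged sketch that uses one function for each of them in a single plot. It assumes iris_tibble is defined as as_tibble(iris); the particular choices (stat_smooth(), facet_wrap(), coord_cartesian()) are illustrative and not taken from the slides.

```r
library(ggplot2)
library(tibble)

# Data (assumption: tibble built from the built-in iris data frame)
iris_tibble <- as_tibble(iris)

components_plot <- ggplot(iris_tibble,
                          # Aesthetics: map variables to x and y positions
                          aes(x = Sepal.Length,
                              y = Petal.Length)) +
  # Geometry: draw one point per observation
  geom_point() +
  # Statistics: compute a linear fit before drawing it
  stat_smooth(method = "lm", formula = y ~ x) +
  # Facets: one panel per species
  facet_wrap(~ Species) +
  # Coordinates: the default Cartesian system, written out explicitly
  coord_cartesian() +
  # Themes: overall appearance
  theme_grey()

components_plot
```

Most plots only spell out the data, aesthetics, and a geometry; the remaining components fall back to defaults when omitted.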
iris Data Set Scatter Plots
Plotting iris lengths
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
Layered Approach
ggplot(data, aes(...)) defines the dataset and variables.
+ geom_*() specifies the type of plot.
+ facet_*(), + coord_*(), + theme_*() enhance the visualization.
\(\dagger\) Try out the above example code sequence with variables Sepal.Width and Petal.Width.
\(\star\) Note that the + operator here is used to “add” a layer, not to add numbers.
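One way to see that + adds components rather than numbers is to store the plot in a variable and add to it step by step. The sketch below assumes iris_tibble is defined as as_tibble(iris); the variable names base_plot, scatter_plot, and final_plot are hypothetical.

```r
library(ggplot2)
library(tibble)

# Assumption: tibble built from the built-in iris data frame
iris_tibble <- as_tibble(iris)

# data and variables only; no geometry has been added yet
base_plot <- ggplot(iris_tibble,
                    aes(x = Sepal.Length,
                        y = Petal.Length))

# each + returns a new plot object with one more component
scatter_plot <- base_plot + geom_point()
final_plot <- scatter_plot + theme_grey()

# count the geom layers attached to each object
length(base_plot$layers)
length(scatter_plot$layers)

final_plot
```

Printing base_plot on its own draws empty axes, because the data and variables are set but no geom layer has been added.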
The aes() function maps data variables to visual properties like position, color, size, and shape.
Plotting iris lengths by species
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
Common Aesthetic Mappings
x and y: Map variables to the horizontal and vertical axes.
color: Assign colors to different categories.
size: Control the size of points or lines based on a variable.
shape: Change the shape of points according to a categorical variable.
\(\star\) Note that the aes() function is called within the ggplot() function as the second argument.
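The size and shape mappings listed above can be tried in the same way as color. This is an illustrative sketch, assuming iris_tibble is defined as as_tibble(iris); mapping Petal.Width to size is an arbitrary choice made for the example.

```r
library(ggplot2)
library(tibble)

# Assumption: tibble built from the built-in iris data frame
iris_tibble <- as_tibble(iris)

aes_plot <- ggplot(iris_tibble,
                   aes(x = Sepal.Length,
                       y = Petal.Length,
                       # categorical variable mapped to both color and shape
                       color = Species,
                       shape = Species,
                       # numerical variable mapped to point size
                       size = Petal.Width)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()

aes_plot
```

Mapping the same variable to both color and shape is redundant on purpose: the groups stay distinguishable even when the plot is printed in grayscale.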
The + operator adds layers to the plot, either to customize it or to add more information.
Plotting iris lengths by species with regression lines
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add regression lines
  geom_smooth(method = 'lm',
              formula = y ~ x) +
  # add theme layer
  theme_grey()
Layering to add more information
geom_smooth() fits a regression line to each group, then adds this line as a layer on the plot. We will discuss this further later.
\(\star\) Key Idea: All subsequent layers inherit the aes() mappings defined in the ggplot() function.
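The inheritance rule can be checked by moving one mapping out of ggplot() and into a single layer. In the sketch below (assuming iris_tibble is defined as as_tibble(iris); the object names are hypothetical), the first plot gives every layer the color mapping, while the second gives it to the points only, so geom_smooth() no longer splits the data by species.

```r
library(ggplot2)
library(tibble)

# Assumption: tibble built from the built-in iris data frame
iris_tibble <- as_tibble(iris)

# color is defined in ggplot(), so both layers inherit it:
# one regression line per species
lines_by_species <- ggplot(iris_tibble,
                           aes(x = Sepal.Length,
                               y = Petal.Length,
                               color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color is defined inside geom_point() only, so geom_smooth()
# inherits just x and y: a single regression line for all species
single_line <- ggplot(iris_tibble,
                      aes(x = Sepal.Length,
                          y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

lines_by_species
single_line
```

Mappings placed in ggplot() are shared defaults; mappings placed inside a geom_*() call apply to that layer alone.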
| Geom | Function |
|---|---|
| geom_point() | Scatter plot for visualizing relationships between two numerical variables. |
| geom_line() | Line plot for trends over time or continuous sequences. |
| geom_histogram() | Histogram for visualizing the distribution of a single numerical variable. |
| geom_boxplot() | Box plot for showing distributions and detecting outliers. |
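As an illustration of swapping the geom while keeping the same overall structure, here is a hedged sketch of a box plot of sepal length by species. It assumes iris_tibble is defined as as_tibble(iris); the choice of variables is for illustration only.

```r
library(ggplot2)
library(tibble)

# Assumption: tibble built from the built-in iris data frame
iris_tibble <- as_tibble(iris)

box_plot <- ggplot(iris_tibble,
                   # categorical variable on x, numerical variable on y
                   aes(x = Species,
                       y = Sepal.Length)) +
  # draw one box per species
  geom_boxplot() +
  # add theme layer
  theme_grey()

box_plot
```

Only the aes() variables and the geom_*() call change relative to the scatter plot; the data and theme layers are written the same way.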
\(\star\) Be careful when defining variables in aes(). For example, geom_histogram() only requires an x-axis variable, as it plots the distribution of a single numerical variable.
iris Data Set Histograms
Plotting the distribution of iris sepal lengths by species
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           fill = Species)) +
  # draw histogram
  geom_histogram(bins = 10) +
  # add theme layer
  theme_grey()
\(\star\) The geom_histogram() function allows you to adjust the number of bins, affecting how the data is visualized. We will discuss this later.
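To see the effect of the bin count, the same histogram can be drawn with a different bins value, or with binwidth, which sets the width of each bin instead of the number of bins. This is a sketch assuming iris_tibble is defined as as_tibble(iris); the values 30 and 0.25 are arbitrary choices for illustration.

```r
library(ggplot2)
library(tibble)

# Assumption: tibble built from the built-in iris data frame
iris_tibble <- as_tibble(iris)

# shared data and variables for both histograms
hist_base <- ggplot(iris_tibble,
                    aes(x = Sepal.Length,
                        fill = Species))

# more bins: finer detail, but noisier bar heights
hist_many_bins <- hist_base +
  geom_histogram(bins = 30) +
  theme_grey()

# fixed bin width in the units of Sepal.Length
hist_binwidth <- hist_base +
  geom_histogram(binwidth = 0.25) +
  theme_grey()

hist_many_bins
hist_binwidth
```

Too few bins can hide features of the distribution, while too many can make random variation look like structure.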
Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.