MTH-361A | Spring 2025 | University of Portland
January 31, 2025
ggplot2 package
Types of Variables
Chaining dplyr Verbs Using %>%
Load Packages
Define Data Frame as a Tibble
Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.
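The example described above can be sketched as follows. This is a minimal sketch, not necessarily the code used in class: it assumes the tibble is named iris_tibble (the name used in the plotting code later in these notes), and the column name ratio is a hypothetical choice made for illustration.

```r
# Load packages
library(dplyr)
library(tibble)

# Define the data frame as a tibble (iris ships with base R)
iris_tibble <- as_tibble(iris)

# Keep the setosa rows, then compute the petal-to-sepal length ratio
# ("ratio" is a hypothetical column name chosen for illustration)
setosa_ratio <- iris_tibble %>%
  filter(Species == "setosa") %>%
  mutate(ratio = Petal.Length / Sepal.Length)

setosa_ratio
```

Each verb receives the result of the previous step through %>%, so the pipeline reads top to bottom in the order the operations are applied.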
tidyverse Core Packages for Data Visualizations
tidyverse is a collection of packages suited for data processing and visualization.
Core packages specifically for data visualizations:
ggplot2 is a system for creating graphics, where you provide the data, specify how to map variables to plots, and it handles the details.
ggplot2
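What is ggplot2?
ggplot2 is a powerful R package designed for data visualizations.
It is part of the tidyverse ecosystem.
Plots are built in layers that are combined with the + operator.
Why use ggplot2?
It works together with tidyverse packages such as dplyr for data wrangling.
What is the Grammar of Graphics?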
The ggplot2 package in R is based on this framework, allowing for a highly customizable and layered approach to data visualization.
ggplot2
Data: The dataset being visualized in tibble form.
Aesthetics (aes): The mapping of data variables to visual properties like position, color, size, and shape.
Geometries (geom): The type of plot (e.g., points, lines, bars) that represents the data.
Facets: Splitting data into multiple panels for comparison.
Statistics (stat): Computations applied to the data before plotting (e.g., smoothing, binning).
Coordinates (coord): The system defining how data is mapped onto the plot (e.g., Cartesian, polar).
Themes: Controls the overall appearance of the plot, such as background color, grid lines, and fonts.
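To make the seven components concrete, the sketch below attaches one call to each of them. It is only an illustration under assumptions: the choice of facet_wrap(), stat_smooth(), coord_cartesian(), and theme_minimal() is arbitrary, and iris_tibble is assumed to be the iris data converted to a tibble.

```r
library(ggplot2)
library(tibble)

iris_tibble <- as_tibble(iris)                          # Data

p <- ggplot(iris_tibble,
            aes(x = Sepal.Length, y = Petal.Length)) +  # Aesthetics
  geom_point() +                                        # Geometry
  facet_wrap(~ Species) +                               # Facets
  stat_smooth(method = "lm", formula = y ~ x) +         # Statistics
  coord_cartesian() +                                   # Coordinates
  theme_minimal()                                       # Theme

p
```

Only the data, aesthetics, and a geometry are needed to draw a plot; the remaining components fall back to defaults when they are left out.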
iris Data Set Scatter Plots
Plotting iris lengths
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
Layered Approach
ggplot(data, aes(...)) defines the dataset and variables.
+ geom_*() specifies the type of plot.
+ facet_*(), + coord_*(), + theme_*() enhance the visualization.
\(\star\) Note that the + operator here is used to “add” a layer, not to add numbers.
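One way to see the layered approach is to store the plot in a variable and add one piece at a time. This is a small sketch that uses the built-in iris data frame directly; the variable name p is arbitrary.

```r
library(ggplot2)

# Data and aesthetic mappings only: this draws an empty panel with axes
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# Each + adds one more layer or setting to the stored plot object
p <- p + geom_point()
p <- p + theme_grey()

p
```

Printing p after each step shows how the plot changes as layers are added.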
The aes() function maps data variables to visual properties like position, color, size, and shape.
Plotting iris lengths by species
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add theme layer
  theme_grey()
Common Aesthetics Mappings
x and y: Map variables to the horizontal and vertical axes.
color: Assign colors to different categories.
size: Control the size of points or lines based on a variable.
shape: Change the shape of points according to a categorical variable.
\(\star\) Note that the aes() function is called within the ggplot() function as the second argument.
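The shape and size aesthetics can be tried in the same way as color. The sketch below is one possible illustration on the built-in iris data; mapping size to Petal.Width and setting alpha = 0.6 are choices made here for the example, not part of the original slides.

```r
library(ggplot2)

p_aes <- ggplot(iris,
                aes(x = Sepal.Length,
                    y = Petal.Length,
                    shape = Species,        # point shape varies by category
                    size = Petal.Width)) +  # point size varies with a numeric variable
  geom_point(alpha = 0.6) +                 # fixed transparency, set outside aes()
  theme_grey()

p_aes
```

Values placed inside aes() are mapped from the data and get a legend, while values set outside aes(), such as alpha here, apply the same fixed value to every point.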
Using the + operator allows us to add layers to the plot, which is used for customizing the plot or adding more information.
Plotting iris lengths by species with regression lines
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() +
  # add regression lines
  geom_smooth(method = 'lm',
              formula = y ~ x) +
  # add theme layer
  theme_grey()
Layering to add more information
With method = 'lm', geom_smooth() fits a regression line to each group, then adds this line as a layer on the plot. We will discuss this in more detail later.
\(\star\) Key Idea: All subsequent layers inherit the aes() mappings defined in the ggplot() function.
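The inheritance rule can be checked by moving a mapping out of ggplot() and into a single layer. This is a sketch on the built-in iris data, with the hypothetical names p_global and p_local chosen for illustration.

```r
library(ggplot2)

# color mapped in ggplot(): inherited by both layers,
# so geom_smooth() fits one line per species
p_global <- ggplot(iris,
                   aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped only inside geom_point(): not inherited by geom_smooth(),
# so a single line is fitted to all observations
p_local <- ggplot(iris,
                  aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)
```

Comparing the two plots side by side shows three regression lines in the first and one in the second, even though the points are colored by species in both.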
Geom | Description |
---|---|
geom_point() | Scatter plot for visualizing relationships between two numerical variables. |
geom_line() | Line plot for trends over time or continuous sequences. |
geom_histogram() | Histogram for visualizing the distribution of a single numerical variable. |
geom_dotplot() | Dot plot in which each dot represents one observation in a distribution. |
geom_boxplot() | Box plot for showing distributions and detecting outliers. |
\(\star\) Be careful when defining variables in aes(). For example, geom_histogram() only requires an x-axis variable, as it plots the distribution of a single numerical variable.
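As a rough illustration of how the required aesthetics differ between geoms, the sketch below builds a histogram with only x mapped and a box plot with both x and y mapped. It uses the built-in iris data; the names p_hist and p_box are arbitrary.

```r
library(ggplot2)

# One numerical variable: only x is mapped
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)

# One numerical variable compared across categories: x and y are mapped
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
```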
iris Data Set Histograms
Plotting the distribution of iris sepal lengths by species
# establish data and variables
ggplot(iris_tibble,
       aes(x = Sepal.Length,
           fill = Species)) +
  # draw histogram
  geom_histogram(bins = 10) +
  # add theme layer
  theme_grey()
\(\star\) The geom_histogram() function allows you to adjust the number of bins, affecting how the data is visualized. We will discuss this later.
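The effect of the bin setting can be explored by drawing the same histogram with different values. This is a sketch on the built-in iris data; the particular values 5, 30, and 0.25 are arbitrary choices for illustration.

```r
library(ggplot2)

# Shared data and mappings, reused for each histogram
hist_base <- ggplot(iris, aes(x = Sepal.Length, fill = Species))

p_few   <- hist_base + geom_histogram(bins = 5)          # few wide bins
p_many  <- hist_base + geom_histogram(bins = 30)         # many narrow bins
p_width <- hist_base + geom_histogram(binwidth = 0.25)   # set the bin width instead
```

Too few bins can hide features of the distribution, while too many can make the plot look noisy. The bins and binwidth arguments are two ways of controlling the same thing.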
Variables
Visualization Techniques
Scatterplots are useful for visualizing the relationship between two numerical variables.
Example: Sepal.Length and Petal.Length in the iris data set appear to have a positive association. As Sepal.Length increases, Petal.Length also tends to increase.
Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.
How would you describe the distribution of GPAs in this data set?
The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.
The mean GPA is 3.59.
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
\(\star\) Note: To determine modality, step back and imagine a smooth curve over the histogram. Imagine that the bars are wooden blocks and you drop a limp strand of spaghetti over them; the shape the spaghetti takes can be viewed as a smooth curve.
Is the histogram right skewed, left skewed, or symmetric?
\(\star\) Note: Histograms are said to be skewed to the side of the long tail.
The measures of central tendency describe the central or typical value of a dataset, summarizing its distribution. The common measures of central tendency are the mean, median, and mode. Their typical ordering depends on the shape of the distribution:
Right skewed: Mode \(<\) Median \(<\) Mean
Left skewed: Mean \(<\) Median \(<\) Mode
Symmetric: Mean \(=\) Median \(=\) Mode
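For a sample with a long right tail, the three measures typically fall in the order Mode < Median < Mean. The sketch below checks this on a small hypothetical sample that was made up for illustration. Base R has no built-in function for the mode of a data vector, so the most frequent value is read off a frequency table.

```r
# Hypothetical right-skewed sample (made up for illustration)
x <- c(1, 2, 2, 3, 4, 10, 20)

x_mean   <- mean(x)     # 42 / 7 = 6
x_median <- median(x)   # middle value of the sorted data = 3

# Most frequent value, taken from the frequency table
x_mode <- as.numeric(names(which.max(table(x))))   # 2

c(mode = x_mode, median = x_median, mean = x_mean)
```

The two large values pull the mean upward, while the median and mode stay near the bulk of the data.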
The measures of dispersion describe the variability of numerical data. They are a way to quantify the uncertainty of a distribution. The following are the common measures of dispersion:
Common variance interpretations:
Example: Order the following distributions from low to high variance.
The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.
Variance measures spread in squared units, while the standard deviation is expressed in the original units of the data, which makes it the more practical measure of dispersion.
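The relationship between the two measures can be verified directly in R. This is a minimal sketch on a hypothetical sample made up for illustration; var() and sd() compute the sample versions, which divide by n - 1.

```r
# Hypothetical sample (made up for illustration)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

x_var <- var(x)   # sample variance, in squared units of x
x_sd  <- sd(x)    # sample standard deviation, in the same units as x

# The standard deviation is the square root of the variance
all.equal(x_sd, sqrt(x_var))
```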
Example: Suppose we analyze the annual salaries of employees in two different companies:
Company | \(\overline{x}\) (thousand $) | \(s^2\) (thousand $, squared) | \(s\) (thousand $) |
---|---|---|---|
A | 55 | 62.5 | 7.906 |
B | 130 | 2250 | 47.434 |
\(\star\) The standard deviation is in the same units as the data, so it is directly interpretable as the typical deviation from the mean and helps compare the spread of distributions measured in the same units.
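The standard deviations in the salary table follow from the variances by taking square roots. Assuming the variances are expressed in squared thousands of dollars, with the subscripts A and B introduced here to label the two companies:

```latex
\[ s_A = \sqrt{62.5} \approx 7.906 \qquad s_B = \sqrt{2250} \approx 47.434 \]
```

Company B's salaries typically deviate from their mean by about six times as much as Company A's.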
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
Box Plot of Study Hours
Histogram of Study Hours
\(\star\) Key Idea: Box plots and histograms visualize numerical data, with histograms showing distribution shape, while box plots summarize spread and outliers, making them better for comparisons.
Main Parts:
Whiskers of a box plot can extend up to \(1.5 \times \text{IQR}\) away from the quartiles. The \(1.5 \times \text{IQR}\) cutoff is an arbitrary convention, but it is the academic standard and the default when plotting box plots.
Example: Suppose that \(Q_1 = 10\) and \(Q_3 = 20\).
\[\text{IQR} = 20 - 10 = 10\]
\[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\]
\[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]
\(\star\) A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
Example: The box plot and stacked dot plot of the data set \(7,1,2,4,6,3,2,7\) are shown below.
Box Plot
Stacked Dot Plot
\(\star\) A dot plot is used instead of a histogram because the data set is small and takes only a few discrete values. The box plot shows no potential outliers, as all data points fall within the maximum reach of the whiskers.
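The outlier check for this data set can be reproduced numerically. The sketch below assumes quartiles are computed with R's default quantile() method; other software, and hand calculations, may give slightly different quartiles, though the conclusion here is the same.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Quartiles and IQR (R's default quantile method)
q1  <- unname(quantile(x, 0.25))
q3  <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Maximum reach of the whiskers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations beyond the fences are potential outliers
outliers <- x[x < lower_fence | x > upper_fence]
outliers
```

The vector of outliers is empty, in agreement with the box plot.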
How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
data | \(\overline{x}\) | \(s\) | Median | IQR |
---|---|---|---|---|
original | 244,778.6 | 225,903.8 | 190,000 | 200,000 |
replace largest with $10M | 566,207.1 | 1,830,200.6 | 190,000 | 200,000 |
replace smallest with $10M | 316,121.4 | 854,489.2 | 200,000 | 200,000 |
\(\star\) The table shows that moving a single value to an extreme substantially changes the mean, while the median changes little or not at all, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation is strongly affected, while the IQR remains the same.
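The same behavior can be seen on a small example. The incomes below are hypothetical values made up for illustration, not the household income data summarized in the table, and summarize_sample() is a helper defined only for this sketch.

```r
# Hypothetical incomes in thousands of dollars (made up for illustration)
income <- c(50, 80, 120, 190, 250, 300, 400)

# Replace the largest value with $10 million (10,000 thousand)
income_extreme <- replace(income, which.max(income), 10000)

# Helper that collects the four sample statistics
summarize_sample <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), iqr = IQR(v))
}

rbind(original = summarize_sample(income),
      extreme  = summarize_sample(income_extreme))
```

The mean and SD jump after the replacement, while the median and IQR are unchanged because they depend only on the middle of the sorted data.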
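Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to use the median and IQR to describe center and spread, while for symmetric distributions the mean and SD are often more helpful.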
Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.