Basics of Visualizations & Exploring Numerical Data

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

January 31, 2025

Objectives

Previously… (1/2)

Types of Variables


Previously… (2/2)

Chaining dplyr Verbs Using %>%

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble %>% 
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") %>% 
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) %>% 
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)

tidyverse Core Packages for Data Visualizations

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data visualizations:

  • ggplot2 is a system for creating graphics, where you provide the data, specify how to map variables to plots, and it handles the details.

Data Visualization Using ggplot2

What is ggplot2?

Why use ggplot2?

The Grammar of Graphics

What is the Grammar of Graphics?

Key Components of ggplot2

Example: iris Data Set Scatter Plots

Plotting iris lengths

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Layered Approach

\(\star\) Note that the + operator here is used to “add” a layer to the plot, not to add numbers.

Aesthetics

The aes() function maps data variables to visual properties like position, color, size, and shape.

Plotting iris lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Common Aesthetic Mappings

\(\star\) Note that the aes() function is called within the ggplot() function as the second argument.
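As a small illustration of the note above, the sketch below names the second argument of ggplot() explicitly (`mapping`) and contrasts mapping a variable inside aes() with setting a fixed value outside it. It uses the built-in iris data frame directly so that it runs on its own; the color "steelblue" and the size 2 are arbitrary choices for illustration.

```r
library(ggplot2)

# Mapping: Species is a variable in the data, so it goes inside aes()
p_mapped <- ggplot(data = iris,
                   mapping = aes(x = Sepal.Length,
                                 y = Petal.Length,
                                 color = Species)) +
  geom_point()

# Setting: a constant applies to every point, so it goes outside aes()
# ("steelblue" and size = 2 are arbitrary illustrative values)
p_fixed <- ggplot(data = iris,
                  mapping = aes(x = Sepal.Length,
                                y = Petal.Length)) +
  geom_point(color = "steelblue", size = 2)
```

Printing p_mapped should show one color per species with a legend, while p_fixed should show all points in a single color and no legend.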

Layering

The + operator adds layers to a plot. Each layer either customizes the plot or adds more information to it.

Plotting iris lengths by species with regression lines

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add regression lines
  geom_smooth(method = 'lm',
              formula = y~x) +
  # add theme layer
  theme_grey()

Layering to add more information

\(\star\) Key Idea: Every subsequent layer inherits the aes() mappings defined in the ggplot() function, unless that layer overrides them.
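One way to see this inheritance is to compare a plot where color is mapped in ggplot() with one where it is mapped only inside geom_point(). The sketch below uses the built-in iris data frame directly so that it is self-contained; the object names p_inherited and p_local are made up for this illustration.

```r
library(ggplot2)

# color = Species is mapped in ggplot(), so BOTH layers inherit it:
# points are colored by species and one line is fit per species
p_inherited <- ggplot(iris,
                      aes(x = Sepal.Length,
                          y = Petal.Length,
                          color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color = Species is mapped only in geom_point(), so geom_smooth()
# does not see it and fits a single line through all points
p_local <- ggplot(iris,
                  aes(x = Sepal.Length,
                      y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)
```

Where a mapping is placed therefore controls which layers it affects.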

Common Geometries for Numerical Variables

Geom Function      Description
geom_point()       Scatter plot for visualizing relationships between two numerical variables.
geom_line()        Line plot for trends over time or continuous sequences.
geom_histogram()   Histogram for visualizing the distribution of a single numerical variable.
geom_dotplot()     Dot plot in which each dot represents one observation in a distribution.
geom_boxplot()     Box plot for showing distributions and detecting outliers.

\(\star\) Be careful when defining variables in aes(). For example, geom_histogram() only requires an x-axis variable, as it plots the distribution of a single numerical variable.
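To make the point about aes() concrete, here is a minimal sketch of three geometries and the variables each one expects. It uses the built-in iris data frame directly; the object names are hypothetical.

```r
library(ggplot2)

# One numerical variable: a histogram needs only x
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)

# Two numerical variables: a scatter plot needs both x and y
p_scatter <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point()

# One numerical variable split by a categorical one:
# a box plot per species, with the numerical variable on y
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()
```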

Example: iris Data Set Histograms

Plotting the distribution of iris sepal lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           fill = Species)) +
  # draw histogram
  geom_histogram(bins=10) + 
  # add theme layer
  theme_grey()

\(\star\) The geom_histogram() function allows you to adjust the number of bins, affecting how the data is visualized. We will discuss this later.

Basics of Exploring Numerical Data

Variables

Visualization Techniques

Scatterplots

Scatterplots are useful for visualizing the relationship between two numerical variables.

Example:

Dot Plots

Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.

How would you describe the distribution of GPAs in this data set?

Dot Plots and the Mean

The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.

The mean GPA is 3.59.

Histograms

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.

Bin Width of Histograms

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
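The histograms on this slide are not reproduced here, but the same question can be explored with any numerical variable. The sketch below uses iris sepal lengths as a stand-in (an assumption, not the slide's data) and three arbitrary bin widths.

```r
library(ggplot2)

p_base <- ggplot(iris, aes(x = Sepal.Length))

# Very narrow bins: many short bars, mostly showing noise
p_narrow <- p_base + geom_histogram(binwidth = 0.05)

# Moderate bins: the overall shape becomes visible
p_medium <- p_base + geom_histogram(binwidth = 0.25)

# Very wide bins: only a few bars, so most of the shape is hidden
p_wide <- p_base + geom_histogram(binwidth = 2)
```

Comparing the three plots side by side is one way to judge which bin width reveals too much detail and which hides too much.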

Distribution Shapes: Modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

\(\star\) Note: In order to determine modality, step back and imagine a smooth curve over the histogram – imagine that the bars are wooden blocks and you drop a limp spaghetti over them, the shape the spaghetti would take could be viewed as a smooth curve.

Distribution Shapes: Skewness

Is the histogram right skewed, left skewed, or symmetric?

\(\star\) Note: Histograms are said to be skewed to the side of the long tail.

Commonly Observed Distribution Shapes

Measures of Central Tendency

The measures of central tendency describe the central or typical value of a dataset, summarizing its distribution. The following are the common measures of central tendency:
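Assuming the measures intended here are the mean, median, and mode (the three compared on the slides that follow), a minimal base R sketch looks like this. The data are the eight values used in the box plot example later in these slides. Base R has no built-in function for the statistical mode, so find_modes() is a hypothetical helper written for this illustration.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

x_mean <- mean(x)      # sum of values divided by their count: 32 / 8 = 4
x_median <- median(x)  # middle of the sorted values: (3 + 4) / 2 = 3.5

# Hypothetical helper: return every value that occurs most often
find_modes <- function(values) {
  counts <- table(values)
  as.numeric(names(counts)[counts == max(counts)])
}

x_modes <- find_modes(x)  # 2 and 7 each occur twice
```

Because two values tie for the highest count, this small data set has two modes.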

Skewness and Measures of Central Tendency (1/3)

Right skewed: Mode \(<\) Median \(<\) Mean

Skewness and Measures of Central Tendency (2/3)

Left skewed: Mean \(<\) Median \(<\) Mode

Skewness and Measures of Central Tendency (3/3)

Symmetric: Mean \(=\) Median \(=\) Mode
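The three orderings above can be checked on small made-up data sets. The vectors below are hypothetical and chosen so that each pattern appears cleanly. These orderings are typical of right-skewed, left-skewed, and symmetric unimodal data, but they are not guaranteed for every data set. The helper find_mode() is also hypothetical, since base R has no built-in mode function.

```r
# Hypothetical helper: most frequent value (first one if there is a tie)
find_mode <- function(values) {
  counts <- table(values)
  as.numeric(names(counts)[which.max(counts)])
}

right_skewed <- c(1, 2, 2, 3, 4, 10)    # long tail to the right
left_skewed <- c(1, 7, 8, 9, 9, 10)     # long tail to the left
symmetric <- c(1, 2, 3, 3, 3, 4, 5)     # mirror image around 3

summarize_center <- function(values) {
  c(mode = find_mode(values),
    median = median(values),
    mean = mean(values))
}

right_center <- summarize_center(right_skewed)  # mode 2 < median 2.5 < mean 3.67
left_center <- summarize_center(left_skewed)    # mean 7.33 < median 8.5 < mode 9
sym_center <- summarize_center(symmetric)       # all three equal 3
```

In each skewed case the mean is pulled farthest toward the long tail.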

Measures of Dispersion (Spread)

Measures of dispersion describe the variability of numerical data. They quantify how spread out a distribution is. The following are the common measures of dispersion:
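Assuming the measures intended here include the range, variance, standard deviation, and IQR (the latter three appear on later slides), a base R sketch is given below. The data are the eight values from the box plot example later in these slides.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

x_range <- range(x)  # smallest and largest values: 1 and 7
x_var <- var(x)      # sample variance: sum of squared deviations / (n - 1) = 40 / 7
x_sd <- sd(x)        # sample standard deviation: square root of the variance
x_iqr <- IQR(x)      # Q3 - Q1 using R's default quartile method: 6.25 - 2 = 4.25
```

R's quantile-based functions support several quartile conventions, so a hand calculation that uses a different rule may give a slightly different IQR.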

Making Sense of the Variance

Common variance interpretations:

Example: Order the following distributions from low to high variance.

The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.

Making Sense of the Standard Deviation

Variance measures spread in squared units, while standard deviation, its square root, is expressed in the original units of the data and is therefore easier to interpret.

Example: Suppose we analyze the annual salaries of employees in two different companies:

Company   \(\overline{x}\)   \(s^2\) (in squared $K)   \(s\)
A         $55K               62.5                      $7.906K
B         $130K              2250                      $47.434K

\(\star\) Standard deviation is directly interpretable as the typical deviation from the mean, which makes it useful for comparing the spread of distributions measured in the same units.
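The standard deviations in the salary table can be recovered from the variances by taking square roots. The sketch below treats the variances as being in squared thousands of dollars, which is an assumption about the table's units; the column names are made up for this illustration.

```r
salaries <- data.frame(
  company = c("A", "B"),
  mean_k = c(55, 130),          # mean salary in $K
  variance_k2 = c(62.5, 2250)   # variance in squared $K (assumed units)
)

# Standard deviation is the square root of the variance, back in $K
salaries$sd_k <- sqrt(salaries$variance_k2)

round(salaries$sd_k, 3)  # 7.906 and 47.434
```

Company B's salaries typically deviate from their mean by about $47K, roughly six times the typical deviation at Company A.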

Box Plots

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

Box Plot of Study Hours

Histogram of Study Hours

\(\star\) Key Idea: Box plots and histograms both visualize numerical data. Histograms show the shape of a distribution, while box plots summarize its center, spread, and potential outliers, which makes box plots better suited to comparing groups.

Anatomy of Box Plots

Main Parts:

Computing the Whiskers

Whiskers of a box plot can extend up to \(1.5 \times \text{IQR}\) beyond the quartiles: each whisker ends at the most extreme observation within that reach. The multiplier 1.5 is a convention rather than a derived value, but it is the widely used standard and the default in most plotting software.

Example: Suppose that \(Q_1 = 10\) and \(Q_3 = 20\).

\[\text{IQR} = 20 - 10 = 10\]
\[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\]
\[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]

\(\star\) A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
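The whisker arithmetic above can be written as a few lines of base R. The quartiles are the ones from the example; the vector of observations is hypothetical and is only there to show how values are compared against the maximum whisker reach.

```r
q1 <- 10
q3 <- 20

iqr <- q3 - q1                  # 10
upper_reach <- q3 + 1.5 * iqr   # 35
lower_reach <- q1 - 1.5 * iqr   # -5

# Hypothetical observations to check against the whisker reach
observations <- c(-8, 3, 12, 15, 19, 34, 36)

# Potential outliers lie beyond the maximum reach on either side
potential_outliers <- observations[observations < lower_reach |
                                     observations > upper_reach]
potential_outliers  # -8 and 36
```

Note that 34 is not flagged: it is extreme but still within the upper reach of 35.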

More Box Plot and Dot Plot Examples

Example: The box plot and stacked dot plot of the data set \(7,1,2,4,6,3,2,7\) are shown below.

Box Plot

Stacked Dot Plot

\(\star\) A dot plot is used instead of a histogram because the data set is small and takes only a few discrete values. The box plot shows no potential outliers, as all data points fall within the whiskers.
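The claim that this data set has no potential outliers can be checked numerically. The sketch below uses R's default quartile method in quantile(); other quartile conventions give slightly different quartiles, but for these eight values the conclusion should be the same.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Quartiles with R's default method: Q1 = 2, median = 3.5, Q3 = 6.25
quartiles <- unname(quantile(x, probs = c(0.25, 0.5, 0.75)))
q1 <- quartiles[1]
q3 <- quartiles[3]
iqr <- q3 - q1                  # 4.25

lower_reach <- q1 - 1.5 * iqr   # -4.375
upper_reach <- q3 + 1.5 * iqr   # 12.625

# No value falls outside the reach, so nothing is flagged
potential_outliers <- x[x < lower_reach | x > upper_reach]
length(potential_outliers)      # 0
```

Since the smallest value is 1 and the largest is 7, the whiskers simply end at the minimum and maximum of the data.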

Why Are Outliers Important?

Case Study 1

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

Case Study 1: Extreme Observations

Data                         \(\overline{x}\)   \(s\)         Median    IQR
original                     244,778.6          225,903.8     190,000   200,000
replace largest with $10M    566,207.1          1,830,200.6   190,000   200,000
replace smallest with $10M   316,121.4          854,489.2     200,000   200,000

\(\star\) The table shows that replacing a single value with an extreme one changes the mean substantially, while the median changes little or not at all, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation changes substantially, while the IQR remains the same.
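The household income data behind the table are not included in these slides, so the sketch below repeats the experiment on a small hypothetical data set (incomes in thousands of dollars). The helper describe() and all values are made up for this illustration.

```r
# Hypothetical household incomes in $K (not the case study data)
income <- c(30, 45, 50, 60, 75, 90, 120)

# Hypothetical helper: the four statistics from the table
describe <- function(values) {
  c(mean = mean(values),
    sd = sd(values),
    median = median(values),
    iqr = IQR(values))
}

# Replace the largest, then the smallest, value with $10 million (10000 in $K)
largest_replaced <- replace(income, which.max(income), 10000)
smallest_replaced <- replace(income, which.min(income), 10000)

results <- rbind(
  original = describe(income),
  largest_replaced = describe(largest_replaced),
  smallest_replaced = describe(smallest_replaced)
)

round(results, 1)
```

In this toy data set, replacing the largest value leaves the median and IQR untouched while the mean and SD grow enormously. Replacing the smallest value moves an observation from one end of the data to the other, so the median and IQR can shift somewhat, but far less than the mean and SD do.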

Robust Statistics

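Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, the median and IQR are often more helpful for describing skewed distributions or data with extreme observations, while the mean and SD are often more helpful for symmetric distributions.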

Activity: Visualize Numerical Data

  1. Log in to Posit Cloud and open the RStudio assignment F 1/31 - Visualize Numerical Data.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/
Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman and Hall/CRC. https://probstatsdata.com/