Previously… (2/2)

Chaining dplyr Verbs Using %>%

Load Packages

library(tidyverse)

Define Data Frame as a Tibble

iris_tibble <- as_tibble(iris)

Advanced Example: The goal of this example is to transform the iris dataset by computing the ratio of Petal.Length to Sepal.Length for observations belonging to the “setosa” species.

iris_tibble %>% 
  # rule 1: choose only the "setosa" species
  filter(Species == "setosa") %>% 
  # rule 2: pick the columns Sepal.Length and Petal.Length
  select(Sepal.Length,Petal.Length) %>% 
  # rule 3: create a new column called length_ratio
  mutate(length_ratio = Petal.Length/Sepal.Length)
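To see what the pipe is doing, it can help to compare the chained version with the same computation written as nested function calls. The sketch below assumes only that dplyr is installed; the object names piped and nested are illustrative and not part of the original example.

```r
library(dplyr)

iris_tibble <- as_tibble(iris)

# piped version: read top to bottom, one rule per step
piped <- iris_tibble %>%
  filter(Species == "setosa") %>%
  select(Sepal.Length, Petal.Length) %>%
  mutate(length_ratio = Petal.Length / Sepal.Length)

# nested version of the same steps: must be read from the inside out
nested <- mutate(
  select(
    filter(iris_tibble, Species == "setosa"),
    Sepal.Length, Petal.Length
  ),
  length_ratio = Petal.Length / Sepal.Length
)

# both approaches should produce the same tibble
all.equal(piped, nested)
```

Both versions apply the same three rules; the pipe simply passes the result of each step as the first argument of the next.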

`tidyverse` Core Packages for Data Visualizations

tidyverse is a collection of packages suited for data processing and visualization.

Core packages specifically for data visualizations:

ggplot2 is a system for creating graphics: you provide the data and specify how to map variables to aesthetics, and it handles the details.

Data Visualization Using `ggplot2`

What is ggplot2?

Overview:
- ggplot2 is a powerful R package designed for data visualizations.
- It is part of the tidyverse ecosystem.
Key Features:
- Intuitive grammar for graphics.
- Intuitive syntax with layering using the operator +.

Why use ggplot2?

Ease of Use: Clear, human-readable code.
Efficiency: Built-in functions optimized for performance.
Consistency: Works seamlessly with other tidyverse packages such as dplyr for data wrangling.
Data Frames and Beyond: Works with data frames and tibbles; data from other sources, such as databases, must first be brought into one of these forms.

The Grammar of Graphics

What is the Grammar of Graphics?

The Grammar of Graphics is a systematic approach to building visualizations by breaking down graphs into fundamental components.
The ggplot2 package in R is based on this framework, allowing for a highly customizable and layered approach to data visualization.

Key Components of `ggplot2`

Data: The dataset being visualized, supplied as a data frame or tibble.
Aesthetics (aes): The mapping of data variables to visual properties like position, color, size, and shape.
Geometries (geom): The type of plot (e.g., points, lines, bars) that represents the data.
Facets: Splitting data into multiple panels for comparison.

Statistics (stat): Computations applied to the data before plotting (e.g., smoothing, binning).
Coordinates (coord): The system defining how data is mapped onto the plot (e.g., Cartesian, polar).
Themes: Controls the overall appearance of the plot, such as background color, grid lines, and fonts.

Example: `iris` Data Set Scatter Plots

Plotting iris lengths

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Layered Approach

Base layer: ggplot(data, aes(...)) defines the dataset and variables.
Geometric layers: + geom_*() specifies the type of plot.
Other layers: + facet_*(), + coord_*(), + theme_*() enhance the visualization.

$\star$ Note that the + operator here adds a layer to the plot; it does not add numbers.
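One way to make the layers concrete is to store the plot in an object and add one layer at a time. This is a minimal sketch; the object names p, p_points, and p_full, and the choice of facet_wrap() and theme_minimal(), are illustrative assumptions rather than part of the original slides.

```r
library(ggplot2)

# base layer: data and aesthetic mappings only (draws an empty panel)
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))

# geometric layer: add points on top of the base layer
p_points <- p + geom_point()

# other layers: one panel per species, plus a different theme
p_full <- p_points +
  facet_wrap(~ Species) +
  theme_minimal()

# printing the object draws the plot
p_full
```

Printing p, p_points, and p_full in turn shows how each + changes the picture.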

Aesthetics

The aes() function maps data variables to visual properties like position, color, size, and shape.

Plotting iris lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add theme layer
  theme_grey()

Common Aesthetics Mappings

x and y: Map variables to the horizontal and vertical axes.
color: Assign colors to different categories.
size: Control the size of points or lines based on a variable.
shape: Change the shape of points according to a categorical variable.
fill: Fill color of geometric objects of different categories.

$\star$ Note that the aes() function is called within the ggplot() function as the second argument.
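The sketch below illustrates two of the mappings listed above, shape and size, and contrasts them with setting a fixed value outside aes(). The variable choices, the alpha value, and the color "steelblue" are assumptions made only for illustration.

```r
library(ggplot2)

# mapping: shape and size are tied to variables, so they go inside aes()
p_mapped <- ggplot(iris,
                   aes(x = Sepal.Length,
                       y = Petal.Length,
                       shape = Species,       # categorical variable -> point shape
                       size = Petal.Width)) + # numerical variable -> point size
  geom_point(alpha = 0.6)

# setting: one fixed color for every point goes outside aes()
p_fixed <- ggplot(iris,
                  aes(x = Sepal.Length,
                      y = Petal.Length)) +
  geom_point(color = "steelblue")

p_mapped
p_fixed
```

A value inside aes() varies with the data and produces a legend; a value outside aes() is applied uniformly to the whole layer.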

Layering

The + operator adds layers to a plot. Each layer either customizes the plot or adds more information to it.

Plotting iris lengths by species with regression lines

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           y = Petal.Length,
           color = Species)) +
  # draw scatter plot
  geom_point() + 
  # add regression lines
  geom_smooth(method = 'lm',
              formula = y~x) +
  # add theme layer
  theme_grey()

Layering to add more information

This example uses the function geom_smooth().
With method = 'lm', geom_smooth() fits a linear regression line to each group and adds the lines as a layer on the plot. We will discuss this in more detail later.

$\star$ Key Idea: All subsequent layers inherit the aesthetic mappings defined in aes() inside the ggplot() function.
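Inheritance is easiest to see by moving a mapping out of ggplot() and into a single layer. In this sketch (object names p_inherited and p_local are illustrative), the first plot maps color in ggplot(), so both layers see it; the second maps color only inside geom_point(), so geom_smooth() does not.

```r
library(ggplot2)

# color mapped in ggplot(): inherited by both layers, one line per species
p_inherited <- ggplot(iris,
                      aes(x = Sepal.Length,
                          y = Petal.Length,
                          color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped only in geom_point(): geom_smooth() fits a single line
p_local <- ggplot(iris,
                  aes(x = Sepal.Length,
                      y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

p_inherited
p_local
```

Where a mapping is placed therefore controls which layers are grouped by that variable.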

Common Geometries for Numerical Variables

Geom	Description
`geom_point()`	Scatter plot for visualizing relationships between two numerical variables.
`geom_line()`	Line plot for trends over time or continuous sequences.
`geom_histogram()`	Histogram for visualizing the distribution of a single numerical variable.
`geom_dotplot()`	Dot plot in which each dot represents one observation in a distribution.
`geom_boxplot()`	Box plot for showing distributions and detecting outliers.

$\star$ Be careful when defining variables in aes(). For example, geom_histogram() only requires an x-axis variable, as it plots the distribution of a single numerical variable.
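As a brief sketch of how the required aesthetics differ between geoms, the code below draws a box plot and a dot plot of iris sepal lengths. The binwidth of 0.1 is an arbitrary illustrative choice, and the object names are hypothetical.

```r
library(ggplot2)

# box plot: numerical variable on y, one box per species on x
p_box <- ggplot(iris,
                aes(x = Species,
                    y = Sepal.Length)) +
  geom_boxplot()

# dot plot: only an x variable is needed
p_dot <- ggplot(iris,
                aes(x = Sepal.Length)) +
  geom_dotplot(binwidth = 0.1)

p_box
p_dot
```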

Example: `iris` Data Set Histograms

Plotting the distribution of iris sepal lengths by species

# establish data and variables
ggplot(iris_tibble, 
       aes(x = Sepal.Length, 
           fill = Species)) +
  # draw histogram
  geom_histogram(bins=10) + 
  # add theme layer
  theme_grey()

$\star$ The geom_histogram() function allows you to adjust the number of bins, affecting how the data is visualized. We will discuss this later.

Basics of Exploring Numerical Data

Variables

Types of Data: Continuous vs Discrete
Descriptive Statistics: Measures of center (mean, median, mode) and spread (range, variance, standard deviation, IQR)

Visualization Techniques

Dotplots: Understanding frequency distributions
Histograms: Viewing shapes of distributions
Boxplots: Detecting outliers and spread
Scatterplots: Identifying relationships between variables

Scatterplots

Scatterplots are useful for visualizing the relationship between two numerical variables.

Example:

The variables Sepal.Length and Petal.Length in the iris data set appear to have a positive association: as Sepal.Length increases, Petal.Length tends to increase as well.

Dot Plots

Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.

How would you describe the distribution of GPAs in this data set?

Dot Plots and the Mean

The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.

The mean GPA is 3.59.

Histograms

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.

Histograms are especially convenient for describing the shape of the data distribution.
The chosen bin width can alter the story the histogram is telling.
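To experiment with this, the same variable can be drawn with several bin widths. The sketch below uses iris sepal lengths; the three widths are arbitrary values chosen only to show a very fine, a moderate, and a very coarse histogram.

```r
library(ggplot2)

# shared base layer: one numerical variable on the x-axis
base <- ggplot(iris, aes(x = Sepal.Length))

# very narrow bins: many bars, mostly noise
p_narrow <- base + geom_histogram(binwidth = 0.05)

# moderate bins: the overall shape becomes visible
p_medium <- base + geom_histogram(binwidth = 0.25)

# very wide bins: almost all detail is hidden
p_wide <- base + geom_histogram(binwidth = 2)

p_narrow
p_medium
p_wide
```

geom_histogram() accepts either bins (the number of bins) or binwidth (the width of each bin); this sketch uses binwidth so the widths can be compared directly.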

Bin Width of Histograms

Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?

Distribution Shapes: Modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

$\star$ Note: To determine modality, step back and imagine a smooth curve over the histogram. Picture the bars as wooden blocks with a limp strand of spaghetti dropped over them; the shape the spaghetti takes can be viewed as the smooth curve.

Distribution Shapes: Skewness

Is the histogram right skewed, left skewed, or symmetric?

$\star$ Note: Histograms are said to be skewed to the side of the long tail.

Commonly Observed Distribution Shapes

Measures of Central Tendency

The measures of central tendency describe the central or typical value of a dataset, summarizing its distribution. The following are the common measures of central tendency:

Mean: The arithmetic average of all data points, calculated as the sum of all values divided by the total number of values.
Median: The middle value of an ordered dataset. If there is an even number of observations, the median is the average of the two middle values.
Mode: The most frequently occurring value(s) in a dataset. A dataset may have one mode (unimodal), multiple modes (multimodal), or no mode if all values occur with equal frequency.
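These three measures can be computed in base R. The sketch below uses the small data set 7, 1, 2, 4, 6, 3, 2, 7 that appears later in these slides. Base R has no built-in function for the statistical mode, so the mode is found from a frequency table; the name modes is illustrative.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# mean: sum of all values divided by the number of values
mean(x)    # 4

# median: average of the two middle values, since n = 8 is even
median(x)  # 3.5

# mode: the value(s) with the highest frequency
counts <- table(x)
modes <- as.numeric(names(counts)[counts == max(counts)])
modes      # 2 7
```

Here the values 2 and 7 each occur twice, so this data set has two modes.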

Skewness and Measures of Central Tendency (1/3)

Mode $<$ Median $<$ Mean

If the distribution of data is skewed to the right, the mode is typically less than the median, which is typically less than the mean.

Skewness and Measures of Central Tendency (2/3)

Mean $<$ Median $<$ Mode

If the distribution of data is skewed to the left, the mean is typically less than the median, which is typically less than the mode.

Skewness and Measures of Central Tendency (3/3)

Mean $\approx$ Median $\approx$ Mode

If the distribution of data is symmetric and unimodal, the mean is roughly equal to the median, which is roughly equal to the mode.
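The ordering for a right-skewed distribution can be checked on a small example. The vector below is made up purely for illustration (it is not data from the course); its long tail is on the right.

```r
# hypothetical right-skewed data: most values are small, a few are large
right_skewed <- c(1, 2, 2, 2, 3, 3, 4, 5, 8, 12)

# mode: most frequent value, found from a frequency table
counts <- table(right_skewed)
skew_mode <- as.numeric(names(counts)[counts == max(counts)])

skew_mode             # 2
median(right_skewed)  # 3
mean(right_skewed)    # 4.2
```

The large values in the right tail pull the mean above the median, while the mode stays at the peak of the distribution.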

Measures of Dispersion (Spread)

The measures of dispersion describe the variability of numerical data. They are a way to quantify the spread, and hence the uncertainty, of a distribution. The following are the common measures of dispersion:

Variance: The average squared deviation from the mean (for a sample, the sum of squared deviations divided by $n - 1$).
Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.
Range: The difference between the maximum and minimum values.
Interquartile Range (IQR): The difference between the 75th and 25th percentiles, capturing the spread of the middle 50% of the data.
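Each of these measures has a base R function. The sketch below again uses the data set 7, 1, 2, 4, 6, 3, 2, 7 from later in these slides. Note that var() and sd() use the sample formulas (dividing by n - 1), and that IQR() relies on R's default quantile method, so other software may report slightly different quartiles.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# sample variance: sum of squared deviations divided by (n - 1)
var(x)
sum((x - mean(x))^2) / (length(x) - 1)  # same value, computed by hand

# standard deviation: square root of the variance
sd(x)

# range: maximum minus minimum
diff(range(x))  # 6

# interquartile range: 75th percentile minus 25th percentile
IQR(x)          # 4.25
```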

Making Sense of the Variance

Common variance interpretations:

Zero Variance: All values are the same.
Low Variance: Data points are close to the mean (consistent data).
High Variance: Data points are spread out (greater variability).

Example: Order the following distributions from low to high variance.

The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.

Making Sense of the Standard Deviation

Variance is expressed in squared units, which makes it hard to interpret directly; the standard deviation is in the original units of the data, so it gives a more practical sense of dispersion.

Example: Suppose we analyze the annual salaries of employees in two different companies:

Company	$\overline{x}$ (thousand USD)	$s^2$ (thousand USD, squared)	$s$ (thousand USD)
A	55	62.5	7.906
B	130	2250	47.434

Company B has higher salaries on average, and its standard deviation is about six times larger than Company A's, suggesting much greater salary inequality.

$\star$ Because the standard deviation is in the same units as the data, it is directly interpretable as the typical deviation from the mean and can be used to compare the spread of distributions measured in the same units.
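The standard deviations in the table follow directly from the variances. As a quick check, taking the variances as given in the table (in thousands of dollars, squared):

```r
# variances from the salary table, in (thousand dollars)^2
variance <- c(A = 62.5, B = 2250)

# standard deviation is the square root of the variance
std_dev <- sqrt(variance)
round(std_dev, 3)  # A: 7.906, B: 47.434

# how many times larger is Company B's spread?
sd_ratio <- unname(std_dev["B"] / std_dev["A"])
sd_ratio           # 6
```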

Box Plots

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

Box Plot of Study Hours

Histogram of Study Hours

$\star$ Key Idea: Box plots and histograms both visualize numerical data. Histograms show the shape of a distribution, while box plots summarize spread and outliers, which makes box plots better for comparisons.

Anatomy of Box Plots

Main Parts:

Box
- Defined by the quartiles $Q_1$, $Q_2$ (median), and $Q_3$.
- The IQR defines the length of the box.
Whiskers
- Lower whisker extends to the smallest observation that is no less than $Q_1 - 1.5 \times \text{IQR}$.
- Upper whisker extends to the largest observation that is no greater than $Q_3 + 1.5 \times \text{IQR}$.
Outliers
- Data points that are placed beyond the whiskers.

Computing the Whiskers

Whiskers of a box plot can extend up to $1.5 \times \text{IQR}$ away from the quartiles. The multiplier 1.5 is a convention rather than a derived value; it is the accepted academic standard and the default when plotting box plots.

Example: Suppose that $Q_1 = 10$ and $Q_3 = 20$.

\[\text{IQR} = 20 - 10 = 10\]
\[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\]
\[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]

$\star$ A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
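The same arithmetic can be written in R and used to flag potential outliers. The quartiles come from the example above; the vector obs is made up solely to illustrate the rule.

```r
# quartiles from the example
q1 <- 10
q3 <- 20
iqr <- q3 - q1

# maximum reach of the whiskers
lower_reach <- q1 - 1.5 * iqr  # -5
upper_reach <- q3 + 1.5 * iqr  # 35

# hypothetical observations, used only to illustrate the rule
obs <- c(-8, 3, 12, 18, 34, 36)

# potential outliers lie beyond the maximum reach of the whiskers
potential_outliers <- obs[obs < lower_reach | obs > upper_reach]
potential_outliers             # -8 36
```

The value 34 is not flagged because it lies inside the upper reach of 35.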

More Boxplot and Dot Plot Examples

Example: The box plot and stacked dot plot of the data set $7,1,2,4,6,3,2,7$ are shown below.

Box Plot

Stacked Dot Plot

$\star$ A dot plot is used instead of a histogram because the data set is small and takes only a few distinct values. The box plot shows no potential outliers, as all data points fall within the whiskers.
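The absence of potential outliers can be confirmed numerically. This sketch assumes R's default quantile method for the quartiles; other methods may give slightly different quartiles, but they should lead to the same conclusion for this data set.

```r
x <- c(7, 1, 2, 4, 6, 3, 2, 7)

# first and third quartiles (R's default quantile method)
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# maximum reach of the whiskers
lower_reach <- q[1] - 1.5 * iqr
upper_reach <- q[2] + 1.5 * iqr

# is any observation beyond the maximum reach?
any(x < lower_reach | x > upper_reach)  # FALSE
```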

Why Are Outliers Important?

Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.

Case Study 1

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

Case Study 1: Extreme Observations

data	$\overline{x}$	$s$	Median	IQR
original	244,778.6	225,903.8	190,000	200,000
replace largest with $10M	566,207.1	1,830,200.6	190,000	200,000
replace smallest with $10M	316,121.4	854,489.2	200,000	200,000

$\star$ The table shows that replacing a single value with an extreme one changes the mean substantially, while the median changes only slightly or not at all, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation increases sharply, while the IQR remains the same.

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
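The robustness of the median and IQR can be reproduced on a small example. The incomes below are invented for illustration and are NOT the household income data from Case Study 1; the helper describe() is also hypothetical.

```r
# hypothetical household incomes (made up; not the case study data)
income <- c(40000, 65000, 90000, 120000, 190000, 250000, 400000)

# replace the largest value with $10 million
extreme <- replace(income, which.max(income), 10e6)

# helper that collects the four statistics for one vector
describe <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), IQR = IQR(v))
}

# one row per version of the data
comparison <- rbind(original = describe(income),
                    extreme = describe(extreme))
comparison
```

In this sketch the mean and SD change substantially once the extreme value is introduced, while the median and IQR stay the same.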

Activity: Visualize Numerical Data

Log in to Posit Cloud and open the RStudio assignment F 1/31 - Visualize Numerical Data.
Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
Change the author in the YAML header.
Read the provided instructions.
Answer all exercise problems on the designated sections.

Basics of Visualizations &
Exploring Numerical Data

Applied Statistics
