Summarizing Numerical Data

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

February 7, 2025

Objectives

These slides are derived from Diez et al. (2012).

Previously… (1/2)

Exploratory Analysis

It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.

Descriptive statistics

It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

For Categorical Variables

Previously… (2/2)

Commonly Observed Distribution Shapes

Measures of Dispersion (Spread)

Measures of dispersion describe the variability of numerical data. They are a way to quantify the spread, or uncertainty, of a distribution. The following are the common measures of dispersion:

Variance

The variance is roughly the average squared deviation from the mean.

The formula for the variance is given by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}\] where \(x_1, x_2, \ldots, x_n\) are the data points, \(\bar{x}\) is the sample mean, and \(n\) is the sample size.

What is the meaning of the variance?

Computing the Variance

Example: What is the variance of the data set \(7,1,2,4,6,3,2,7\)?

\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]

num_data <- c(7,1,2,4,6,3,2,7)
var(num_data)
## [1] 5.714286

So, the variance is \(5.714\).
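As a check on the hand computation, the formula can be applied step by step in R. This is a minimal sketch; the intermediate names (`x_bar`, `sq_dev`, `s2_manual`) are chosen here for illustration and do not come from the slides.

```r
# Data from the example above
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

n <- length(num_data)               # sample size
x_bar <- mean(num_data)             # sample mean
sq_dev <- (num_data - x_bar)^2      # squared deviations from the mean
s2_manual <- sum(sq_dev) / (n - 1)  # divide by n - 1, as in the formula

# The manual result should agree with the built-in var()
all.equal(s2_manual, var(num_data))
```

The sum of the squared deviations is \(40\), so the manual result is \(40/7 \approx 5.714\).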

Variance is Always Non-Negative

Why do we use the squared deviation in the calculation of variance?

\(\star\) Variance is the average of the squared differences between each data point and the mean.
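One reason for squaring can be sketched in R: raw deviations from the mean always sum to zero, because positive and negative deviations cancel, so they cannot measure spread on their own. The variable name `dev` is an illustrative choice.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Raw deviations from the mean (the mean is 4)
dev <- num_data - mean(num_data)

sum(dev)    # positive and negative deviations cancel
sum(dev^2)  # squared deviations are never negative, so they cannot cancel
```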

Making Sense of the Variance

Common variance interpretations:

Example: Order the following distributions from low to high variance.

The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.

Standard Deviation (SD)

The standard deviation (SD) is the square root of the variance, and has the same units as the data.

The formula for the standard deviation is given by \[s = \sqrt{s^2}\] where \(s^2\) is the variance.

What is the meaning of the standard deviation?

Computing the SD

Example: What is the standard deviation of the data set \(7,1,2,4,6,3,2,7\)?

\[ \begin{aligned} \text{mean} \longrightarrow \bar{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]

\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]

\[ \text{standard deviation} \longrightarrow s = \sqrt{5.714} = 2.390 \]

num_data <- c(7,1,2,4,6,3,2,7)
sd(num_data)
## [1] 2.390457

So, the standard deviation is \(2.390\).
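The relationship \(s = \sqrt{s^2}\) can be checked directly in R. This is a small sketch; `s_manual` is a name introduced here for illustration.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Standard deviation as the square root of the variance
s_manual <- sqrt(var(num_data))

# Should agree with the built-in sd()
all.equal(s_manual, sd(num_data))
```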

Making Sense of the Standard Deviation

Variance measures spread in squared units, while the standard deviation is in the same units as the data, which makes it the more practical measure of dispersion to interpret.

Example: Suppose we analyze the annual salaries of employees in two different companies:

Company | Mean \(\overline{x}\) (thousand $) | Variance \(s^2\) (thousand $, squared) | SD \(s\) (thousand $)
--- | --- | --- | ---
A | 55 | 62.5 | 7.906
B | 130 | 2250 | 47.434

\(\star\) Standard deviation is directly interpretable as the typical deviation from the mean, so it helps compare the spread of distributions measured in the same units. Here, salaries at Company B vary far more than salaries at Company A.
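The table values can be checked with a short R sketch. It assumes salaries are recorded in thousands of dollars, so the variances are in squared thousands of dollars; the vector names `s2` and `s` are illustrative.

```r
# Variances from the table, assumed to be in (thousand dollars)^2
s2 <- c(A = 62.5, B = 2250)

# Standard deviations, back in thousand dollars
s <- sqrt(s2)
round(s, 3)

# Ratio of typical deviations: Company B relative to Company A
s["B"] / s["A"]
```

Under these assumptions, the typical deviation from the mean at Company B is 6 times that at Company A.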

Range

The range is the difference between the maximum and minimum of the numerical data.

The formula for the range is given by \[\text{range} = x_{max} - x_{min}\] where \(x_{max}\) is the maximum value and \(x_{min}\) is the minimum value.

Computing the Range

Example: What is the range of the data set \(7,1,2,4,6,3,2,7\)?

\[\text{sorted data} \longrightarrow \color{blue}{\mathbf{1}},2,2,3,4,6,7,\color{blue}{\mathbf{7}}\] \[\begin{aligned} \text{range} & = \color{blue}{\mathbf{7}} - \color{blue}{\mathbf{1}} \\ & = 6 \end{aligned}\]

num_data <- c(7,1,2,4,6,3,2,7)
max(num_data) - min(num_data)
## [1] 6

So, the range is \(6\).
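One point worth noting when working in R: the built-in `range()` function returns the minimum and maximum themselves, not their difference. A brief sketch, with `extremes` and `range_value` as illustrative names:

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# range() returns the two extremes: minimum first, then maximum
extremes <- range(num_data)

# diff() of the extremes gives the range as defined above
range_value <- diff(extremes)
range_value
```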

Percentiles

A percentile is a measure used to indicate the value below which a given percentage of observations fall.

The formula for computing the percentile rank of a value is given by \[\text{percentile of } x = \frac{\text{number of values below } x}{\text{total number of values}} \times 100\] where \(x\) is a value in the data.

\(\star\) Key Idea: The percentile is the value below which a certain percentage of the data lies.

Computing Percentiles

Example: What is the percentile of \(6\) in the data set \(7,1,2,4,6,3,2,7\)?

\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{red}{\mathbf{2}},\color{red}{\mathbf{3}},\color{red}{\mathbf{4}},\color{blue}{\mathbf{6}},7,7\]

\[ \begin{aligned} \text{percentile of } \color{blue}{\mathbf{6}} & = \frac{5}{8} \times 100 \\ & = 62.5 \end{aligned} \]

So, the data value \(6\) is at the \(62.5\)th percentile; that is, 62.5% of the data values lie below \(6\).
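The same count-based formula can be sketched in R. The names `x`, `below`, and `pct` are illustrative and not part of the slides.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)
x <- 6

# Number of data values strictly below x
below <- sum(num_data < x)

# Percentile rank of x, following the formula above
pct <- below / length(num_data) * 100
pct
```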

Computing Percentiles in Reverse

Example: What is the 30th percentile of the data set \(7,1,2,4,6,3,2,7\)? (What is the value in the data below which 30% of the data lies?)

\[ \begin{aligned} 30 & = \frac{\text{number of values below } x}{8} \times 100 \\ 0.30 \times 8 & = \text{number of values below } x \\ 2.40 & = \text{number of values below } x \\ 2 & \longleftarrow \text{rounded to nearest integer} \end{aligned} \]

\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{blue}{\mathbf{2}},3,4,6,7,7\]

\[ \begin{aligned} 30\text{th percentile} & \approx 2 \end{aligned} \]

num_data <- c(7,1,2,4,6,3,2,7)
quantile(num_data,c(0.30))
## 30% 
## 2.1

So, the \(30\)th percentile is approximately \(2\) by hand, while R reports \(2.1\). The difference arises because R interpolates between neighboring sorted values by default, whereas the hand method rounds to an actual data value; with a small dataset the two results can differ.

\(\star\) Note: Whether the \(30\)th percentile is reported as \(2\) or \(2.1\) depends on the method used to compute percentiles. Under R's default method, \(2\) is the \(25\)th percentile and \(2.1\) is the \(30\)th percentile; the hand computation gives a close approximation.
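The dependence on method can be illustrated in R. The sketch below reproduces the hand calculation and compares it with two of the algorithms offered by `quantile()` through its `type` argument; the variable names are illustrative.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)
n <- length(num_data)

# Hand method from the example: number of values below, rounded
k <- round(0.30 * n)              # 2.4 rounds to 2
by_hand <- sort(num_data)[k + 1]  # sorted value with k values before it

# R's default (type = 7) interpolates between neighboring sorted values
q_default <- quantile(num_data, 0.30)

# type = 1 returns an actual data value, with no interpolation
q_type1 <- quantile(num_data, 0.30, type = 1)

c(by_hand = by_hand, default = unname(q_default), type1 = unname(q_type1))
```

For this dataset the hand method and `type = 1` both give \(2\), while the default gives \(2.1\).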

\(\mathbf{Q}_1\), \(\mathbf{Q}_2\), \(\mathbf{Q}_3\), and the IQR

\(\mathbf{Q}_1\) (the 1st quartile), \(\mathbf{Q}_2\) (the 2nd quartile), \(\mathbf{Q}_3\) (the 3rd quartile), and the IQR (interquartile range) are statistical measures used to describe the spread and distribution of a dataset:

\(\star\) Key Idea: The quartiles split the sorted numerical data into four equal parts, each containing about 25% of the observations. About 25% of the observations fall below \(Q_1\), 50% below \(Q_2\), and 75% below \(Q_3\).

Quantiles

Quartiles are a special case of quantiles, which are values that split sorted data into equal parts. Quartiles are the quantiles that split the data into four parts.
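As a brief sketch of the general idea, `quantile()` accepts any set of probabilities through its `probs` argument. The deciles (ten equal parts) below are an added illustration and are not discussed in the slides.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Quartiles: three cut points that split the data into four parts
quartiles <- quantile(num_data, probs = c(0.25, 0.50, 0.75))

# Deciles: nine cut points that split the data into ten parts
deciles <- quantile(num_data, probs = seq(0.1, 0.9, by = 0.1))

quartiles
deciles
```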

Computing the Quartiles

Example: What are the quartiles of the data set \(7,1,2,4,6,3,2,7\)?

\[\text{sorted data} \longrightarrow 1,2,2,3,4,6,7,7\]

Note that the number of data points is \(8\), an even number.

\[ \begin{aligned} 25\text{th percentile} & \approx 2 \\ 50\text{th percentile (median)} & = \frac{3+4}{2} = 3.50 \\ 75\text{th percentile} & \approx 6 \\ \end{aligned} \]

Note that the hand-computed values are approximations; with a small dataset, different methods can give slightly different quartiles, but the concept of percentiles still holds.

num_data <- c(7,1,2,4,6,3,2,7)
quantile(num_data)
##   0%  25%  50%  75% 100% 
## 1.00 2.00 3.50 6.25 7.00

So, using R's output, the quartiles are \(Q_1 = 2\), \(Q_2 = 3.50\), and \(Q_3 = 6.25\).

Computing the IQR

Example: What is the IQR of the data set \(7,1,2,4,6,3,2,7\)?

\[ \begin{aligned} & Q_1 \approx 2 \\ & Q_2 \text{ (median)} = \frac{3+4}{2} = 3.50 \\ & Q_3 \approx 6.25 \end{aligned} \]

\[ \begin{aligned} \text{IQR} & = Q_3 - Q_1 \\ & = 6.25 - 2 \\ & = 4.25 \end{aligned} \]

num_data <- c(7,1,2,4,6,3,2,7)
IQR(num_data)
## [1] 4.25

So, the IQR is \(4.25\).
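The IQR can also be assembled from the quartiles directly, which makes the definition \(Q_3 - Q_1\) explicit. A minimal sketch, with `q` and `iqr_manual` as illustrative names:

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# First and third quartiles using R's default quantile method
q <- quantile(num_data, c(0.25, 0.75))

# IQR as the difference Q3 - Q1
iqr_manual <- unname(q[2] - q[1])

# Should agree with the built-in IQR()
all.equal(iqr_manual, IQR(num_data))
```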

Box Plots

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

[Figure: Box Plot of Study Hours]

[Figure: Histogram of Study Hours]

\(\star\) Key Idea: Box plots and histograms visualize numerical data, with histograms showing distribution shape, while box plots summarize spread and outliers, making them better for comparisons.

Anatomy of Box Plots

Main Parts:

Computing the Whiskers

Whiskers of a box plot can extend up to \(1.5 \times \text{IQR}\) away from the quartiles, ending at the most extreme observation within that reach. The factor \(1.5\) is a convention rather than a derived value; it is the common academic standard and the default when plotting box plots.

Example: Suppose that \(Q_1 = 10\) and \(Q_3 = 20\).

\[\text{IQR} = 20 - 10 = 10\] \[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\] \[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]

\(\star\) A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
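The whisker rule can be sketched as a small R function. The helper `whisker_reach()` and the observations in `obs` are hypothetical, made up here for illustration only.

```r
# Hypothetical helper (not from the slides): maximum whisker reach
whisker_reach <- function(q1, q3, k = 1.5) {
  iqr <- q3 - q1
  c(lower = q1 - k * iqr, upper = q3 + k * iqr)
}

# Example from above: Q1 = 10 and Q3 = 20
reach <- whisker_reach(10, 20)
reach

# Flag potential outliers in a made-up set of observations
obs <- c(-8, 3, 12, 18, 27, 40)
outliers <- obs[obs < reach["lower"] | obs > reach["upper"]]
outliers
```

With these made-up observations, \(-8\) and \(40\) lie beyond the reach of \(-5\) and \(35\) and would be flagged as potential outliers.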

More Boxplot and Dot Plot Examples

Example: The boxplot and stacked dot plot of the data set \(7,1,2,4,6,3,2,7\) are shown below.

[Figure: Box Plot]

[Figure: Stacked Dot Plot]

\(\star\) A dot plot is used instead of a histogram for convenience with the small, discrete-valued dataset. The box plot shows no potential outliers, as all data points fall within the whiskers.
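The absence of potential outliers can be confirmed numerically with `boxplot.stats()`, which returns the numbers behind a box plot. One caveat for this sketch: the function computes the box edges as hinges, which may differ slightly from the quartiles reported by `quantile()`.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Numbers behind the box plot (whisker coefficient is 1.5 by default)
bp <- boxplot.stats(num_data)

bp$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bp$out    # potential outliers; empty for this dataset
```

To draw the plots themselves, `boxplot(num_data, horizontal = TRUE)` and `stripchart(num_data, method = "stack")` are the base R calls one might use.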

Why Are Outliers Important?

Case Study 1

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

Case Study 1: Extreme Observations

Data | \(\overline{x}\) | \(s\) | Median | IQR
--- | --- | --- | --- | ---
Original | 244,778.6 | 225,903.8 | 190,000 | 200,000
Replace largest with $10M | 566,207.1 | 1,830,200.6 | 190,000 | 200,000
Replace smallest with $10M | 316,121.4 | 854,489.2 | 200,000 | 200,000

\(\star\) The table shows that replacing a single value with an extreme one substantially changes the mean but leaves the median unchanged or nearly unchanged, indicating the mean's sensitivity to extreme observations. Similarly, the standard deviation changes greatly, while the IQR remains the same.
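The same effect can be reproduced on a small scale. The sketch below uses made-up incomes, NOT the case study data, and a hypothetical helper `summarize_income()` to compare the four statistics before and after replacing the largest value with $10 million.

```r
# Hypothetical household incomes in dollars (not the case study data)
income <- c(40000, 55000, 60000, 75000, 90000, 120000, 150000)

# Replace the largest value with $10 million
income_big <- replace(income, which.max(income), 1e7)

# Hypothetical helper collecting the four summary statistics
summarize_income <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

rbind(original = summarize_income(income),
      replaced = summarize_income(income_big))
```

In this made-up example the mean and SD increase sharply, while the median and IQR do not change.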

Robust Statistics

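Median and IQR are more robust to skewness and outliers than mean and SD. Therefore, for skewed distributions it is often more helpful to use the median and IQR to describe center and spread, while for symmetric distributions the mean and SD are often more helpful.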

Activity: Compute Summary Statistics and Visualize Numerical Data

  1. Make sure you have a copy of the F 2/7 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
  2. Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
  3. Get together with another student.
  4. Discuss your results.
  5. Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/