Summarizing Numerical Data

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

February 7, 2025

Objectives

Know how to compute measures of spread
Understand how the measures of spread describe a distribution
Develop an understanding of robust statistics
Activity: Compute Summary Statistics and Visualize Numerical Data

These slides are derived from Diez et al. (2012).

Previously… (1/2)

Exploratory Analysis

It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.

Descriptive statistics

It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

Measures of Central Tendency
- Mean (Average), Median, and Mode
Measures of Dispersion (Spread)
- Range, Variance, Standard Deviation, Interquartile Range (IQR)

For Categorical Variables

Frequency
Relative Frequency (Proportion)
Percentage

Previously… (2/2)

Commonly Observed Distribution Shapes

Measures of Dispersion (Spread)

Measures of dispersion describe the variability of numerical data. They quantify the variability, or uncertainty, in a distribution. The following are the common measures of dispersion:

Variance: The average squared deviation from the mean.
Standard Deviation: The square root of the variance, providing a measure of spread in the same units as the data.
Range: The difference between the maximum and minimum values.
Interquartile Range (IQR): The difference between the 75th and 25th percentiles, capturing the spread of the middle 50% of the data.

Variance

The variance is roughly the average squared deviation from the mean.

The formula for the variance is given by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}\] where $x_1, x_2, \ldots, x_n$ are the data points, $\bar{x}$ is the sample mean, and $n$ is the sample size.

What is the meaning of the variance?

A measure of how spread out data points are around the mean.
Indicates the level of uncertainty or variability in a dataset.

Computing the Variance

Example: What is the variance of the data set $7,1,2,4,6,3,2,7$?

Manual Computation: \[ \begin{aligned} \text{mean} \longrightarrow \bar{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]

\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
var(num_data)

## [1] 5.714286

So, the variance is $5.714$.
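The manual steps above can be retraced in R one at a time. This is a minimal sketch; the intermediate names `x_bar`, `deviations`, `squared_dev`, and `manual_var` are chosen here for illustration and do not appear in the original slides.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

x_bar <- mean(num_data)                  # sample mean
deviations <- num_data - x_bar           # distance of each point from the mean
squared_dev <- deviations^2              # squaring removes the signs
manual_var <- sum(squared_dev) / (length(num_data) - 1)  # divide by n - 1

manual_var                               # should agree with var(num_data)
```

Dividing by `length(num_data) - 1` rather than `length(num_data)` is what makes the result match `var()`.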

Variance is Never Negative

Why do we use the squared deviation in the calculation of variance?

To get rid of negatives so that observations equally distant from the mean are weighted equally.
To weight larger deviations more heavily.

$\star$ Variance is roughly the average of the squared differences between each data point and the mean (the sum is divided by $n-1$ rather than $n$).
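One way to see why the deviations are squared is to sum the raw deviations first: the positive and negative ones cancel. A short sketch in R with the same data as the earlier example (the name `deviations` is illustrative):

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)
deviations <- num_data - mean(num_data)

sum(deviations)      # 0: positive and negative deviations cancel
sum(deviations^2)    # 40: squared deviations accumulate instead
```

Because the raw deviations always sum to zero, they cannot measure spread on their own.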

Making Sense of the Variance

Common variance interpretations:

Zero Variance: All values are the same.
Low Variance: Data points are close to the mean (consistent data).
High Variance: Data points are spread out (greater variability).
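The three cases above can be illustrated with small made-up vectors. The data below are hypothetical and chosen only so that all three share the same mean of 5.

```r
# Hypothetical data: every vector has mean 5
all_same <- c(5, 5, 5, 5)   # no variability
tight    <- c(4, 5, 5, 6)   # values close to the mean
wide     <- c(1, 3, 7, 9)   # values far from the mean

var(all_same)   # zero variance
var(tight)      # low variance
var(wide)       # high variance
```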

Example: Order the following distributions from low to high variance.

The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.

Standard Deviation (SD)

The standard deviation (SD) is the square root of the variance, and has the same units as the data.

The formula for the standard deviation is given by \[s = \sqrt{s^2}\] where $s^2$ is the variance.

What is the meaning of the standard deviation?

Standard deviation is a measure of how spread out the values in a dataset are.
It quantifies the amount of variation or dispersion of a set of data points.
Helps understand data consistency.

Computing the SD

Example: What is the standard deviation of the data set $7,1,2,4,6,3,2,7$?

Manual Computation:

\[ \begin{aligned} \text{mean} \longrightarrow \bar{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]

\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]

\[ \text{standard deviation} \longrightarrow s = \sqrt{5.714} = 2.390 \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
sd(num_data)

## [1] 2.390457

So, the standard deviation is $2.390$.
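As a quick check of the definition, the square root of `var()` should reproduce `sd()`. A minimal sketch (the name `manual_sd` is illustrative):

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

manual_sd <- sqrt(var(num_data))   # square root of the variance
manual_sd                          # should agree with sd(num_data)
```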

Making Sense of the Standard Deviation

Variance is expressed in squared units, which makes it hard to interpret directly. The standard deviation is in the same units as the data, so it gives a more practical sense of dispersion.

Example: Suppose we analyze the annual salaries of employees in two different companies:

Company	$\overline{x}$ (thousand USD)	$s^2$ (thousand USD, squared)	$s$ (thousand USD)
A	55	62.5	7.906
B	130	2250	47.434

Even though Company B has higher salaries on average, its standard deviation is much larger, suggesting greater salary inequality.

$\star$ Standard deviation is directly interpretable as the typical deviation from the mean, and it helps compare the spread of distributions measured in the same units.
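The role of units can be sketched by rescaling a salary vector. The salaries below are hypothetical, made up for illustration, and are not the company data from the table.

```r
# Hypothetical salaries in dollars (not from the slides)
salaries_usd <- c(48000, 52000, 55000, 61000, 59000)
salaries_k   <- salaries_usd / 1000    # same salaries in thousands of dollars

sd(salaries_usd) / sd(salaries_k)      # SD scales by the conversion factor
var(salaries_usd) / var(salaries_k)    # variance scales by the factor squared
```

The standard deviation changes by the same factor as the data, while the variance changes by that factor squared, which is why the variance is not in the data's own units.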

Range

The range is the difference between the maximum and minimum of the numerical data.

The formula for the range is given by \[\text{range} = x_{max} - x_{min}\] where $x_{max}$ is the maximum value and $x_{min}$ is the minimum value.

Computing the Range

Example: What is the range of the data set $7,1,2,4,6,3,2,7$?

Manual Computation:

\[\text{sorted data} \longrightarrow \color{blue}{\mathbf{1}},2,2,3,4,6,7,\color{blue}{\mathbf{7}}\] \[\begin{aligned} \text{range} & = \color{blue}{\mathbf{7}} - \color{blue}{\mathbf{1}} \\ & = 6 \end{aligned}\]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
max(num_data) - min(num_data)

## [1] 6

So, the range is $6$.
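Note that base R's `range()` returns the minimum and maximum as a pair rather than their difference. A small sketch of an equivalent way to get the single number, using `diff()`:

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

range(num_data)                        # the pair (minimum, maximum)
data_range <- diff(range(num_data))    # maximum minus minimum
data_range
```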

Percentiles

A percentile is a measure used to indicate the value below which a given percentage of observations fall.

The percentile rank of a value is given by \[\text{percentile of } x = \frac{\text{number of values below } x}{\text{total number of values}} \times 100\] where $x$ is a value in the data.

$\star$ Key Idea: The percentile is the value below which a certain percentage of the data lies.

Computing Percentiles

Example: What is the percentile of $6$ in the data set $7,1,2,4,6,3,2,7$?

Manual Computation:

\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{red}{\mathbf{2}},\color{red}{\mathbf{3}},\color{red}{\mathbf{4}},\color{blue}{\mathbf{6}},7,7\]

\[ \begin{aligned} \text{percentile of } \color{blue}{\mathbf{6}} & = \frac{5}{8} \times 100 \\ & = 62.5 \end{aligned} \]

So, the data value $6$ is at the $62.5$th percentile; that is, 62.5% of the data lies below $6$.
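The counting step in this computation translates directly into R. This is a sketch of the formula from the slides, not a built-in percentile function; the names `x`, `n_below`, and `percentile_rank` are illustrative.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)
x <- 6

n_below <- sum(num_data < x)                          # values strictly below x
percentile_rank <- n_below / length(num_data) * 100   # as a percentage

percentile_rank                                       # 62.5
```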

Computing Percentiles in Reverse

Example: What is the 30th percentile of the data set $7,1,2,4,6,3,2,7$? (What is the value in the data below which 30% of the data lies?)

Manual Computation:

\[ \begin{aligned} 30 & = \frac{\text{number of values below } x}{8} \times 100 \\ 0.30 \times 8 & = \text{number of values below } x \\ 2.40 & = \text{number of values below } x \\ 2 & \longleftarrow \text{rounded to nearest integer} \end{aligned} \]

\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{blue}{\mathbf{2}},3,4,6,7,7\]

\[ \begin{aligned} 30\text{th percentile} & \approx 2 \end{aligned} \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
quantile(num_data,c(0.30))

## 30% 
## 2.1

So, the $30$th percentile is approximately $2$ by hand, while R reports $2.1$. The two differ because R's default method interpolates linearly between neighboring sorted values instead of rounding to a position in the data.

$\star$ Note: Whether the $30$th percentile is reported as $2$ or $2.1$ depends on the method used to compute percentiles, and the difference is most visible in small datasets. Under R's default method, $2$ is the $25$th percentile and $2.1$ is the $30$th percentile.
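The dependence on the method can be seen through the `type` argument of `quantile()`. The sketch below compares the default (`type = 7`, linear interpolation) with `type = 1`, which always returns an actual data value; the choice of these two types is only for illustration.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# type = 7 is R's default: interpolates between neighboring sorted values
q_default <- quantile(num_data, 0.30, type = 7)

# type = 1 returns one of the observed data values (no interpolation)
q_type1 <- quantile(num_data, 0.30, type = 1)

unname(q_default)   # 2.1
unname(q_type1)     # 2
```

For this dataset, the `type = 1` result coincides with the by-hand answer, though the two procedures are not defined the same way in general.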

$\mathbf{Q}_1$, $\mathbf{Q}_2$, $\mathbf{Q}_3$, and the IQR

$\mathbf{Q}_1$ (the 1st quartile), $\mathbf{Q}_2$ (the 2nd quartile), $\mathbf{Q}_3$ (the 3rd quartile), and the IQR (interquartile range) are statistical measures used to describe the spread and distribution of a dataset:

The 1st quartile is also called the 25th percentile, $Q_1$.
The 2nd quartile (median) is also called the 50th percentile, $Q_2$.
The 3rd quartile is also called the 75th percentile, $Q_3$.
Between $Q_1$ and $Q_3$ is the middle 50% of the data. The range these data span is called the interquartile range, or the IQR. \[\text{IQR} = Q_3 - Q_1\]

$\star$ Key Idea: The quartiles divide the sorted data into four equal parts, each containing about 25% of the observations. About 25% of the observations lie below $Q_1$, 50% below $Q_2$, and 75% below $Q_3$.

Quantiles

Quartiles are a special case of quantiles, which are values that split sorted data into equal parts. Quartiles are the quantiles that split the data into four parts.

Computing the Quartiles

Example: What are the quartiles of the data set $7,1,2,4,6,3,2,7$?

Manual Computation:

\[\text{sorted data} \longrightarrow 1,2,2,3,4,6,7,7\]

Note that the number of data points is $8$, an even number.

\[ \begin{aligned} 25\text{th percentile} & \approx 2 \\ 50\text{th percentile (median)} & = \frac{3+4}{2} = 3.50 \\ 75\text{th percentile} & \approx 6 \\ \end{aligned} \]

Note that these are approximations due to the small dataset size, but the concept of percentiles still holds.

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
quantile(num_data)

##   0%  25%  50%  75% 100% 
## 1.00 2.00 3.50 6.25 7.00

So, the quartiles are $Q_1 = 2$, $Q_2 = 3.50$, and $Q_3 = 6.25$.
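The gap between the by-hand value $Q_3 \approx 6$ and R's $6.25$ comes from interpolation. The sketch below retraces the linear interpolation used by R's default quantile method for $p = 0.75$; the variable names are illustrative, and the code assumes the position falls strictly inside the sorted data.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)
sorted <- sort(num_data)
n <- length(sorted)

p <- 0.75
pos <- 1 + (n - 1) * p     # fractional position in the sorted data: 6.25
lower <- floor(pos)        # the 6th sorted value is 6
frac <- pos - lower        # 0.25 of the way toward the 7th sorted value, 7

q3_manual <- sorted[lower] + frac * (sorted[lower + 1] - sorted[lower])
q3_manual                  # 6.25
```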

Computing the IQR

Example: What is the IQR of the data set $7,1,2,4,6,3,2,7$?

Manual Computation (using the quartiles reported by R):

\[ \begin{aligned} & Q_1 \approx 2 \\ & Q_2 \text{ (median)} = \frac{3+4}{2} = 3.50 \\ & Q_3 \approx 6.25 \end{aligned} \]

\[ \begin{aligned} \text{IQR} & = Q_3 - Q_1 \\ & = 6.25 - 2 \\ & = 4.25 \end{aligned} \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
IQR(num_data)

## [1] 4.25

So, the IQR is $4.25$.
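The same number can be obtained by subtracting the quartiles returned by `quantile()`, which mirrors the formula $\text{IQR} = Q_3 - Q_1$. A minimal sketch (the names `q` and `iqr_manual` are illustrative):

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

q <- quantile(num_data, c(0.25, 0.75))   # Q1 and Q3
iqr_manual <- unname(q[2] - q[1])        # Q3 minus Q1, with the name dropped

iqr_manual                               # should agree with IQR(num_data)
```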

Box Plots

The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.

Box Plot of Study Hours

Histogram of Study Hours

$\star$ Key Idea: Box plots and histograms both visualize numerical data. Histograms show the shape of a distribution, while box plots summarize spread and potential outliers, which makes them better for comparisons.

Anatomy of Box Plots

Main Parts:

Box
- Defined by the quartiles $Q_1$, $Q_2$ (median), and $Q_3$.
- The IQR defines the length of the box.
Whiskers
- Lower whisker extends to the smallest observation that is no less than $Q_1 - 1.5 \times \text{IQR}$.
- Upper whisker extends to the largest observation that is no more than $Q_3 + 1.5 \times \text{IQR}$.
Outliers
- Data points that are placed beyond the whiskers.

Computing the Whiskers

Whiskers of a box plot can extend up to $1.5 \times \text{IQR}$ away from the quartiles. The multiplier $1.5$ is a convention rather than a derived quantity; it is the common academic standard and the default when plotting box plots.

Example: Suppose that $Q_1 = 10$ and $Q_3 = 20$.

\[\text{IQR} = 20 - 10 = 10\] \[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\] \[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]

$\star$ A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
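The whisker limits in this example can be used to flag potential outliers. In the sketch below, $Q_1 = 10$ and $Q_3 = 20$ are taken from the example, while the vector `obs` is hypothetical and made up for illustration (its own quartiles are not computed here).

```r
q1 <- 10
q3 <- 20
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr   # maximum lower whisker reach: -5
upper_fence <- q3 + 1.5 * iqr   # maximum upper whisker reach: 35

# Hypothetical observations, not from the slides
obs <- c(-8, 3, 12, 15, 19, 28, 41)

# Keep only the observations beyond either limit
potential_outliers <- obs[obs < lower_fence | obs > upper_fence]
potential_outliers              # -8 and 41
```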

More Boxplot and Dot Plot Examples

Example: The boxplot and stacked dot plot of the data set $7,1,2,4,6,3,2,7$ are shown below.

Box Plot

Stacked Dot Plot

$\star$ A dot plot is used instead of a histogram because the dataset is small and takes only a few discrete values. The box plot shows no potential outliers, as all data points fall within the whiskers.
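Plots like these could be drawn with base R graphics. This is one possible sketch and not necessarily how the slide figures were produced; the titles and the `pch` setting are assumptions. Note that `boxplot()` builds the box from hinges, which can differ slightly from the `quantile()` quartiles when the number of observations is even.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

# Horizontal box plot of the data
boxplot(num_data, horizontal = TRUE, main = "Box Plot")

# Stacked dot plot: repeated values are stacked on top of each other
stripchart(num_data, method = "stack", pch = 19, main = "Stacked Dot Plot")

# Points that boxplot() would draw beyond the whiskers (none for this data)
outliers <- boxplot.stats(num_data)$out
length(outliers)
```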

Why Are Outliers Important?

Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.

Case Study 1

How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?

Case Study 1: Extreme Observations

data	$\overline{x}$	$s$	Median	IQR
original	244,778.6	225,903.8	190,000	200,000
replace largest with $10M	566,207.1	1,830,200.6	190,000	200,000
replace smallest with $10M	316,121.4	854,489.2	200,000	200,000

$\star$ The table shows that shifting specific values to the extreme significantly affects the mean but not the median, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation is affected, while the IQR remains the same.

Robust Statistics

Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,

for skewed distributions it is often more helpful to use median and IQR to describe the center and spread
for symmetric distributions it is often more helpful to use the mean and SD to describe the center and spread
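Robustness can be sketched by replacing one value with an extreme one and recomputing the summaries. The incomes below are hypothetical and made up for illustration; they are not the household income data from the case study, and `summarize` is a helper defined only for this example.

```r
# Hypothetical household incomes in dollars (not the case study data)
incomes <- c(30000, 45000, 50000, 60000, 75000, 90000, 120000)

# Replace the largest value with $10 million
extreme <- incomes
extreme[which.max(extreme)] <- 1e7

# Helper for this example: the four summaries used in the case study
summarize <- function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
}

rbind(original = summarize(incomes), extreme = summarize(extreme))
```

In this made-up example the mean and SD change substantially, while the median and IQR stay the same because neither depends on the largest value.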

Activity: Compute Summary Statistics and Visualize Numerical Data

Make sure you have a copy of the F 2/7 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
Get together with another student.
Discuss your results.
Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/