MTH-161D | Spring 2025 | University of Portland
February 7, 2025
These slides are derived from Diez et al. (2012).
Exploratory Analysis
It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.
Descriptive statistics
It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.
For Numerical Variables
For Categorical Variables
Commonly Observed Distribution Shapes
Measures of dispersion describe the variability of numerical data. They quantify the spread, and in that sense the uncertainty, of a distribution. The following are the common measures of dispersion:
The variance is roughly the average squared deviation from the mean.
The formula for the variance is given by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}\] where \(x_1, x_2, \ldots, x_n\) are the data points, \(\bar{x}\) is the sample mean, and \(n\) is the sample size.
What is the meaning of the variance?
Example: What is the variance of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]
So, the variance is \(5.714\).
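The hand computation above can be checked with a short script. This is a minimal sketch in Python, which is an assumption since the slides do not name a language; it uses only the standard library. It applies the formula term by term and compares the result with `statistics.variance`, which also divides by \(n-1\).

```python
from statistics import mean, variance

data = [7, 1, 2, 4, 6, 3, 2, 7]

# Step 1: sample mean
x_bar = mean(data)

# Step 2: squared deviation of every data point from the mean
squared_deviations = [(x - x_bar) ** 2 for x in data]

# Step 3: divide the sum by n - 1 (not n) to get the sample variance
s2_by_hand = sum(squared_deviations) / (len(data) - 1)

# Cross-check with the standard library's sample variance
s2_library = variance(data)

print(round(s2_by_hand, 3), round(s2_library, 3))
```

Both values round to \(5.714\), in agreement with the slide.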
Why do we use the squared deviation in the calculation of variance?
\(\star\) Variance is roughly the average of the squared differences between each data point and the mean; the sum is divided by \(n-1\) rather than \(n\).
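One standard reason for squaring, offered here as an illustration rather than as the slides' own answer, is that raw deviations from the mean always cancel out. The sketch below, in Python with the standard library only, shows this on the example data set.

```python
from statistics import mean

data = [7, 1, 2, 4, 6, 3, 2, 7]
x_bar = mean(data)

# Raw deviations: positive and negative values cancel each other
deviations = [x - x_bar for x in data]

# Squared deviations: every term is non-negative, so nothing cancels
squared = [d ** 2 for d in deviations]

print(sum(deviations))  # 0
print(sum(squared))     # 40
```

Because the raw deviations sum to zero for any data set, their average carries no information about spread; squaring makes every term count.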
Common variance interpretations:
Example: Order the following distributions from low to high variance.
The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.
The standard deviation (SD) is the square root of the variance, and has the same units as the data.
The formula for the standard deviation is given by \[s = \sqrt{s^2}\] where \(s^2\) is the variance.
What is the meaning of the standard deviation?
Example: What is the standard deviation of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \text{mean} \longrightarrow \bar{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]
\[ \begin{aligned} \text{variance} \longrightarrow s^2 & = \frac{\begin{matrix} (7-4)^2 + (1-4)^2 + \\ (2-4)^2 + (4-4)^2 + \\ (6-4)^2 + (3-4)^2 + \\ (2-4)^2 + (7-4)^2 \end{matrix}}{8-1} \\ & = 5.714 \end{aligned} \]
\[ \text{standard deviation} \longrightarrow s = \sqrt{5.714} = 2.390 \]
So, the standard deviation is \(2.390\).
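The three steps above (mean, variance, square root) can be reproduced as follows. This is a minimal sketch in Python using the standard library; `statistics.stdev` is the sample standard deviation, based on the \(n-1\) variance.

```python
from math import sqrt
from statistics import stdev, variance

data = [7, 1, 2, 4, 6, 3, 2, 7]

# Sample variance (divides by n - 1), then its square root
s2 = variance(data)
s_from_variance = sqrt(s2)

# The library computes the same quantity directly
s_library = stdev(data)

print(round(s_from_variance, 3), round(s_library, 3))
```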
Variance is expressed in squared units, which makes it hard to read directly, while the standard deviation is in the same units as the data and so gives a more practical understanding of dispersion.
Example: Suppose we analyze the annual salaries of employees in two different companies:
Company | \(\overline{x}\) (in $K) | \(s^2\) (in $K squared) | \(s\) (in $K) |
---|---|---|---|
A | 55 | 62.5 | 7.906 |
B | 130 | 2250 | 47.434 |
\(\star\) Because the standard deviation is in the same units as the data, it can be read directly as the typical deviation from the mean, which makes it useful for comparing the spread of distributions measured in the same units.
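The link between the two spread columns of the salary table can be verified numerically. The sketch below, in Python, takes the two variances from the table (treated as being in squared thousands of dollars, which is an assumption about the table's units) and recovers the standard deviations.

```python
from math import sqrt

# Variances from the table, assumed to be in squared thousands of dollars
variances = {"A": 62.5, "B": 2250.0}

# The standard deviation is the square root, back in thousands of dollars
sds = {company: sqrt(s2) for company, s2 in variances.items()}

for company, s in sds.items():
    print(f"Company {company}: s = {s:.3f} thousand dollars")
```

The results match the \(s\) column, and they can be compared directly with the mean salaries because they share the same units.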
The range is the difference between the maximum and minimum of the numerical data.
The formula for the range is given by \[\text{range} = x_{\max} - x_{\min}\] where \(x_{\max}\) is the maximum value and \(x_{\min}\) is the minimum value.
Example: What is the range of the data set \(7,1,2,4,6,3,2,7\)?
So, the range is \(6\).
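For completeness, the same result in code: a minimal Python sketch that applies the range formula to the example data set.

```python
data = [7, 1, 2, 4, 6, 3, 2, 7]

x_max = max(data)  # largest observation
x_min = min(data)  # smallest observation
data_range = x_max - x_min

print(data_range)  # 6
```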
A percentile is a measure used to indicate the value below which a given percentage of observations fall.
The formula for computing the percentile rank of a value is given by \[\text{percentile of } x = \frac{\text{number of values below } x}{\text{total number of values}} \times 100\] where \(x\) is a value in the data.
\(\star\) Key Idea: The percentile is the value below which a certain percentage of the data lies.
Example: What is the percentile of \(6\) in the data set \(7,1,2,4,6,3,2,7\)?
\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{red}{\mathbf{2}},\color{red}{\mathbf{3}},\color{red}{\mathbf{4}},\color{blue}{\mathbf{6}},7,7\]
\[ \begin{aligned} \text{percentile of } \color{blue}{\mathbf{6}} & = \frac{5}{8} \times 100 \\ & = 62.5 \end{aligned} \]
So, the data value \(6\) is at the \(62.5\)th percentile; that is, 62.5% of the data is below \(6\).
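The percentile formula translates directly into a small function. The sketch below is in Python; the name `percentile_rank` is a hypothetical helper introduced for illustration, and it counts values strictly below \(x\), as the formula does.

```python
def percentile_rank(values, x):
    """Percentage of observations in `values` that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return below / len(values) * 100

data = [7, 1, 2, 4, 6, 3, 2, 7]

rank_of_6 = percentile_rank(data, 6)
print(rank_of_6)  # 62.5
```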
Example: What is the 30th percentile of the data set \(7,1,2,4,6,3,2,7\)? (What is the value in the data below which 30% of the data lies?)
\[ \begin{aligned} 30 & = \frac{\text{number of values below } x}{8} \times 100 \\ 0.30 \times 8 & = \text{number of values below } x \\ 2.40 & = \text{number of values below } x \\ 2 & \longleftarrow \text{rounded to nearest integer} \end{aligned} \]
\[\text{sorted data} \longrightarrow \color{red}{\mathbf{1}},\color{red}{\mathbf{2}},\color{blue}{\mathbf{2}},3,4,6,7,7\]
\[ \begin{aligned} 30\text{th percentile} & \approx 2 \end{aligned} \]
So, the \(30\)th percentile is approximately \(2\) when computed by hand. Software that interpolates linearly between neighboring sorted values reports \(2.1\). The gap comes from rounding \(2.40\) to a whole number of observations, which matters more in a small dataset.
\(\star\) Note: Whether the \(30\)th percentile is reported as \(2\) or \(2.1\) depends on the method used to compute percentiles. Under linear interpolation, \(2\) is exactly the \(25\)th percentile and \(2.1\) is exactly the \(30\)th percentile, so the hand-computed value is a close approximation.
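The two answers can be compared side by side. The sketch below is in Python and rests on two assumptions: the hand method is read as "take the sorted value that has the rounded number of observations before it", and the interpolated value is computed with `statistics.quantiles` using `method="inclusive"`, which interpolates linearly between sorted values.

```python
from statistics import quantiles

data = sorted([7, 1, 2, 4, 6, 3, 2, 7])
n = len(data)

# Hand method: 30% of 8 observations is 2.4, rounded to 2 observations
count_below = round(0.30 * n)
by_hand = data[count_below]  # the sorted value with 2 values before it

# Linear interpolation: the nine deciles are the 10th, 20th, ..., 90th
# percentiles, so the third cut point is the 30th percentile
deciles = quantiles(data, n=10, method="inclusive")
interpolated = deciles[2]

print(by_hand, interpolated)
```

The hand method returns an actual data value, while interpolation can return a value between two observations.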
\(\mathbf{Q}_1\) (the 1st quartile), \(\mathbf{Q}_2\) (the 2nd quartile), \(\mathbf{Q}_3\) (the 3rd quartile), and the IQR (interquartile range) are statistical measures used to describe the spread and distribution of a dataset:
\(\star\) Key Idea: The quartiles divide the sorted numerical data into four equal parts, each containing 25% of the observations. When the data is arranged in ascending order, 25% of the observations lie below \(Q_1\), 50% lie below \(Q_2\), and 75% lie below \(Q_3\).
Quartiles are a special case of quantiles, which are values that split sorted data into equal parts. Quartiles are the quantiles that split the data into four parts.
Example: What are the quartiles of the data set \(7,1,2,4,6,3,2,7\)?
\[\text{sorted data} \longrightarrow 1,2,2,3,4,6,7,7\]
Note that the number of data points is \(8\), an even number.
\[ \begin{aligned} 25\text{th percentile} & \approx 2 \\ 50\text{th percentile (median)} & = \frac{3+4}{2} = 3.50 \\ 75\text{th percentile} & \approx 6 \\ \end{aligned} \]
Note that these are hand approximations for a small dataset, but the concept of percentiles still holds. Statistical software that interpolates between neighboring values gives the following quantiles:
## 0% 25% 50% 75% 100%
## 1.00 2.00 3.50 6.25 7.00
So, the quartiles are \(Q_1 = 2\), \(Q_2 = 3.50\), and \(Q_3 = 6.25\).
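The quartile values above can be reproduced with linear interpolation. The sketch below is in Python, which is an assumption since the slides show only the output; `method="inclusive"` in `statistics.quantiles` treats the smallest and largest observations as the 0th and 100th percentiles.

```python
from statistics import median, quantiles

data = [7, 1, 2, 4, 6, 3, 2, 7]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)    # 2.0 3.5 6.25
print(median(data))  # 3.5, the same as Q2
```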
Example: What is the IQR of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} & Q_1 \approx 2 \\ & Q_2 \text{ (median)} = \frac{3+4}{2} = 3.50 \\ & Q_3 \approx 6.25 \end{aligned} \]
\[ \begin{aligned} \text{IQR} & = Q_3 - Q_1 \\ & = 6.25 - 2 \\ & = 4.25 \end{aligned} \]
So, the IQR is \(4.25\).
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
[Figure: Box Plot of Study Hours]
[Figure: Histogram of Study Hours]
\(\star\) Key Idea: Box plots and histograms visualize numerical data, with histograms showing distribution shape, while box plots summarize spread and outliers, making them better for comparisons.
Main Parts:
Whiskers of a box plot can extend up to \(1.5 \times \text{IQR}\) away from the quartiles. The factor \(1.5\) is a convention rather than a derived value; it is considered an academic standard and is the usual default when plotting box plots.
Example: Suppose that \(Q_1 = 10\) and \(Q_3 = 20\).
\[\text{IQR} = 20 - 10 = 10\] \[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\] \[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]
\(\star\) A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
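The whisker rule and the outlier definition can be sketched together. The code below is in Python; `whisker_limits` is a hypothetical helper, and the list `observations` is made-up data used only to show how values are flagged.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) maximum reach of the whiskers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles from the example: Q1 = 10, Q3 = 20
lower, upper = whisker_limits(10, 20)
print(lower, upper)  # -5.0 35.0

# Hypothetical observations, for illustration only
observations = [-8, 0, 12, 35, 40]

# A potential outlier lies strictly beyond the maximum reach
outliers = [x for x in observations if x < lower or x > upper]
print(outliers)  # [-8, 40]
```

Note that \(35\) sits exactly on the upper limit, so it is not flagged.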
Example: The box plot and stacked dot plot of the data set \(7,1,2,4,6,3,2,7\) are shown below.
[Figure: Box Plot]
[Figure: Stacked Dot Plot]
\(\star\) A dot plot is used instead of a histogram because it is more convenient for a small dataset of discrete values. The box plot shows no potential outliers, as all data points fall within the maximum reach of the whiskers.
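The claim that the data set has no potential outliers can be checked numerically. The sketch below is in Python and assumes the box plot uses the interpolated quartiles \(Q_1 = 2\) and \(Q_3 = 6.25\) together with the \(1.5 \times \text{IQR}\) rule.

```python
from statistics import quantiles

data = [7, 1, 2, 4, 6, 3, 2, 7]

# Quartiles by linear interpolation, then the interquartile range
q1, _, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers under the 1.5 x IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Observations beyond the reach would be potential outliers
outliers = [x for x in data if x < lower or x > upper]

print(iqr, lower, upper, outliers)
```

The list of outliers is empty because every observation lies between \(1\) and \(7\), well inside the limits.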
How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
data | \(\overline{x}\) | \(s\) | Median | IQR |
---|---|---|---|---|
original | 244,778.6 | 225,903.8 | 190,000 | 200,000 |
replace largest with $10M | 566,207.1 | 1,830,200.6 | 190,000 | 200,000 |
replace smallest with $10M | 316,121.4 | 854,489.2 | 200,000 | 200,000 |
\(\star\) The table shows that replacing a single value with an extreme one changes the mean substantially, while the median changes only slightly or not at all, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation changes sharply, while the IQR remains the same.
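The same effect can be reproduced on a small scale. The household income data is not included in these slides, so the sketch below, in Python, reuses the data set \(7,1,2,4,6,3,2,7\) and replaces its largest value with \(1000\), an arbitrary extreme value chosen for illustration. The helper `summarize` is hypothetical.

```python
from statistics import mean, median, quantiles, stdev

def summarize(values):
    """Mean, SD, median, and IQR (quartiles by linear interpolation)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }

original = sorted([7, 1, 2, 4, 6, 3, 2, 7])
modified = original[:-1] + [1000]  # replace the largest value (illustrative)

before = summarize(original)
after = summarize(modified)

for name in before:
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

In this sketch the mean and standard deviation jump, while the median and IQR are unchanged, which mirrors the pattern in the table.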
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,