MTH-361A | Spring 2026 | University of Portland
Variables
Visualization Techniques
Scatterplots are useful for visualizing the relationship between two numerical variables.
Example: Sepal.Length and Petal.Length in the iris data set appear to have a positive association, meaning that as Sepal.Length increases, Petal.Length also tends to increase.
Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.
The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data. The mean GPA is 3.59.
Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.
Which one(s) of these histograms are useful? Which reveal too much about the data? Which hide too much?
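The trade-off between revealing and hiding detail can be sketched numerically. The snippet below bins one small hypothetical sample (not the data behind the lecture's histograms) with three bin widths; `bin_counts` is a helper introduced here only for illustration.

```python
from collections import Counter

def bin_counts(values, width, start=0):
    """Count values per bin [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    n_bins = max(counts) + 1
    return [counts.get(k, 0) for k in range(n_bins)]

# Hypothetical sample, made up for illustration.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 7, 8, 8, 9, 12]

# Narrow bins show every bump, wide bins hide the shape entirely.
for width in (1, 3, 13):
    print(f"bin width {width:>2}: {bin_counts(sample, width)}")
```

With a width of 1 the counts are noisy, with a width of 13 every observation lands in a single bar, and a width of 3 sits in between.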
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
\(\star\) To determine modality, step back and imagine a smooth curve over the histogram. Picture the bars as wooden blocks and drop a limp strand of spaghetti over them: the shape the spaghetti takes can be viewed as the smooth curve.
Is the histogram right skewed, left skewed, or symmetric?
\(\star\) Histograms are said to be skewed to the side of the long tail.
The measures of central tendency describe the central or typical value of a dataset, summarizing its distribution. The following are the common measures of central tendency:
Right skewed: Mode \(<\) Median \(<\) Mean
Left skewed: Mean \(<\) Median \(<\) Mode
Symmetric: Mean \(=\) Median \(=\) Mode
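The ordering Mode \(<\) Median \(<\) Mean, which is typical of data with a long right tail, can be checked on a small sample. This is a minimal sketch using Python's standard `statistics` module; the sample is hypothetical and chosen only to have a long right tail.

```python
import statistics

# Hypothetical right-skewed sample (long tail toward large values).
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mode = statistics.mode(right_skewed)      # most frequent value
median = statistics.median(right_skewed)  # middle of the sorted data
mean = statistics.mean(right_skewed)      # arithmetic average

print(mode, median, mean)
print(mode < median < mean)
```

The few large values pull the mean upward, while the median and mode stay near the bulk of the data.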
The measures of dispersion describe the variability of numerical data. They are a way to quantify the uncertainty of a distribution. The following are the common measures of dispersion:
Common variance interpretations:
Example: Order the following distributions from low to high variance.
The above distributions are ordered from lowest to highest variance as follows: B, D, C, and A.
Variance measures spread in squared units, while standard deviation is expressed in the original units of the data, which makes it the more practical measure of dispersion to interpret.
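The relationship between the two measures can be sketched as follows, assuming the sample variance with divisor \(n - 1\) (which the notation \(s^2\) suggests). The data values are hypothetical.

```python
import math
import statistics

# Hypothetical sample, made up for illustration.
data = [2, 4, 4, 5, 7, 8]

n = len(data)
mean = sum(data) / n
# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
# Standard deviation: square root of the variance, in the data's own units.
sd = math.sqrt(variance)

print(variance, sd)
# Cross-check against the standard library's implementations.
print(statistics.variance(data), statistics.stdev(data))
```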
Example: Suppose we analyze the annual salaries of employees in two different companies:
| Company | \(\overline{x}\) (thousands of dollars) | \(s^2\) (thousands of dollars, squared) | \(s\) (thousands of dollars) |
|---|---|---|---|
| A | 55 | 62.5 | 7.906 |
| B | 130 | 2250 | 47.434 |
\(\star\) Standard deviation helps compare the spread of distributions measured in the same units, because it is directly interpretable as the typical deviation from the mean.
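As a worked check of the table, treating its entries as thousands of dollars (an assumption about the units), each standard deviation is the square root of the corresponding variance:
\[s_A = \sqrt{62.5} \approx 7.906, \qquad s_B = \sqrt{2250} \approx 47.434\]
\[\frac{s_B}{s_A} = \sqrt{\frac{2250}{62.5}} = \sqrt{36} = 6\]
So salaries at Company B typically deviate from their mean about six times as much as salaries at Company A.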
The box in a box plot represents the middle 50% of the data, and the thick line in the box is the median.
Box Plot of Study Hours
Histogram of Study Hours
\(\star\) Box plots and histograms both visualize numerical data. Histograms show the shape of the distribution, while box plots summarize spread and outliers, which makes box plots better for comparisons.
Main Parts:
Whiskers of a box plot can extend up to \(1.5 \times \text{IQR}\) away from the quartiles. The multiplier 1.5 is arbitrary, but it is considered an academic standard and is the default when plotting box plots.
Example: Suppose that \(Q_1 = 10\) and \(Q_3 = 20\).
\[\text{IQR} = 20 - 10 = 10\] \[\text{max upper whisker reach} = 20 + 1.5 \times 10 = 35\] \[\text{max lower whisker reach} = 10 - 1.5 \times 10 = -5\]
\(\star\) A potential outlier is defined as an observation beyond the maximum reach of the whiskers. It is an observation that appears extreme relative to the rest of the data.
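The whisker arithmetic above can be sketched in a few lines. The function names `whisker_limits` and `is_potential_outlier` are hypothetical helpers introduced for illustration, not part of any library.

```python
def whisker_limits(q1, q3, k=1.5):
    """Return the (lower, upper) maximum reach of the whiskers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_potential_outlier(x, q1, q3):
    """True if x lies beyond the maximum reach of either whisker."""
    lower, upper = whisker_limits(q1, q3)
    return x < lower or x > upper

# Same quartiles as the example: Q1 = 10, Q3 = 20.
print(whisker_limits(10, 20))            # (-5.0, 35.0)
print(is_potential_outlier(40, 10, 20))  # True
```

Note that the whiskers themselves stop at the most extreme observations inside these limits, so the limits are a maximum reach rather than the drawn whisker ends.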
Example: The box plot and stacked dot plot of the data set \(7,1,2,4,6,3,2,7\) are shown below.
Box Plot
Stacked Dot Plot
\(\star\) A dot plot is used instead of a histogram because it is more convenient for a small data set of discrete values. The box plot shows no potential outliers, as all data points fall within the maximum reach of the whiskers.
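The claim that this data set has no potential outliers can be checked directly. The sketch below assumes quartiles computed by linear interpolation (`method="inclusive"` in Python's `statistics.quantiles`); other quartile conventions give slightly different values for a sample this small, so the quartiles in the lecture's plot may differ.

```python
import statistics

data = [7, 1, 2, 4, 6, 3, 2, 7]

# Quartiles by linear interpolation (one of several common conventions).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Maximum reach of the whiskers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower or x > upper]
print(q1, median, q3, iqr)
print(lower, upper, outliers)
```

Under this convention every observation lies between the two limits, so the list of potential outliers is empty.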
How would sample statistics such as mean, median, SD, and IQR of household income be affected if the largest value was replaced with $10 million? What if the smallest value was replaced with $10 million?
| data | \(\overline{x}\) | \(s\) | Median | IQR |
|---|---|---|---|---|
| original | 244,778.6 | 225,903.8 | 190,000 | 200,000 |
| replace largest with $10M | 566,207.1 | 1,830,200.6 | 190,000 | 200,000 |
| replace smallest with $10M | 316,121.4 | 854,489.2 | 200,000 | 200,000 |
\(\star\) The table shows that replacing a single value with an extreme one changes the mean substantially but leaves the median nearly unchanged, indicating the mean’s sensitivity to extreme observations. Similarly, the standard deviation changes substantially, while the IQR remains the same.
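The same experiment can be sketched on a small sample. The incomes below are hypothetical (they are not the household income data summarized in the table), and `summarize` is a helper introduced for illustration; quartiles again assume the linear-interpolation convention.

```python
import statistics

def summarize(values):
    """Return mean, SD, median, and IQR of a list of numbers."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": med,
        "iqr": q3 - q1,
    }

# Hypothetical incomes, made up for illustration.
incomes = [40_000, 55_000, 60_000, 75_000, 90_000, 120_000, 150_000]
# Replace the largest value with $10 million.
modified = sorted(incomes)[:-1] + [10_000_000]

original_stats = summarize(incomes)
modified_stats = summarize(modified)
for name in ("mean", "sd", "median", "iqr"):
    print(name, original_stats[name], modified_stats[name])
```

In this sketch the mean and SD jump by orders of magnitude, while the median and IQR do not move at all, because they depend only on the positions of the middle observations.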
Median and IQR are more robust to skewness and outliers than mean and SD. Therefore,