Examining Numerical Data
Elementary Statistics
MTH-161D | Spring 2025 | University of Portland
February 5, 2025
Objectives
- Understand how numerical data is examined through
visualizations
- Develop an understanding of various distribution
shapes
- Know how the measures of central tendency relate to
distribution shapes
- Activity: Identify the Shape of Distribution
These slides are derived from Diez et al. (2012).
Previously… (2/3)
Exploratory Analysis
It is the process of analyzing and summarizing datasets to uncover
patterns, trends, relationships, and anomalies before inference.
Descriptive statistics
It involves organizing, summarizing, and presenting data in an
informative way. It focuses on describing and understanding the main
features of a dataset.
For Numerical Variables
- Measures of Central Tendency
- Mean (Average), Median, and Mode
- Measures of Dispersion (Spread)
- Range, Variance, Standard Deviation, Interquartile Range (IQR)
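As a rough illustration of the spread measures listed above, the following sketch uses Python's standard `statistics` module. The `scores` list is hypothetical data invented for this example. Note that `statistics.variance` and `statistics.stdev` compute the sample versions (dividing by n - 1), and that quartile conventions differ between textbooks and software, so the IQR here may differ slightly from a hand calculation.

```python
import statistics

# Hypothetical exam scores, invented for illustration
scores = [62, 70, 74, 78, 80, 84, 88, 95]

# Range: largest value minus smallest value
data_range = max(scores) - min(scores)

# Sample variance and sample standard deviation (divide by n - 1)
variance = statistics.variance(scores)
std_dev = statistics.stdev(scores)

# Quartiles using the module's default method; IQR = Q3 - Q1
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("Range:", data_range)
print("Variance:", round(variance, 2))
print("Standard deviation:", round(std_dev, 2))
print("IQR:", iqr)
```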
For Categorical Variables
- Frequency
- Relative Frequency (Proportion)
- Percentage
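The three summaries for a categorical variable can be sketched in a few lines of Python. The `majors` list is hypothetical data invented for this example; the relative frequency is the count divided by the total, and the percentage is the relative frequency times 100.

```python
from collections import Counter

# Hypothetical categorical data, invented for illustration
majors = [
    "Biology", "Nursing", "Biology", "Engineering", "Nursing",
    "Biology", "Business", "Nursing", "Biology", "Engineering",
]

n = len(majors)

# Frequency: how many times each category occurs
frequency = Counter(majors)

# Relative frequency (proportion): count divided by the total
relative_frequency = {category: count / n for category, count in frequency.items()}

# Percentage: proportion multiplied by 100
percentage = {category: 100 * p for category, p in relative_frequency.items()}

for category in frequency:
    print(category, frequency[category],
          round(relative_frequency[category], 2),
          f"{percentage[category]:.0f}%")
```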
Previously… (3/3)
Inference
It is the process of drawing conclusions about a population based on
sample data. This involves using data from a sample to make
generalizations, predictions, or decisions about a larger group.
- Population: The entire group of individuals or
items that a study aims to understand.
- Sample: A subset of the population selected for
analysis to make inferences about the whole.
- Sampling Bias: A distortion in results caused by a
non-representative sample.
- Random Sampling: A method of selecting a sample
where each member of the population has an equal chance of being
chosen.
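A minimal sketch of simple random sampling, assuming the population can be listed as ID numbers. The population of 100 IDs, the sample size of 10, and the seed are all hypothetical choices made for this example.

```python
import random

# Hypothetical population: ID numbers 1 through 100
population = list(range(1, 101))

# A seeded generator so the example is reproducible (the seed is arbitrary)
rng = random.Random(161)

# Draw 10 distinct members; every member has the same chance of selection
sample = rng.sample(population, k=10)

print(sorted(sample))
```

Sampling without replacement, as `sample` does, guarantees that no member is chosen twice.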
Scatterplots
Scatterplots are useful for visualizing the
relationship between two numerical variables.
Do life expectancy and total fertility appear to be associated or
independent?
They appear to be linearly and negatively associated: as fertility
increases, life expectancy decreases.
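One common way to put a number on a linear association is the Pearson correlation coefficient, which this slide does not cover. The sketch below is an illustration only: the `fertility` and `life_expectancy` values are hypothetical and are not the data plotted on the slide, and `pearson_r` is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical values, invented for illustration
fertility = [1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.0]   # births per woman
life_expectancy = [82, 79, 76, 70, 64, 60, 57]    # years

r = pearson_r(fertility, life_expectancy)
print(f"r = {r:.3f}")  # negative r: life expectancy falls as fertility rises
```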
Dot Plots
Dot plots are useful for visualizing one numerical
variable. Darker colors represent areas where there are more
observations.

How would you describe the distribution of GPAs in this data set?
Dot Plots and the Mean

The mean, also called the average (marked with a triangle in the above
plot), is one way to measure the center of a distribution of data.
The mean GPA is 3.59.
Stacked Dot Plots
Higher bars represent areas where there are more observations, which
makes it a little easier to judge the center and the shape of the
distribution.

Histograms
Histograms provide a view of the data
density. Higher bars represent where the data are relatively more
common.
- Histograms are especially convenient for describing the
shape of the data distribution.
- The chosen bin width can alter the story the
histogram is telling.
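To see how bin width changes the picture, the sketch below counts the same data with three different widths. The `values` list is hypothetical, and `bin_counts` is a helper written for this example, not a library function.

```python
from collections import Counter

def bin_counts(data, bin_width, start=0):
    """Count observations in bins [start + k*w, start + (k+1)*w)."""
    counts = Counter()
    for x in data:
        k = int((x - start) // bin_width)   # index of the bin containing x
        left_edge = start + k * bin_width   # label each bin by its left edge
        counts[left_edge] += 1
    return dict(sorted(counts.items()))

# Hypothetical data, invented for illustration
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 8, 9, 9, 10, 11]

narrow = bin_counts(values, bin_width=1)      # many bins: shows every small bump
wide = bin_counts(values, bin_width=5)        # a few bins: shows the overall shape
very_wide = bin_counts(values, bin_width=20)  # one bin: hides the shape entirely

for left_edge, count in wide.items():
    print(f"[{left_edge}, {left_edge + 5}): " + "*" * count)
```

With a width of 1 the second cluster near 9 is visible; with a width of 20 every observation falls into a single bar.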
Bin Width of Histograms
Which one(s) of these histograms are useful? Which reveal too much
about the data? Which hide too much?

Distribution Shapes: Modality
Does the histogram have a single prominent peak
(unimodal), several prominent peaks
(bimodal/multimodal), or no apparent peaks
(uniform)?

\(\star\) Note: To determine modality, step back and imagine a smooth
curve over the histogram. Picture the bars as wooden blocks with a limp
strand of spaghetti draped over them; the shape the spaghetti takes can
be viewed as a smooth curve.
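The "smooth curve" idea can be imitated crudely in code: average each bar with its neighbors, then count the places where the smoothed heights rise and then fall. This is only a heuristic sketch, not a standard method. The functions `smooth` and `count_peaks` and the three lists of bar heights are hypothetical, and noisy real data would need more careful smoothing.

```python
def smooth(counts):
    """Average each bar with its immediate neighbors (3-bar moving average)."""
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - 1): i + 2]  # fewer bars at the two ends
        smoothed.append(sum(window) / len(window))
    return smoothed

def count_peaks(counts):
    """Count bars whose smoothed height is strictly above both neighbors."""
    s = smooth(counts)
    peaks = 0
    for i in range(len(s)):
        left = s[i - 1] if i > 0 else float("-inf")
        right = s[i + 1] if i < len(s) - 1 else float("-inf")
        if s[i] > left and s[i] > right:
            peaks += 1
    return peaks

# Hypothetical histogram bar heights, invented for illustration
unimodal = [1, 3, 6, 9, 6, 3, 1]
bimodal = [2, 6, 9, 5, 2, 5, 9, 6, 2]
uniform = [5, 5, 5, 5, 5]

print(count_peaks(unimodal), count_peaks(bimodal), count_peaks(uniform))
```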
Distribution Shapes: Skewness
Is the histogram right skewed, left
skewed, or symmetric?

\(\star\) Note:
Histograms are said to be skewed to the side of the long tail.
Commonly Observed Distribution Shapes

Measures of Central Tendency
The measures of central tendency describe the
central or typical value of a dataset, summarizing its distribution. The
following are the common measures of central tendency:
- Mean: The arithmetic average of all data points,
calculated as the sum of all values divided by the total number of
values.
- Median: The middle value of an ordered dataset. If
there is an even number of observations, the median is the average of
the two middle values.
- Mode: The most frequently occurring value(s) in a
dataset. A dataset may have one mode (unimodal), multiple modes
(multimodal), or no mode if all values occur with equal frequency.
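The three definitions above can be checked with Python's standard `statistics` module. The `quiz_scores` list is hypothetical data invented for this example. One caveat: `statistics.multimode` returns every value when all values are tied, whereas the definition above would say such a dataset has no mode.

```python
import statistics

# Hypothetical quiz scores, invented for illustration (already ordered)
quiz_scores = [3, 7, 7, 8, 9, 10]

# Mean: sum of all values divided by the number of values (44 / 6)
mean = statistics.mean(quiz_scores)

# Median: even number of values, so average the two middle values (7 and 8)
median = statistics.median(quiz_scores)

# Mode(s): the most frequently occurring value(s), returned as a list
modes = statistics.multimode(quiz_scores)

print("Mean:", round(mean, 2))
print("Median:", median)
print("Mode(s):", modes)

# A dataset with two modes
two_modes = statistics.multimode([1, 2, 2, 3, 3])
print("Two modes:", two_modes)
```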
Skewness and Measures of Central Tendency (1/3)
Mode \(<\) Median \(<\) Mean
- If the distribution of data is skewed to the right,
the mode is typically less than the median, which is typically less
than the mean.
Skewness and Measures of Central Tendency (2/3)
Mean \(<\) Median \(<\) Mode
- If the distribution of data is skewed to the left,
the mean is typically less than the median, which is typically less
than the mode.
Skewness and Measures of Central Tendency (3/3)
Mean \(=\) Median \(=\) Mode
- If the distribution of data is symmetric and unimodal, the
mean is roughly equal to the median, which is roughly equal to
the mode.
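The three orderings can be checked on small datasets. The lists below are hypothetical, chosen so that each one has a single clear mode; the left-skewed list is simply the mirror image of the right-skewed one. These orderings are tendencies rather than guarantees, so real data may not follow them exactly.

```python
import statistics

def summarize(data):
    """Return (mean, median, mode) for a dataset with a single mode."""
    return statistics.mean(data), statistics.median(data), statistics.mode(data)

# Hypothetical datasets, invented for illustration
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]   # long tail toward large values
left_skewed = [16 - x for x in right_skewed]     # mirror image: tail toward small values
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]          # same shape on both sides of 3

for name, data in [("right skewed", right_skewed),
                   ("left skewed", left_skewed),
                   ("symmetric", symmetric)]:
    mean, median, mode = summarize(data)
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

In the right-skewed list, the two large values (9 and 15) pull the mean upward while the median and mode stay near the bulk of the data.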
Activity: Identify the Shape of Distribution
- Make sure you have a copy of the W 2/5 Worksheet. This will
be handed out physically and it is also digitally available on
Moodle.
- Work on your worksheet by yourself for 10 minutes. Please read the
instructions carefully. Ask questions if anything needs
clarification.
- Get together with another student.
- Discuss your results.
- Submit your worksheet on Moodle as a .pdf file.