Examining Numerical Data

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

February 5, 2025

Objectives

Understand how numerical data is examined through visualizations
Develop an understanding of various distribution shapes
Know how the measures of central tendency relate to distribution shapes
Activity: Identify the Shape of Distribution

These slides are derived from Diez et al. (2012).

Previously… (1/3)

Types of Variables

Previously… (2/3)

Exploratory Analysis

It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.

Descriptive statistics

It involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

Measures of Central Tendency
- Mean (Average), Median, and Mode
Measures of Dispersion (Spread)
- Range, Variance, Standard Deviation, Interquartile Range (IQR)
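
As a quick illustration of the measures of dispersion listed above, here is a minimal sketch using Python's standard `statistics` module. The `scores` list is a made-up sample, not data from the course. Quartile conventions differ between textbooks and software, so the IQR from `statistics.quantiles` (default "exclusive" method) may not match a hand calculation exactly.

```python
import statistics

# Hypothetical sample of quiz scores (illustrative values only)
scores = [2, 4, 4, 5, 7, 9, 10, 11]

# Range: largest value minus smallest value
data_range = max(scores) - min(scores)

# Sample variance and sample standard deviation (both divide by n - 1)
sample_variance = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

# Quartiles; other quartile conventions can give slightly different values
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print("Range:", data_range)
print("Variance:", round(sample_variance, 3))
print("Standard deviation:", round(sample_sd, 3))
print("IQR:", iqr)
```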

For Categorical Variables

Frequency
Relative Frequency (Proportion)
Percentage
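
The three categorical summaries above can be sketched in a few lines of Python. The `responses` list is a hypothetical survey result invented for illustration.

```python
from collections import Counter

# Hypothetical answers to a categorical survey question (illustrative only)
responses = ["yes", "no", "yes", "yes", "maybe", "no", "yes", "no"]

# Frequency: how many times each category occurs
frequency = Counter(responses)

# Relative frequency (proportion): frequency divided by the number of observations
n = len(responses)
relative_frequency = {category: count / n for category, count in frequency.items()}

# Percentage: proportion multiplied by 100
percentage = {category: 100 * p for category, p in relative_frequency.items()}

for category in frequency:
    print(category, frequency[category], relative_frequency[category], percentage[category])
```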

Previously… (3/3)

Inference

It is the process of drawing conclusions about a population based on sample data. This involves using data from a sample to make generalizations, predictions, or decisions about a larger group.

Population: The entire group of individuals or items that a study aims to understand.
Sample: A subset of the population selected for analysis to make inferences about the whole.
Sampling Bias: A distortion in results caused by a non-representative sample.
Random Sampling: A method of selecting a sample where each member of the population has an equal chance of being chosen.
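
One way to picture random sampling as defined above is a simple random sample drawn without replacement. The sketch below assumes a hypothetical population of 100 numbered individuals; the seed value is arbitrary and only makes the draw repeatable.

```python
import random

# Hypothetical population: individuals labeled 1 through 100
population = list(range(1, 101))

# A seeded generator so the same sample is drawn on every run (seed is arbitrary)
rng = random.Random(161)

# Simple random sample of 10 individuals, drawn without replacement,
# so every member has the same chance of being selected
sample = rng.sample(population, k=10)

print(sorted(sample))
```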

Scatterplots

Scatterplots are useful for visualizing the relationship between two numerical variables.

Do life expectancy and total fertility appear to be associated or independent?

They appear to be linearly and negatively associated: as fertility increases, life expectancy decreases.

http://www.gapminder.org/world

Dot Plots

Dot plots are useful for visualizing one numerical variable. Darker colors represent areas where there are more observations.

How would you describe the distribution of GPAs in this data set?

Dot Plots and the Mean

The mean, also called the average (marked with a triangle in the above plot), is one way to measure the center of a distribution of data.

The mean GPA is 3.59.

Stacked Dot Plots

Higher bars represent areas where there are more observations, which makes it a little easier to judge the center and the shape of the distribution.

Histograms

Histograms provide a view of the data density. Higher bars represent where the data are relatively more common.

Histograms are especially convenient for describing the shape of the data distribution.
The chosen bin width can alter the story the histogram is telling.
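
To see how bin width changes the picture, the sketch below counts the same made-up exam scores with two different bin widths. The helper `histogram_counts` and the `scores` data are hypothetical and written only for this illustration. Integer values keep the bin arithmetic exact.

```python
def histogram_counts(values, bin_width, start):
    """Count values per bin [start + k*bin_width, start + (k+1)*bin_width).

    Returns a dict mapping each non-empty bin's left edge to its count.
    """
    counts = {}
    for v in values:
        k = (v - start) // bin_width
        left_edge = start + k * bin_width
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical exam scores (illustrative values only)
scores = [52, 55, 61, 64, 66, 68, 70, 71, 73, 75, 78, 79, 82, 85, 88, 93]

wide = histogram_counts(scores, bin_width=20, start=40)
narrow = histogram_counts(scores, bin_width=5, start=40)

print("Width 20:", wide)    # few bins: most detail is hidden
print("Width 5: ", narrow)  # many bins: more detail, but a bumpier shape
```

With wide bins almost all scores land in a single bar, while narrow bins show where the scores cluster within that range.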

Bin Width of Histograms

Which of these histograms are useful? Which reveal too much about the data? Which hide too much?

Distribution Shapes: Modality

Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?

\(\star\) Note: To determine modality, step back and imagine a smooth curve over the histogram. Picture the bars as wooden blocks with a limp strand of spaghetti dropped over them; the shape the spaghetti takes approximates that smooth curve.

Distribution Shapes: Skewness

Is the histogram right skewed, left skewed, or symmetric?

\(\star\) Note: Histograms are said to be skewed to the side of the long tail.

Commonly Observed Distribution Shapes

Measures of Central Tendency

The measures of central tendency describe the central or typical value of a dataset, summarizing its distribution. The following are the common measures of central tendency:

Mean: The arithmetic average of all data points, calculated as the sum of all values divided by the total number of values.
Median: The middle value of an ordered dataset. If there is an even number of observations, the median is the average of the two middle values.
Mode: The most frequently occurring value(s) in a dataset. A dataset may have one mode (unimodal), multiple modes (multimodal), or no mode if all values occur with equal frequency.
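
The three definitions above can be checked on a small example with Python's standard `statistics` module. The `data` list is a hypothetical dataset chosen so the arithmetic is easy to follow by hand.

```python
import statistics

# Hypothetical dataset (already sorted for readability)
data = [3, 5, 5, 6, 8, 9]

# Mean: (3 + 5 + 5 + 6 + 8 + 9) / 6 = 36 / 6
mean = statistics.mean(data)

# Median: even number of observations, so average the two middle values 5 and 6
median = statistics.median(data)

# Mode(s): multimode returns a list of every most frequent value
modes = statistics.multimode(data)

print(mean)    # 6
print(median)  # 5.5
print(modes)   # [5]
```

Note that when every value occurs equally often, `statistics.multimode` returns all of the values, which corresponds to the "no mode" case described above.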

Skewness and Measures of Central Tendency (1/3)

Mode \(<\) Median \(<\) Mean

If the distribution of data is skewed to the right, the mode is typically less than the median, which is typically less than the mean.

Skewness and Measures of Central Tendency (2/3)

Mean \(<\) Median \(<\) Mode

If the distribution of data is skewed to the left, the mean is typically less than the median, which is typically less than the mode.

Skewness and Measures of Central Tendency (3/3)

Mean \(\approx\) Median \(\approx\) Mode

If the distribution of data is symmetric and unimodal, the mean is roughly equal to the median, which is roughly equal to the mode.
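
The relationships between skewness and the mean, median, and mode can be checked numerically. The sketch below uses three small made-up datasets (one right skewed, its mirror image as a left-skewed set, and one symmetric); the helper `summarize` is hypothetical and assumes each dataset has a single mode. These orderings are rules of thumb, so real data will not always follow them this neatly.

```python
import statistics

def summarize(values):
    """Return (mean, median, mode) for a dataset with a single mode."""
    return statistics.mean(values), statistics.median(values), statistics.mode(values)

# Hypothetical datasets (illustrative values only)
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]   # long tail toward large values
left_skewed = [20 - x for x in right_skewed]     # mirror image: long tail toward small values
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

for name, values in [("right skewed", right_skewed),
                     ("left skewed", left_skewed),
                     ("symmetric", symmetric)]:
    mean, median, mode = summarize(values)
    print(f"{name}: mean={mean}, median={median}, mode={mode}")
```

For these values the right-skewed set gives mode 2 < median 3 < mean 5, the left-skewed set gives mean 15 < median 17 < mode 18, and all three measures equal 3 for the symmetric set.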

Activity: Identify the Shape of Distribution

Make sure you have a copy of the W 2/5 Worksheet. Paper copies will be handed out, and a digital copy is available on Moodle.
Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
Get together with another student.
Discuss your results.
Submit your worksheet on Moodle as a .pdf file.