Data Basics & Descriptive Statistics

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

January 24, 2025

Objectives

Develop an understanding of different types of variables
Know how to compute the mean, median, frequency, and proportion
Activity: Identify types of variables & compute descriptive statistics

Previously… (1/2)

The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Previously… (2/2)

Basic Parts of RStudio

Types of Variables

Numerical: Discrete vs Continuous

Numerical variables are quantitative variables that represent measurable amounts or quantities.

Discrete

Take distinct, separate values (often whole numbers).
Typically countable.
Examples:
- Number of students in a class (e.g., 25 students).
- Books sold per day (e.g., 10 books/day).

Continuous

Can take any value within a range, including fractions and decimals.
Measured rather than counted.
Examples:
- Height of a person (e.g., 5.7 ft).
- Time taken to complete a task (e.g., 3.25 hours).

Categorical: Nominal vs Ordinal

Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.

Nominal

Categories have no natural order or ranking.
Used for labeling without quantitative significance.
Examples:
- Colors of cars (e.g., red, blue, green).
- Types of animals (e.g., cat, dog, bird).

Ordinal

Categories have a meaningful order or ranking, but the differences between ranks are not necessarily equal or measurable.
Examples:
- Education level (e.g., high school, bachelor’s, master’s).
- Customer satisfaction (e.g., satisfied, neutral, dissatisfied).
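
A minimal sketch of how these four variable types might be represented in R. All values and variable names below are made up for illustration. Using `factor()` for categorical data is one common convention, not the only option.

```r
# Illustrative values only (not from a real data set).
students_per_class <- c(25L, 30L, 28L)  # discrete: whole-number counts
height_ft <- c(5.7, 6.1, 5.4)           # continuous: measured values

# Nominal: categories with no natural order.
car_color <- factor(c("red", "blue", "green", "red"))

# Ordinal: the order is stated explicitly through `levels`.
education <- factor(
  c("bachelor's", "high school", "master's"),
  levels = c("high school", "bachelor's", "master's"),
  ordered = TRUE
)

class(students_per_class)    # "integer"
class(height_ft)             # "numeric"
is.ordered(car_color)        # FALSE
is.ordered(education)        # TRUE
education[1] < education[3]  # TRUE: bachelor's ranks below master's
```

Comparisons such as `<` work only on the ordered factor, which mirrors the nominal vs ordinal distinction above.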

Case Study 1

An experiment is evaluating the effectiveness of a new drug in treating migraines. A group variable is used to indicate the experiment group for each patient: treatment or control. The number of migraines variable represents the number of migraines the patient experienced during a 3-month period. Classify each variable as either numerical or categorical.

Variables:

The group variable has two categories: treatment and control. So, group is a nominal categorical variable because its categories have no natural order.
The number of migraines variable is a discrete numerical variable because its values are countable whole numbers.

R Variables

How are variables created and stored in R?
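
As a rough sketch, variables in R are typically created with the assignment operator `<-`, and several values can be combined into a vector with `c()`. The names `score` and `scores` are hypothetical and chosen only for this illustration.

```r
# `score` and `scores` are hypothetical names used for illustration.
score <- 4.5          # a single number (a numeric vector of length 1)
scores <- c(7, 1, 2)  # c() combines values into one vector

score           # typing a name prints the stored value
length(scores)  # number of elements in the vector
class(scores)   # type of the stored values
```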

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?

Descriptive Statistics

Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

Measures of Central Tendency
- Mean (Average), Median, and Mode
Measures of Dispersion (Spread)
- Range, Variance, Standard Deviation, Interquartile Range (IQR)
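
The measures of spread listed above can be sketched with base R functions. The vector is the eight-value data set used in the mean and median examples of these slides. Note the assumptions: `var()` and `sd()` compute the sample versions (dividing by \(n-1\)), and `IQR()` uses R's default quantile method, so hand calculations with a different quartile rule may differ slightly.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

range(num_data)        # smallest and largest value
diff(range(num_data))  # range as a single number: max - min
var(num_data)          # sample variance (divides by n - 1)
sd(num_data)           # sample standard deviation = sqrt(variance)
IQR(num_data)          # interquartile range: Q3 - Q1
```

One caution: `mode()` in R reports how an object is stored, not the most frequent value, so it does not compute the statistical mode.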

For Categorical Variables

Frequency
Relative Frequency (Proportion)
Percentage

The Mean

The sample mean, denoted as \(\overline{x}\), can be calculated as \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where \(x_1, x_2, \ldots, x_n\) represent the \(n\) observed values.

In other words, the mean (or average) is the sum of all data points divided by the number of points: \[\text{mean} = \frac{\text{sum of all data points}}{\text{number of data points}}.\]
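
The formula can be followed step by step in R. This is a small sketch with made-up values; `x`, `n`, and `x_bar` are names chosen here to match the notation above.

```r
x <- c(2, 4, 9)      # illustrative values, not from the slides
n <- length(x)       # number of data points
x_bar <- sum(x) / n  # sum of all data points divided by n

x_bar                # 5
x_bar == mean(x)     # TRUE: matches the built-in mean()
```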

Computing the Mean

Example: What is the mean of the data set \(7,1,2,4,6,3,2,7\)?

Manual Computation:

\[ \begin{aligned} \overline{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = \frac{32}{8} \\ & = 4 \end{aligned} \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
mean(num_data)

## [1] 4

So, the mean is \(4\).

The Median

The median is the middle value when the data is sorted.

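The median is computed using the following cases:

If the number of data points \(n\) is odd, the median is the single middle value of the sorted data.
If \(n\) is even, the median is the average of the two middle values of the sorted data.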
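
The odd and even cases can be sketched as a small R function. `median_by_cases` is a hypothetical helper written for illustration, and the input values are made up. In practice the built-in `median()` does the same job.

```r
# Hypothetical helper: computes the median by the odd/even cases.
median_by_cases <- function(x) {
  s <- sort(x)    # the median is defined on sorted data
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                 # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2  # even n: average of the two middle values
  }
}

median_by_cases(c(5, 1, 3))     # odd n: sorted 1, 3, 5, so the median is 3
median_by_cases(c(5, 1, 3, 9))  # even n: sorted 1, 3, 5, 9, so (3 + 5) / 2 = 4
```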

Computing the Median

Example: What is the median of the data set \(7,1,2,4,6,3,2,7\)?

Manual Computation:

The number of data points is \(8\), an even number. \[\text{sorted data} \longrightarrow 1,2,2,\color{blue}{\mathbf{3}},\color{blue}{\mathbf{4}},6,7,7\]

\[ \begin{aligned} \text{median} & = \frac{\text{sum of two middle values}}{2} \\ & = \frac{\color{blue}{\mathbf{3}}+\color{blue}{\mathbf{4}}}{2} \\ & = 3.5 \end{aligned} \]

Using R:

num_data <- c(7,1,2,4,6,3,2,7)
sort(num_data)

## [1] 1 2 2 3 4 6 7 7

median(num_data)

## [1] 3.5

So, the median is \(3.5\).

The Frequency

The frequency is the number of observations in each category.

The method of computing the frequencies of a categorical variable is as follows:

List all unique categories in the data
Count the number of observations of each category
List the counts with their corresponding unique categories
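
The three steps above can be sketched directly in R. The `answers` vector is made-up data for illustration. The built-in `table()` performs all three steps at once.

```r
# Illustrative data (hypothetical survey answers).
answers <- c("yes", "no", "yes", "yes", "maybe", "no")

# Step 1: list all unique categories.
categories <- unique(answers)

# Step 2: count the observations in each category.
counts <- sapply(categories, function(k) sum(answers == k))

# Step 3: the counts are labeled with their categories.
counts
```

Here `counts` is a named vector with 3 for yes, 2 for no, and 1 for maybe.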

Computing the Frequency

Example: How many b and g are there in the data listed below?

\[b, g, g, b, b, g, b, b\]

Manual Computation:

The unique categories are b and g.

Number of occurrences: \[ \begin{aligned} b,b,b,b,b & \longrightarrow 5 \\ g,g,g & \longrightarrow 3 \end{aligned} \]

Using R:

cat_data <- c("b","g","g","b","b","g","b","b")
table(cat_data)

## cat_data
## b g 
## 5 3

So, there are \(5\) b and \(3\) g.

The Relative Frequency and Percentage

The relative frequency is the proportion of observations in each category.

The proportion is computed using the formula \[\text{proportion of a category} = \frac{\text{number of cases of a category}}{\text{total number of cases}}.\]

The percentage is the relative frequency multiplied by 100: \[\text{percentage of a category} = \text{proportion of a category } \times 100.\]
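
Both formulas can be sketched in R starting from a frequency table. The data below are made up for illustration. `prop.table()` divides each count by the total, which is equivalent to `freq / sum(freq)`.

```r
# Illustrative data (not from the slides).
letters_seen <- c("a", "c", "a", "a")

freq <- table(letters_seen)  # frequency of each category
prop <- prop.table(freq)     # proportion = count / total number of cases
pct <- prop * 100            # percentage = proportion x 100

prop  # a: 0.75, c: 0.25
pct   # a: 75,   c: 25
```

The proportions always sum to 1 and the percentages to 100, which is a quick check on a manual computation.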

Computing the Relative Frequency

Example: What are the proportions of b and g in the data listed below?