Data Basics & Descriptive Statistics

Elementary Statistics

MTH-161D | Spring 2025 | University of Portland

January 24, 2025

Objectives

Previously… (1/2)

The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Previously… (2/2)

Basic Parts of RStudio

Types of Variables

Numerical: Discrete vs Continuous

Numerical variables are quantitative variables that represent measurable amounts or quantities.

Discrete

Continuous
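
As a rough illustration, discrete variables take countable values (such as counts), while continuous variables can take any value within a range (such as measurements). The sketch below uses made-up values and hypothetical variable names (`n_siblings`, `height_cm`). Note that R's storage type is only a loose hint: a discrete variable can also be stored as a double.

```r
# Hypothetical values, for illustration only.
n_siblings <- c(2L, 0L, 1L, 3L)              # discrete: whole-number counts
height_cm  <- c(170.2, 165.8, 181.0, 158.4)  # continuous: measurements

typeof(n_siblings)  # "integer"
typeof(height_cm)   # "double"

# Both are numerical variables as far as R is concerned.
is.numeric(n_siblings)
is.numeric(height_cm)
```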

Categorical: Nominal vs Ordinal

Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.

Nominal

Ordinal
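
One possible way to represent the two kinds of categorical variables in R is with factors: an unordered factor for a nominal variable and an ordered factor for an ordinal one. The variable names and values below (`blood_type`, `shirt_size`) are hypothetical and only meant as a sketch.

```r
# Hypothetical values, for illustration only.
# Nominal: categories with no natural order.
blood_type <- factor(c("A", "O", "B", "O", "AB"))

# Ordinal: categories with a natural order (S < M < L).
shirt_size <- factor(c("M", "S", "L", "M"),
                     levels = c("S", "M", "L"),
                     ordered = TRUE)

is.ordered(blood_type)  # FALSE
is.ordered(shirt_size)  # TRUE

# Comparisons are meaningful only for the ordered factor.
shirt_size[1] < shirt_size[3]  # is M smaller than L? TRUE
```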

Case Study 1

An experiment is evaluating the effectiveness of a new drug in treating migraines. A group variable is used to indicate the experiment group for each patient: treatment or control. The number of migraines variable represents the number of migraines the patient experienced during a 3-month period. Classify each variable as either numerical or categorical.

Variables:
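
To make the two variables concrete, here is a small sketch of how such data might be stored in R. The four patient records are invented for illustration and are not data from the experiment; the data frame name `patients` and column name `migraines` are assumptions.

```r
# Hypothetical records for four patients; not data from the study.
patients <- data.frame(
  group     = factor(c("treatment", "control", "treatment", "control")),
  migraines = c(3L, 5L, 0L, 4L)
)

str(patients)                    # shows the type of each column

is.factor(patients$group)        # group holds labels: categorical
is.numeric(patients$migraines)   # migraines holds counts: numerical
```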

R Variables

How are variables created and stored in R?

\(\dagger\) Create a variable named ria and initially define it as a number. Then, redefine ria as a vector. What happens to the original value of ria (the number) after it is redefined as a vector?
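
A minimal sketch of this exercise, assuming the values `5` and `c(1, 2, 3)` (any number and any vector would do): the assignment operator `<-` binds a name to a value, and assigning to the same name again replaces the earlier value.

```r
ria <- 5            # define ria as a single number
ria                 # prints 5

ria <- c(1, 2, 3)   # redefine ria as a vector
ria                 # prints 1 2 3; the original value 5 is no longer stored

length(ria)         # 3
```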

Descriptive Statistics

Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It focuses on describing and understanding the main features of a dataset.

For Numerical Variables

For Categorical Variables

The Mean

The sample mean, denoted as \(\overline{x}\), can be calculated as \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where \(x_1, x_2, \ldots, x_n\) represent the \(n\) observed values.

In other words, the mean (or average) is the sum of all data points divided by the number of points: \[\text{mean} = \frac{\text{sum of all data points}}{\text{number of data points}}.\]

Computing the Mean

Example: What is the mean of the data set \(7,1,2,4,6,3,2,7\)?

\[ \begin{aligned} \overline{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]

num_data <- c(7,1,2,4,6,3,2,7)
mean(num_data)
## [1] 4

So, the mean is \(4\).
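
As a check, the same result can be obtained by applying the formula directly, that is, by dividing the sum of the data points by the number of data points. The names `total`, `n`, and `manual_mean` are introduced here for illustration.

```r
num_data <- c(7, 1, 2, 4, 6, 3, 2, 7)

total <- sum(num_data)     # sum of all data points: 32
n     <- length(num_data)  # number of data points: 8

manual_mean <- total / n
manual_mean                # 4

# Agrees with the built-in function.
manual_mean == mean(num_data)
```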

The Median

The median is the middle value when the data is sorted.

The median is computed using the following cases:

Computing the Median

Example: What is the median of the data set \(7,1,2,4,6,3,2,7\)?

The number of data points is \(8\), an even number. \[\text{sorted data} \longrightarrow 1,2,2,\color{blue}{\mathbf{3}},\color{blue}{\mathbf{4}},6,7,7\]

\[ \begin{aligned} \text{median} & = \frac{\text{sum of two middle values}}{2} \\ & = \frac{\color{blue}{\mathbf{3}}+\color{blue}{\mathbf{4}}}{2} \\ & = 3.5 \end{aligned} \]

num_data <- c(7,1,2,4,6,3,2,7)
sort(num_data)
## [1] 1 2 2 3 4 6 7 7
median(num_data)
## [1] 3.5

So, the median is \(3.5\).
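
The sketch below spells out the usual two cases by hand, assuming the standard rule: for an odd number of data points the median is the single middle value of the sorted data, and for an even number it is the average of the two middle values. The function name `median_by_cases` is hypothetical; in practice `median()` does this for you.

```r
median_by_cases <- function(x) {
  s <- sort(x)        # sort the data first
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even n: average of the two middle values
  }
}

# Even case (the example above): sorted data 1 2 2 3 4 6 7 7
median_by_cases(c(7, 1, 2, 4, 6, 3, 2, 7))  # 3.5

# Odd case (last value dropped): sorted data 1 2 2 3 4 6 7
median_by_cases(c(7, 1, 2, 4, 6, 3, 2))     # 3
```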

The Frequency

The frequency is the number of observations in each category.

The method of computing the frequencies of a categorical variable is as follows:

Computing the Frequency

Example: How many b and g are there in the data listed below?

\[b, g, g, b, b, g, b, b\]

The unique categories are b and g.

Number of occurrences: \[ \begin{aligned} b,b,b,b,b & \longrightarrow 5 \\ g,g,g & \longrightarrow 3 \end{aligned} \]

cat_data <- c("b","g","g","b","b","g","b","b")
table(cat_data)
## cat_data
## b g 
## 5 3

So, there are \(5\) b and \(3\) g.
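
The two steps in the example (find the unique categories, then count the occurrences of each) can also be written out explicitly. This is only a sketch of what `table()` summarizes; the names `categories` and `counts` are introduced here for illustration.

```r
cat_data <- c("b", "g", "g", "b", "b", "g", "b", "b")

# Step 1: find the unique categories.
categories <- unique(cat_data)   # "b" "g"

# Step 2: count how many observations fall in each category.
counts <- sapply(categories, function(k) sum(cat_data == k))
counts                           # b: 5, g: 3
```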

The Relative Frequency and Percentage

The relative frequency is the proportion of observations in each category.

The proportion is computed using the formula \[\text{proportion of a category} = \frac{\text{number of cases of a category}}{\text{total number of cases}}.\]

The percentage is the relative frequency multiplied by 100: \[\text{percentage of a category} = \text{proportion of a category } \times 100.\]

Computing the Relative Frequency

Example: What are the proportions of b and g in the data listed below?

\[b, g, g, b, b, g, b, b\]

The numbers of occurrences are the same as in the previous example.

Number of cases: \(8\)

Proportions: \[ \begin{aligned} b & \longrightarrow \frac{5}{8} = 0.625 \\ g & \longrightarrow \frac{3}{8} = 0.375 \end{aligned} \]

cat_data <- c("b","g","g","b","b","g","b","b")
table(cat_data)/length(cat_data)
## cat_data
##     b     g 
## 0.625 0.375

So, the proportions are \(0.625\) for b and \(0.375\) for g; that is, \(62.5\)% of the data are b and \(37.5\)% are g.
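
The percentages can be computed in R as well. The sketch below uses `prop.table()`, a base R function that converts a table of counts into proportions, as an alternative to dividing by `length(cat_data)`; the names `props` and `percents` are introduced here for illustration.

```r
cat_data <- c("b", "g", "g", "b", "b", "g", "b", "b")

props <- prop.table(table(cat_data))  # relative frequencies: b 0.625, g 0.375
percents <- props * 100               # percentages: b 62.5, g 37.5
percents

sum(props)                            # proportions add up to 1
```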

Activity: Identify types of variables & compute descriptive statistics

  1. Make sure you have a copy of the F 1/24 Worksheet. This will be handed out physically and it is also digitally available on Moodle. Then, access your Calculator [First name] [First letter of last name] in Posit Cloud.
  2. Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
  3. Get together with another student.
  4. Discuss your results.
  5. Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/