MTH-161D | Spring 2025 | University of Portland
January 24, 2025
The guiding principle of statistics is statistical thinking.
Statistical Thinking in the Data Science Life Cycle
Basic Parts of R Studio
Types of Variables
Numerical variables are quantitative variables that represent measurable amounts or quantities.
Discrete
Continuous
Categorical variables are qualitative variables that represent labels or categories. They describe characteristics and are not inherently numerical.
Nominal
Ordinal
An experiment is evaluating the effectiveness of a new drug in treating migraines. A group variable is used to indicate the experiment group for each patient: treatment or control. The number of migraines variable represents the number of migraines the patient experienced during a 3-month period. Classify each variable as either numerical or categorical.
Variables:
How are variables created and stored in R?
\(\dagger\) Create a variable named
ria
and initially define it as a number. Then, redefine
ria
as a vector. What happens to the original value of
ria
(the number) after it is redefined as a vector?
Descriptive statistics involves organizing, summarizing, and presenting data in an informative way. It Focuses on describing and understanding the main features of a dataset.
For Numerical Variables
For Categorical Variables
The sample mean, denoted as \(\overline{x}\), can be calculated as \[\overline{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}\] where \(x_1, x_2, \cdots, x_n\) represent the \(n\) observed values.
In other words, the mean (or average) is the sum of all data points divided by the number of points: \[\text{mean} = \frac{\text{sum of all data points}}{\text{number of data points}}.\]
Example: What is the mean of the data set \(7,1,2,4,6,3,2,7\)?
\[ \begin{aligned} \overline{x} & = \frac{7+1+2+4+6+3+2+7}{8} \\ & = 4 \end{aligned} \]
So, the mean is \(4\).
The median is the middle value when the data is sorted.
The median is computed using the following cases:
Example: What is the median of the data set \(7,1,2,4,6,3,2,7\)?
The number of data points is \(8\), an even number. \[\text{sorted data} \longrightarrow 1,2,2,\color{blue}{\mathbf{3}},\color{blue}{\mathbf{4}},6,7,7\]
\[ \begin{aligned} \text{median} & = \frac{\text{sum of two middle values}}{2} \\ & = \frac{\color{blue}{\mathbf{3}}+\color{blue}{\mathbf{4}}}{2} \\ & = 3.5 \end{aligned} \]
## [1] 1 2 2 3 4 6 7 7
## [1] 3.5
So, the median is \(3.5\).
The frequency is the number of observations in each category.
The method of computing the frequencies of a categorical variable is as follows:
Example: How many b and g are there in the data listed below?
\[b, g, g, b, b, g, b, b\]
The unique categories are b and g.
Number of occurrences: \[ \begin{aligned} b,b,b,b,b & \longrightarrow 5 \\ g,g,g & \longrightarrow 3 \end{aligned} \]
So, there are \(5\) b and \(3\) g.
The relative frequency is the proportion of observations in each category.
The proportion is computed using the formula \[\text{proportion of a category} = \frac{\text{number of cases of a category}}{\text{total number of cases}}.\]
The percentage is the relative frequency multiplied by 100: \[\text{percentage of a category} = \text{proportion of a category } \times 100.\]
Example: What are the proportions of b and g in the data listed below?
\[b, g, g, b, b, g, b, b\]
The number of occurrences are the same as the previous example.
Number of cases: \(8\)
Proportions: \[ \begin{aligned} b & \longrightarrow \frac{5}{8} = 0.625 \\ g & \longrightarrow \frac{3}{8} = 0.375 \end{aligned} \]
## cat_data
## b g
## 0.625 0.375
So, there are \(0.625\) b and \(0.375\) g, or \(62.5\)% b and \(37.5\)% g.
.pdf
file.