MTH-161D | Spring 2025 | University of Portland
March 12, 2025
These slides are derived from Diez et al. (2012).
The guiding principle of statistics is statistical thinking.
Statistical Thinking in the Data Science Life Cycle
Types of Inference
| Parameter Estimation | Hypothesis Testing |
---|---|---|
Goal | Estimate an unknown population value | Assess claims about a population value |
Methods | Point estimation: a single-value estimate (e.g., sample mean). Interval estimation: a range of plausible values (e.g., confidence interval). | State a null and an alternative hypothesis. Compute a test statistic, then compare it to a critical value (or compare its p-value to a significance level). |
Key Concept | Focuses on precision in estimation (confidence intervals) | Focuses on decision-making based on evidence (reject or fail to reject the null hypothesis) |
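As a rough illustration of the estimation column, the sketch below computes a point estimate and an approximate 95% confidence interval for a proportion. The counts (887 supporters out of 1,000) and the helper name `proportion_estimates` are hypothetical, and the interval uses the common normal approximation, which is one choice among several.

```python
from statistics import NormalDist

def proportion_estimates(successes, n, confidence=0.95):
    """Return (point estimate, (lower, upper)) for a population proportion.

    Hypothetical helper: uses the normal-approximation interval
    p_hat +/- z* * sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = successes / n                          # point estimate
    se = (p_hat * (1 - p_hat) / n) ** 0.5          # estimated standard error
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return p_hat, (p_hat - z_star * se, p_hat + z_star * se)

# Made-up sample: 887 of 1,000 respondents support the proposal.
p_hat, (low, high) = proportion_estimates(successes=887, n=1000)
print(f"point estimate: {p_hat:.3f}")
print(f"95% interval:   ({low:.3f}, {high:.3f})")
```

The point estimate is a single number, while the interval reports a range of plausible values around it.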
The normal r.v. \(X \sim \text{N}(\mu,\sigma^2)\) has infinitely many possible outcomes (an infinite sample space), where \(\mu\) is the mean and \(\sigma^2\) is the variance (\(\sigma\) is the standard deviation), with PDF given by the continuous curve below.
Key idea: the Central Limit Theorem (CLT). Image source: Medium–AI/Data Science Digest
\(\star\) Key Idea: The CLT says that the distribution of the sample mean (or sum) of many independent and identically distributed random variables with finite variance approaches a normal distribution as the sample size grows, regardless of the shape of the original distribution.
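A small simulation can make the CLT concrete. The sketch below assumes a skewed Exponential(1) population (mean 1, standard deviation 1), a sample size of 50, and 2,000 repeated samples; all three choices are illustrative and not taken from the slides.

```python
import random
import statistics

def sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from Exponential(1) and return each sample's mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

means = sample_means(n=50, reps=2000)

center = statistics.fmean(means)   # expected to be close to the population mean, 1
spread = statistics.stdev(means)   # expected to be close to 1 / sqrt(50), about 0.14

# If the sample means are roughly normal, about 68% fall within one
# standard deviation of their center.
within_1sd = sum(abs(m - center) <= spread for m in means) / len(means)

print(f"center of sample means: {center:.3f}")
print(f"spread of sample means: {spread:.3f}")
print(f"share within 1 SD:      {within_1sd:.3f}")
```

Even though individual draws are strongly right-skewed, the sample means cluster symmetrically around the population mean.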
If we randomly sample 1,000 adults from each U.S. state, would the sample means of their heights be:
\(\star\) The sample means would not be exactly the same, but they would differ only somewhat, because of sampling variability.
Suppose the proportion of American adults who support the expansion of solar energy is \(p = 0.88\).
\(\star\) \(p=0.88\) is a population parameter because it describes all American adults. This is a high proportion of support, so a randomly selected American adult is much more likely than not to support solar energy expansion.
Suppose that you don’t have access to the population of all American adults, which is quite a likely scenario. To estimate the proportion of American adults who support solar energy expansion, you might sample from the population and use your sample proportion as the best guess for the unknown population proportion.
\(\star\) Key Idea: If the sampling process described above is repeated many times, the resulting distribution of sample proportions will be approximately normal.
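The repeated-sampling idea can be sketched in code. The population proportion \(p = 0.88\) comes from the example; the sample size of 1,000, the 1,000 repetitions, and the helper name `sample_proportion` are assumptions made for illustration.

```python
import random
import statistics

P_TRUE = 0.88   # population proportion from the example
N = 1000        # sample size (assumed for illustration)
REPS = 1000     # number of repeated samples (assumed for illustration)

rng = random.Random(1)

def sample_proportion(n, p):
    """Simulate one sample of n adults and return the share who support expansion."""
    return sum(rng.random() < p for _ in range(n)) / n

p_hats = [sample_proportion(N, P_TRUE) for _ in range(REPS)]

center = statistics.fmean(p_hats)                 # expected to be near 0.88
spread = statistics.stdev(p_hats)                 # simulated standard error
theory_se = (P_TRUE * (1 - P_TRUE) / N) ** 0.5    # sqrt(p(1 - p)/n), about 0.0103

print(f"center of sample proportions: {center:.4f}")
print(f"simulated standard error:     {spread:.4f}")
print(f"theoretical standard error:   {theory_se:.4f}")
```

A histogram of `p_hats` would look roughly bell-shaped and centered near 0.88, which is what a single observed sample proportion is drawn from.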
\(\dagger\) Based on this distribution, what do you think is the true population proportion?
Sampling distributions are never observed
\(\star\) Key Idea: Understanding the sampling distribution will help us characterize and make sense of the point estimates that we do observe.
The normal r.v. \(X \sim \text{N}(\mu,\sigma^2)\) has infinitely many possible outcomes (an infinite sample space), where \(\mu\) is the mean and \(\sigma^2\) is the variance (\(\sigma\) is the standard deviation), with PDF given by the continuous curve below.
Within 1 standard deviation of the mean
\[P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68\]
Within 2 standard deviations of the mean
\[P(\mu - 2\sigma \le X \le \mu + 2\sigma) \approx 0.95\]
Within 3 standard deviations of the mean
\[P(\mu - 3\sigma \le X \le \mu + 3\sigma) \approx 0.997\]
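These three probabilities can be checked numerically. The sketch below uses `NormalDist` from Python's standard library; the helper name `prob_within` and the example values \(\mu = 170\), \(\sigma = 10\) are hypothetical.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """P(mu - k*sigma <= X <= mu + k*sigma) for X ~ N(mu, sigma^2)."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# The result does not depend on mu or sigma; 170 and 10 are arbitrary.
for k in (1, 2, 3):
    print(k, round(prob_within(k, mu=170, sigma=10), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

The same values come out for any choice of \(\mu\) and \(\sigma\), which is why the rule can be stated in units of standard deviations.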
The Normal PDF satisfies the probability axioms
\[P(-\infty < X < \infty) = 1\]
\(\star\) Key Idea: Because of the axiom that the sum of the probabilities for all outcomes in the sample space is equal to 1, the total area under the Normal PDF is always 1.
The standard normal distribution is the special case with \(\mu=0\) and \(\sigma=1\), written \(Z \sim \text{N}(0,1)\).
The transformation formula (the z-score)
Standardized scores that measure how many standard deviations a value is from the mean. \[Z = \frac{X - \mu}{\sigma}\]
The standard normal distribution, \(Z \sim \text{N}(0,1)\).
\(\star\) Key Idea: The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. It serves as a reference distribution, allowing any normally distributed variable to be standardized.
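A short sketch of standardization follows. The numbers (\(\mu = 100\), \(\sigma = 15\), \(x = 130\)) and the helper name `z_score` are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical values: X ~ N(100, 15^2), observed x = 130.
z = z_score(130, mu=100, sigma=15)     # (130 - 100) / 15 = 2.0

# Probabilities agree whether computed on the original or the standardized scale.
below_z = NormalDist().cdf(z)          # P(Z <= 2) under N(0, 1)
below_x = NormalDist(100, 15).cdf(130) # P(X <= 130) under N(100, 15^2)

print(z, round(below_z, 4), round(below_x, 4))
```

Because both probabilities match, a single reference distribution is enough for any normal variable once it has been standardized.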
Example:
Take a random sample of students at a college and ask them how many extracurricular activities they are involved in, in order to estimate the average number (or median number) of extracurricular activities that all students at this college are involved in.
\(\star\) Key Idea: The principles and general ideas of CLT apply to other parameters as well, even if the details change a little.