MTH-361A | Spring 2025 | University of Portland
March 14, 2025
The guiding principle of statistics is statistical thinking.
Statistical Thinking in the Data Science Life Cycle
Types of Inference
Parameter Estimation | Hypothesis Testing | |
---|---|---|
Goal | Estimate an unknown population value | Assess claims about a population value |
Methods | Point Estimation: A single value estimate (e.g., sample
mean) Interval Estimation: A range of plausible values (e.g., confidence interval) |
State a null and an alternative hypothesis Compute a test statistic and compare it to a threshold (p-value or critical value) |
Key Concept | Focuses on precision in estimation (confidence intervals) | Focuses on decision-making based on evidence (reject or fail to reject the null hypothesis) |
The standard normal distribution is when \(\mu=0\) and \(s=1\) or \(Z \sim \text{N}(0,1)\).
The transformation formula (the z-score)
Standardized scores that measure how many standard deviations a value is from the mean. \[Z = \frac{X - \mu}{\sigma}\]
The standard normal distribution, \(Z \sim \text{N}(0,1)\).
A plausible range of values for the population parameter is called a confidence interval.
Analogy
\(\star\) Key Idea: If we report a point estimate, we probably won’t hit the exact population parameter. If we report a range of plausible values we have a good shot at capturing the parameter.
Facebook’s categorization of user interests
Most commercial websites (e.g. social media platforms, news out- lets, online retailers) collect a data about their users’ behaviors and use these data to deliver targeted content, recommendations, and ads.
To understand whether Americans think their lives line up with how the algorithm-driven classification systems categorizes them, Pew Research asked a representative sample of 850 American Facebook users how accurately they feel the list of categories Facebook has listed for them on the page of their supposed interests actually represents them and their interests. 67% of the respondents said that the listed categories were accurate.
Estimate the true proportion of American Facebook users who think the Facebook categorizes their interests accurately.
The goal of parameter estimation is to find a range of possible values (confidence interval).
Given information
The Confidence Interval
We want to find the 95% confidence interval using the formula: \[\text{point estimate} \pm 1.96 \times \text{SE}\] where SE is the standard error.
This can be written as \[ \begin{aligned} 0.67 & \pm 1.96 \times \sqrt{\frac{0.67 (1-0.67)}{850}} \\ 0.67 & \pm 1.96 \times 0.0161 \\ & \longrightarrow (0.67-0.0316,0.67+0.0316) \\ & \longrightarrow (0.6384,0.7016) \end{aligned} \]
Thus, the 95% interval for estimating the true \(p\) is between 0.6384 and 0.7016.
Which of the following is the correct interpretation of this confidence interval? We are 95% confident that:
\(\star\) 64% to 67% of all American Facebook users think Facebook categorizes their interests accurately.
\(\dagger\) Why do we interpret the confidence interval this way?
Suppose we took many samples and built a confidence interval from each sample using the equation \[\text{point estimate} \pm 1.96 \times \text{standard error}.\]
Then about 95% of those intervals would contain the true population proportion (\(p\)).
If we want to be more certain that we capture the population parameter, i.e. increase our confidence level, should we use a wider interval or a smaller interval?
\(\star\) A wider interval.
Can you see any drawbacks to using a wider interval?
\(\star\) If the interval is too wide it may not be very informative.
\[\text{point estimate} \pm z^{\star} \times \text{SE}.\]
Which of the below Z scores is the appropriate \(z^{\star}\) when calculating a 99.7% confidence interval?
\(\star\) Estimating the \(z^{\star}\) can be done using the 68-95-99.7 rule. We know that \(P(-3 \le Z \le 3) \approx 0.997\). So, the closest answer is \(Z = 2.97\).
.pdf
file.