MTH-161D | Spring 2025 | University of Portland
April 14, 2025
Linear Regression
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Confidence Interval
\[\text{point estimate} \pm z^* SE\]
where \(SE\) is the standard error and \(z^*\) is the critical value for a given confidence level assuming a standard normal distribution. Use \(t_{df}^*\) when assuming a \(t\)-distribution with \(df\) degrees of freedom.
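For instance, the critical values can be looked up in R. A minimal sketch, where the 95% confidence level and the degrees of freedom shown are illustrative choices rather than values from these notes:

```r
conf_level <- 0.95
alpha <- 1 - conf_level

# critical z-value, assuming a standard normal distribution
z_star <- qnorm(1 - alpha / 2)        # about 1.96

# critical t-value, assuming a t-distribution with (say) df = 24
t_star <- qt(1 - alpha / 2, df = 24)  # about 2.06

# interval: point estimate +/- critical value * SE
# e.g., a hypothetical point estimate of 10 with SE = 1.5
10 + c(-1, 1) * z_star * 1.5
```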
Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars.
There are 1000 stores in our sample. Suppose that we know the population slope \(\beta_1 = 4.7\) and intercept \(\beta_0 = 12\).
The population model is: \[y_{revenue} = \beta_0 + \beta_1 x_{advertising} + \epsilon\] where \(y\) is the response, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.
The least squares regression model uses the data to find the best linear fit: \[\hat{y}_{revenue} = b_0 + b_1 x_{advertising},\] where \(b_0 = 11.23\) and \(b_1 = 4.85\). These will be our point estimates.
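In R, the least squares fit can be obtained with `lm()`. A minimal sketch, assuming the sample is stored in a data frame named `stores` with columns `advertising` and `revenue` (hypothetical names):

```r
# least squares regression of revenue on advertising
# (assumes a data frame `stores` with columns `advertising` and `revenue`)
fit <- lm(revenue ~ advertising, data = stores)

coef(fit)     # point estimates b0 (intercept) and b1 (slope)
summary(fit)  # estimates together with their standard errors
```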
A Q-Q plot (quantile-quantile plot) is a graphical tool used to assess whether a dataset follows a particular theoretical distribution—most commonly, the normal distribution.
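For example, a normal Q-Q plot of the residuals from the hypothetical `fit` object above can be drawn with base R:

```r
# normal Q-Q plot of the residuals from the fitted model `fit`
qqnorm(residuals(fit))
qqline(residuals(fit))  # points close to this line suggest approximate normality
```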
A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.
A second sample of size 20 also shows a positive trend!
A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.
The linear models from the two different random samples are quite similar, but they are not the same line.
If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population.
Variability of slope estimates taken from many different samples of stores, each of size 20.
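A sketch of this repeated-sampling idea, assuming the full population of stores is stored in a data frame named `population` (hypothetical name) with columns `advertising` and `revenue`:

```r
set.seed(161)  # for reproducibility

# draw many samples of size 20 and record the fitted slope from each
slopes <- replicate(1000, {
  rows <- sample(nrow(population), size = 20)
  coef(lm(revenue ~ advertising, data = population[rows, ]))[2]
})

hist(slopes, xlab = "slope estimate b1",
     main = "Slopes from repeated samples of size 20")
sd(slopes)  # spread (standard error) of the slope estimates
```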
The 95% confidence intervals for the slope and intercept using the bootstrapped samples. Here, we use the percentiles of the bootstrapped estimates to construct the confidence intervals.
term | estimate | lower bound | upper bound |
---|---|---|---|
\(b_0\) | 11.3215 | 9.5334 | 12.9665 |
\(b_1\) | 4.8254 | 4.4023 | 5.2861 |
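A sketch of the percentile bootstrap, again assuming the observed sample is stored in the hypothetical data frame `stores`:

```r
set.seed(161)

# resample rows of the sample with replacement and refit the model
boot_coefs <- replicate(2000, {
  rows <- sample(nrow(stores), replace = TRUE)
  coef(lm(revenue ~ advertising, data = stores[rows, ]))
})

# percentile 95% confidence intervals
quantile(boot_coefs[1, ], probs = c(0.025, 0.975))  # intercept b0
quantile(boot_coefs[2, ], probs = c(0.025, 0.975))  # slope b1
```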
An alternative to the bootstrapping method is using the \(t\)-distribution to determine the confidence interval at a given confidence level.
The confidence intervals for the slope \(b_1\) and intercept \(b_0\) estimates are given by
\[ \begin{aligned} \text{slope } \longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } \longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]
where \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) are the standard errors for the slope and intercept sampling distributions, respectively. The critical value \(t_{df}^*\) is computed from the \(t\)-distribution given a confidence level and degrees of freedom \(df = n - 2\), where \(n\) is the sample size.
Computing the standard errors \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) by hand is fairly complicated, so they are usually obtained from built-in functions or bootstrapping simulations using software such as R.
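For example, R reports the standard errors with `summary()` and the \(t\)-based intervals with `confint()`. A sketch using the hypothetical `fit` object from before:

```r
summary(fit)$coefficients   # estimates and their standard errors
confint(fit, level = 0.95)  # t-based confidence intervals for b0 and b1

# the critical value itself, for a sample of n = 1000 stores
qt(0.975, df = 1000 - 2)    # about 1.9623
```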
However, the standard errors are given by
\[ \begin{aligned} \text{SE}_{b_1} & = \frac{1}{s_x} \times \sqrt{\frac{s^2}{n}} \\ \text{SE}_{b_0} & = \sqrt{1 + \frac{\bar{x}^2}{s_x^2}} \times \sqrt{\frac{s^2}{n}} \end{aligned} \]
where \(s^2\) is the variance of the residuals, \(s_x\) is the standard deviation of the explanatory variable \(x\), \(\bar{x}\) is the mean of \(x\), and \(n\) is the sample size.
The summary statistics of our explanatory and response variables are given below, which includes the correlation coefficient.
statistics | estimates |
---|---|
\(\bar{x}\) | 4.0137 |
\(\bar{y}\) | 30.7142 |
\(s_x\) | 0.9909 |
\(s_y\) | 9.2371 |
\(r\) | 0.5207 |
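Plugging these summary statistics into the formulas above reproduces, up to rounding, the standard errors reported in the next table. A sketch using \(n = 1000\) and the fact that, for simple regression, the residual variance is approximately \((1 - r^2) s_y^2\):

```r
x_bar <- 4.0137; s_x <- 0.9909   # summary statistics from the table
s_y   <- 9.2371; r   <- 0.5207
n     <- 1000                    # sample size

s2 <- (1 - r^2) * s_y^2          # (approximate) variance of the residuals

SE_b1 <- (1 / s_x) * sqrt(s2 / n)                  # standard error of the slope
SE_b0 <- sqrt(1 + (x_bar / s_x)^2) * sqrt(s2 / n)  # standard error of the intercept

c(SE_b0 = SE_b0, SE_b1 = SE_b1)  # roughly 1.04 and 0.25
```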
term | estimate | SE |
---|---|---|
\(b_0\) | 11.2334 | 1.0415 |
\(b_1\) | 4.8536 | 0.2519 |
The best-fit linear regression line, using the slope and intercept estimates from the table above, is \[\hat{y}_{revenue} = 11.2334 + 4.8536\, x_{advertising}.\]
The goal is to determine the 95% confidence interval of the slope and intercept estimates using the \(t\)-distribution approach.
\[ \begin{aligned} b_1 & \pm t_{df}^* \times \text{SE}_{b_1} \\ 4.8536 & \pm 1.9623 \times 0.2519 \end{aligned} \] \[(4.3593,5.3479)\]
\[ \begin{aligned} b_0 & \pm t_{df}^* \times \text{SE}_{b_0} \\ 11.2334 & \pm 1.9623 \times 1.0415 \end{aligned} \] \[(9.1897,13.2771)\]
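The same intervals can be computed directly in R:

```r
t_star <- qt(0.975, df = 1000 - 2)    # about 1.9623

4.8536  + c(-1, 1) * t_star * 0.2519  # 95% CI for the slope b1
11.2334 + c(-1, 1) * t_star * 1.0415  # 95% CI for the intercept b0
```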
The intervals computed using the \(t\)-distribution assume the CLT conditions hold, and the results here are approximately equal to the intervals computed using the bootstrapping method.