Inference for Linear Regression

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

April 7, 2025

Objectives

Previously… (1/2)

Linear Regression

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Previously… (2/2)

Confidence Interval

\[\text{point estimate} \pm z^* SE\]

where \(SE\) is the standard error and \(z^*\) is the critical value for a given confidence level, assuming a standard normal distribution. Use \(t_{df}^*\) instead when assuming a \(t\)-distribution with \(df\) degrees of freedom.
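
As a quick sketch (not part of the original notes), the critical values can be obtained in R with qnorm() and qt(); the point estimate, standard error, and degrees of freedom below are made-up numbers for illustration.

```r
# critical values for a 95% confidence level
z_star <- qnorm(0.975)        # standard normal
t_star <- qt(0.975, df = 24)  # t-distribution with df = 24 (hypothetical)

# hypothetical point estimate and standard error
est <- 10.2
se  <- 1.3

# confidence interval: point estimate +/- critical value * SE
c(est - z_star * se, est + z_star * se)
c(est - t_star * se, est + t_star * se)
```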

Case Study I

Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars.

There are 1000 stores in our sample. Suppose that we know the population slope \(\beta_1 = 4.7\) and intercept \(\beta_0 = 12\).

Case Study I: The Linear Model

The population model is: \[y_{revenue} = \beta_0 + \beta_1 x_{advertising} + \epsilon\] where \(y\) is the response, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.

The least squares regression model uses the data to find the best linear fit: \[\hat{y}_{revenue} = b_0 + b_1 x_{advertising},\] where \(b_0 = 11.23\) and \(b_1 = 4.85\). These will be our point estimates.
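
A minimal R sketch of such a fit using lm(); the data here are simulated under assumed values (the error standard deviation is made up), so the fitted coefficients will not reproduce the exact numbers above.

```r
set.seed(1)
# simulate a data set of 1000 stores (values assumed for illustration)
advertising <- runif(1000, 2, 6)                        # advertising spend, thousands of dollars
revenue <- 12 + 4.7 * advertising + rnorm(1000, 0, 8)   # population model plus normal error
stores <- data.frame(advertising, revenue)

# least squares fit of revenue on advertising
fit <- lm(revenue ~ advertising, data = stores)
coef(fit)   # b0 (intercept) and b1 (slope)
```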

Case Study I: Residual Analysis (1/2)

Case Study I: Residual Analysis (2/2)

A Q-Q plot (quantile-quantile plot) is a graphical tool used to assess whether a dataset follows a particular theoretical distribution—most commonly, the normal distribution.
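
For example, the residuals of the fitted model can be inspected with a Q-Q plot in base R (a sketch, reusing the hypothetical fit object from the sketch above):

```r
res <- residuals(fit)   # residuals from the lm() fit

qqnorm(res)   # sample quantiles of the residuals vs. theoretical normal quantiles
qqline(res)   # reference line; points near the line suggest approximate normality
```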

Case Study I: Variability of the Statistic (1/4)

A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.

Case Study I: Variability of the Statistic (2/4)

A second sample of size 20 also shows a positive trend!

A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.

Case Study I: Variability of the Statistic (3/4)

The linear models from the two different random samples are quite similar, but they are not the same line.

Case Study I: Variability of the Statistic (4/4)

If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population.

Case Study I: Sampling Distribution of the Slope Estimate

Variability of slope estimates taken from many different samples of stores, each of size 20.

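A sketch of how this sampling distribution could be simulated in R, treating the hypothetical stores data frame from the earlier sketch as the population:

```r
set.seed(2)
# draw many samples of size 20 and record the fitted slope from each
slopes <- replicate(1000, {
  samp <- stores[sample(nrow(stores), 20), ]
  coef(lm(revenue ~ advertising, data = samp))[2]
})

hist(slopes, main = "Sampling distribution of the slope estimate", xlab = "b1")
sd(slopes)   # simulation-based standard error of the slope
```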

Case Study I: Confidence Interval

The 95% confidence intervals for the slope and intercept are computed from the bootstrapped samples. Here, the percentiles of the bootstrapped estimates are used to form the confidence intervals.

| term | estimate | lower bound | upper bound |
|------|----------|-------------|-------------|
| \(b_0\) | 11.3215 | 9.5334 | 12.9665 |
| \(b_1\) | 4.8254 | 4.4023 | 5.2861 |
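
A minimal sketch of a percentile bootstrap in base R, again using the hypothetical stores data frame from the earlier sketch as the observed sample; its output will not exactly reproduce the table above.

```r
set.seed(3)
# resample rows with replacement and refit the model each time
boot_coefs <- t(replicate(2000, {
  idx <- sample(nrow(stores), replace = TRUE)
  coef(lm(revenue ~ advertising, data = stores[idx, ]))
}))

# 95% percentile intervals for the intercept (b0) and slope (b1)
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
```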

Case Study I: Best Fit Line and Confidence Interval

Using the \(t\)-distribution (1/2)

An alternative to the bootstrapping method is to use the \(t\)-distribution to determine the confidence interval at a given confidence level.

The confidence intervals for the slope \(b_1\) and intercept \(b_0\) estimates are given by

\[ \begin{aligned} \text{slope } \longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } \longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]

where \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) are the standard errors of the slope and intercept sampling distributions, respectively. The critical value \(t_{df}^*\) is computed from the \(t\)-distribution given a confidence level and degrees of freedom \(df = n - 2\), where \(n\) is the sample size.
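
For instance, the critical value for a 95% confidence level can be computed in R with qt(); with the case study's sample of n = 1000 stores this gives roughly 1.96.

```r
n <- 1000                         # sample size in the case study
t_star <- qt(0.975, df = n - 2)   # 95% confidence level, df = n - 2
t_star                            # approximately 1.9623
```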

Using the \(t\)-distribution (2/2)

Computing the standard errors \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) by hand is fairly involved, and in practice they are usually obtained from built-in functions or bootstrap simulations in software such as R.

However, the standard errors are given by

\[ \begin{aligned} \text{SE}_{b_1} & = \frac{1}{s_x} \times \sqrt{\frac{s^2}{n}} \\ \text{SE}_{b_0} & = \sqrt{1 + \frac{\bar{x}^2}{s_x^2}} \times \sqrt{\frac{s^2}{n}} \end{aligned} \]

where \(s\) is the standard deviation of the residuals, \(s_x\) is the standard deviation of the predictor \(x\), \(\bar{x}\) is the mean of \(x\), and \(n\) is the sample size.
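
A direct translation of these formulas into R, using the summary statistics reported on the next slide; the residual standard deviation s is an assumed value here, chosen only so the results land near the reported standard errors.

```r
s    <- 7.9      # residual standard deviation (assumed for illustration)
s_x  <- 0.9909   # standard deviation of advertising
xbar <- 4.0137   # mean of advertising
n    <- 1000     # sample size

se_b1 <- (1 / s_x) * sqrt(s^2 / n)
se_b0 <- sqrt(1 + xbar^2 / s_x^2) * sqrt(s^2 / n)
c(SE_b1 = se_b1, SE_b0 = se_b0)   # close to 0.25 and 1.04
```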

Case Study I: Confidence Intervals using the \(t\)-Distribution (1/2)

The summary statistics of our explanatory and response variables, including the correlation coefficient, are given below.

| statistic | estimate |
|-----------|----------|
| \(\bar{x}\) | 4.0137 |
| \(\bar{y}\) | 30.7142 |
| \(s_x\) | 0.9909 |
| \(s_y\) | 9.2371 |
| \(r\) | 0.5207 |

The best-fit slope and intercept of the linear regression line, together with their standard errors, are:

| term | estimate | SE |
|------|----------|----|
| \(b_0\) | 11.2334 | 1.0415 |
| \(b_1\) | 4.8536 | 0.2519 |
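
The point estimates can also be recovered directly from these summary statistics, since \(b_1 = r \, s_y / s_x\) and \(b_0 = \bar{y} - b_1 \bar{x}\); a short R check:

```r
xbar <- 4.0137; ybar <- 30.7142
s_x  <- 0.9909; s_y  <- 9.2371
r    <- 0.5207

b1 <- r * s_y / s_x      # slope, about 4.85
b0 <- ybar - b1 * xbar   # intercept, about 11.23
c(b0 = b0, b1 = b1)
```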

Case Study I: Confidence Intervals using the \(t\)-Distribution (2/2)

The goal is to determine the 95% confidence interval of the slope and intercept estimates using the \(t\)-distribution approach.

\[ \begin{aligned} b_1 & \pm t_{df}^* \times \text{SE}_{b_1} \\ 4.8536 & \pm 1.9623 \times 0.2519 \end{aligned} \] \[(4.3593,5.3479)\]

\[ \begin{aligned} b_0 & \pm t_{df}^* \times \text{SE}_{b_0} \\ 11.2334 & \pm 1.9623 \times 1.0415 \end{aligned} \] \[(9.1897,13.2771)\]

The intervals computed using the \(t\)-distribution assume the CLT conditions hold, and the results here are approximately equal to the intervals computed using the bootstrapping method.
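
A sketch of the same calculation in R, plugging in the estimates and standard errors from the previous slides; with a fitted lm object, confint(fit) returns the equivalent intervals directly.

```r
n <- 1000
t_star <- qt(0.975, df = n - 2)

4.8536 + c(-1, 1) * t_star * 0.2519    # 95% CI for the slope b1
11.2334 + c(-1, 1) * t_star * 1.0415   # 95% CI for the intercept b0
```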

Activity: Determine Confidence Intervals of a Linear Model

  1. Log-in to Posit Cloud and open the R Studio assignment M 4/7 - Determine Confidence Intervals of Linear Model.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems on the designated sections.
