Inference for Linear Regression

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

April 7, 2025

Objectives

Previously… (1/2)

Linear Regression

\[ y = \beta_0 + \beta_1 x + \epsilon \]

Previously… (2/2)

Confidence Interval

\[\text{point estimate} \pm z^* SE\]

where \(SE\) is the standard error and \(z^*\) is the critical value for a given confidence level, assuming a standard normal distribution. Use \(t_{df}^*\) instead when assuming a \(t\)-distribution with \(df\) degrees of freedom.
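
As a quick sketch (not part of the original notes), the critical values can be obtained in R with qnorm() and qt(); the point estimate, standard error, and degrees of freedom below are made-up numbers for illustration.

```r
# critical values for a 95% confidence level
z_star <- qnorm(0.975)        # standard normal
t_star <- qt(0.975, df = 24)  # t-distribution with df = 24 (hypothetical)

# hypothetical point estimate and standard error
est <- 10.2
se  <- 1.3

# confidence interval: point estimate +/- critical value * SE
c(est - z_star * se, est + z_star * se)
c(est - t_star * se, est + t_star * se)
```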

Case Study I

Revenue as a linear model of advertising dollars for a population of sandwich stores, in thousands of dollars.

There are 1000 stores in our sample. Suppose that we know the population slope \(\beta_1 = 4.7\) and intercept \(\beta_0 = 12\).

Case Study I: The Linear Model

The population model is: \[y_{revenue} = \beta_0 + \beta_1 x_{advertising} + \epsilon\] where \(y\) is the response, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the error term.

The least squares regression model uses the data to find the best linear fit: \[\hat{y}_{revenue} = b_0 + b_1 x_{advertising},\] where \(b_0 = 11.23\) and \(b_1 = 4.85\). These will be our point estimates.
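
A minimal R sketch of such a fit using lm(); the data here are simulated under assumed values (the error standard deviation is made up), so the fitted coefficients will not reproduce the exact numbers above.

```r
set.seed(1)
# simulate a data set of 1000 stores (values assumed for illustration)
advertising <- runif(1000, 2, 6)                        # advertising spend, thousands of dollars
revenue <- 12 + 4.7 * advertising + rnorm(1000, 0, 8)   # population model plus normal error
stores <- data.frame(advertising, revenue)

# least squares fit of revenue on advertising
fit <- lm(revenue ~ advertising, data = stores)
coef(fit)   # b0 (intercept) and b1 (slope)
```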

Case Study I: Residual Analysis (1/2)

Case Study I: Residual Analysis (2/2)

A Q-Q plot (quantile-quantile plot) is a graphical tool used to assess whether a dataset follows a particular theoretical distribution—most commonly, the normal distribution.
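
For example, the residuals of the fitted model can be inspected with a Q-Q plot in base R (a sketch, reusing the hypothetical fit object from the sketch above):

```r
res <- residuals(fit)   # residuals from the lm() fit

qqnorm(res)   # sample quantiles of the residuals vs. theoretical normal quantiles
qqline(res)   # reference line; points near the line suggest approximate normality
```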

Case Study I: Variability of the Statistic (1/4)

A random sample of 20 stores from the entire population. A linear trend between advertising and revenue continues to be observed.

Case Study I: Variability of the Statistic (2/4)

A second sample of size 20 also shows a positive trend!

A different random sample of 20 stores from the entire population. Again, a linear trend between advertising and revenue is observed.

Case Study I: Variability of the Statistic (3/4)

The linear models from the two different random samples are quite similar, but they are not the same line.

Case Study I: Variability of the Statistic (4/4)

If repeated samples of size 20 are taken from the entire population, each linear model will be slightly different. The red line provides the linear fit to the entire population.

Case Study I: Sampling Distribution of the Slope Estimate

Variability of slope estimates taken from many different samples of stores, each of size 20.

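A sketch of how this sampling distribution could be simulated in R, treating the hypothetical stores data frame from the earlier sketch as the population:

```r
set.seed(2)
# draw many samples of size 20 and record the fitted slope from each
slopes <- replicate(1000, {
  samp <- stores[sample(nrow(stores), 20), ]
  coef(lm(revenue ~ advertising, data = samp))[2]
})

hist(slopes, main = "Sampling distribution of the slope estimate", xlab = "b1")
sd(slopes)   # simulation-based standard error of the slope
```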

Case Study I: Confidence Interval

The 95% confidence intervals for the slope and intercept are computed from the bootstrapped samples. Here, the percentiles of the bootstrapped estimates are used to form the confidence intervals.

| term | estimate | lower bound | upper bound |
|------|----------|-------------|-------------|
| \(b_0\) | 11.3215 | 9.5334 | 12.9665 |
| \(b_1\) | 4.8254 | 4.4023 | 5.2861 |
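
A minimal sketch of a percentile bootstrap in base R, again using the hypothetical stores data frame from the earlier sketch as the observed sample; its output will not exactly reproduce the table above.

```r
set.seed(3)
# resample rows with replacement and refit the model each time
boot_coefs <- t(replicate(2000, {
  idx <- sample(nrow(stores), replace = TRUE)
  coef(lm(revenue ~ advertising, data = stores[idx, ]))
}))

# 95% percentile intervals for the intercept (b0) and slope (b1)
apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))
```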

Case Study I: Best Fit Line and Confidence Interval

Using the \(t\)-distribution (1/2)

An alternative to the bootstrapping method is to use the \(t\)-distribution to determine the confidence interval at a given confidence level.

The confidence intervals for the slope \(b_1\) and intercept \(b_0\) estimates are given by

\[ \begin{aligned} \text{slope } \longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } \longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]

where \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) are the standard errors of the slope and intercept sampling distributions, respectively. The critical value \(t_{df}^*\) is computed from the \(t\)-distribution given a confidence level and degrees of freedom \(df = n - 2\), where \(n\) is the sample size.
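
For instance, the critical value for a 95% confidence level can be computed in R with qt(); with the case study's sample of n = 1000 stores this gives roughly 1.96.

```r
n <- 1000                         # sample size in the case study
t_star <- qt(0.975, df = n - 2)   # 95% confidence level, df = n - 2
t_star                            # approximately 1.9623
```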

Using the \(t\)-distribution (2/2)

Computing the standard errors \(\text{SE}_{b_1}\) and \(\text{SE}_{b_0}\) by hand is fairly involved, and in practice they are usually obtained from built-in functions or bootstrap simulations in software such as R.

However, the standard errors are given by

\[ \begin{aligned} \text{SE}_{b_1} & = \frac{1}{s_x} \times \sqrt{\frac{s^2}{n}} \\ \text{SE}_{b_0} & = \sqrt{1 + \frac{\bar{x}^2}{s_x^2}} \times \sqrt{\frac{s^2}{n}} \end{aligned} \]

where \(s\) is the standard deviation of the residuals, \(s_x\) is the standard deviation of the predictor \(x\), \(\bar{x}\) is the mean of \(x\), and \(n\) is the sample size.
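
A direct translation of these formulas into R, using the summary statistics reported on the next slide; the residual standard deviation s is an assumed value here, chosen only so the results land near the reported standard errors.

```r
s    <- 7.9      # residual standard deviation (assumed for illustration)
s_x  <- 0.9909   # standard deviation of advertising
xbar <- 4.0137   # mean of advertising
n    <- 1000     # sample size

se_b1 <- (1 / s_x) * sqrt(s^2 / n)
se_b0 <- sqrt(1 + xbar^2 / s_x^2) * sqrt(s^2 / n)
c(SE_b1 = se_b1, SE_b0 = se_b0)   # close to 0.25 and 1.04
```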

Case Study I: Confidence Intervals using the \(t\)-Distribution (1/2)

The summary statistics of our explanatory and response variables, including the correlation coefficient, are given below.

| statistic | estimate |
|-----------|----------|
| \(\bar{x}\) | 4.0137 |
| \(\bar{y}\) | 30.7142 |
| \(s_x\) | 0.9909 |
| \(s_y\) | 9.2371 |
| \(r\) | 0.5207 |

The best-fit slope and intercept of the linear regression line, together with their standard errors, are:

| term | estimate | SE |
|------|----------|----|
| \(b_0\) | 11.2334 | 1.0415 |
| \(b_1\) | 4.8536 | 0.2519 |
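
The point estimates can also be recovered directly from these summary statistics, since \(b_1 = r \, s_y / s_x\) and \(b_0 = \bar{y} - b_1 \bar{x}\); a short R check:

```r
xbar <- 4.0137; ybar <- 30.7142
s_x  <- 0.9909; s_y  <- 9.2371
r    <- 0.5207

b1 <- r * s_y / s_x      # slope, about 4.85
b0 <- ybar - b1 * xbar   # intercept, about 11.23
c(b0 = b0, b1 = b1)
```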

Case Study I: Confidence Intervals using the \(t\)-Distribution (2/2)

The goal is to determine the 95% confidence interval of the slope and intercept estimates using the \(t\)-distribution approach.

\[ \begin{aligned} b_1 & \pm t_{df}^* \times \text{SE}_{b_1} \\ 4.8536 & \pm 1.9623 \times 0.2519 \end{aligned} \] \[(4.3593,5.3479)\]

\[ \begin{aligned} b_0 & \pm t_{df}^* \times \text{SE}_{b_0} \\ 11.2334 & \pm 1.9623 \times 1.0415 \end{aligned} \] \[(9.1897,13.2771)\]

The intervals computed using the \(t\)-distribution assume the CLT conditions hold, and the results here are approximately equal to the intervals computed using the bootstrapping method.
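
A sketch of the same calculation in R, plugging in the estimates and standard errors from the previous slides; with a fitted lm object, confint(fit) returns the equivalent intervals directly.

```r
n <- 1000
t_star <- qt(0.975, df = n - 2)

4.8536 + c(-1, 1) * t_star * 0.2519    # 95% CI for the slope b1
11.2334 + c(-1, 1) * t_star * 1.0415   # 95% CI for the intercept b0
```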

Activity: Determine Confidence Intervals of a Linear Model

  1. Log-in to Posit Cloud and open the R Studio assignment M 4/7 - Determine Confidence Intervals of Linear Model.
  2. Make sure you are in the current working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems on the designated sections.
