MTH-361A | Spring 2025 | University of Portland
April 2, 2025
The Linear Model
A linear model is written as
\[ y = \beta_0 + \beta_1 x + \epsilon \]
where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. The term \(\epsilon\) represents the model's error.
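As a minimal sketch of this model (the parameter values, noise level, and sample size below are hypothetical, chosen only for illustration), we can simulate data from it in Python:

```python
import numpy as np

# A minimal simulation of y = beta_0 + beta_1 * x + epsilon.
# The parameter values and noise level below are hypothetical.
rng = np.random.default_rng(seed=0)
beta_0, beta_1 = 5.0, 0.6              # hypothetical intercept and slope
n = 50                                 # number of observations

x = rng.uniform(0, 10, size=n)         # predictor values
epsilon = rng.normal(0, 1.5, size=n)   # mean-zero model error
y = beta_0 + beta_1 * x + epsilon      # outcomes generated by the model
```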
Notation:
We can use the sample statistics \(b_0\) and \(b_1\) as point estimates to infer the true values of the population parameters \(\beta_0\) and \(\beta_1\).
Sample data with their best-fitting lines (top row) and their corresponding residual plots (bottom row).
Terms:
The Sum of Squared Error (SSE) is a measure of the left-over variability in the \(y\) values when we know \(x\).
\[ SSE = \sum_{i=1}^n (e_i)^2 \]
The Total Sum of Squares (SST) measures the variability in the \(y\) values by how far they tend to fall from their mean, \(\bar{y}\).
\[ SST = \sum_{i=1}^n (y_i - \bar{y})^2 \]
where \(\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i\), and \(n\) is the number of observations.
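Continuing the simulation sketch above, and taking a hypothetical candidate fit \((b_0, b_1)\), both quantities can be computed directly:

```python
import numpy as np

# Reuses x and y from the simulation above. The candidate fit (b0, b1)
# here is a hypothetical guess; any line yields an SSE this way.
b0, b1 = 5.0, 0.6
y_hat = b0 + b1 * x                  # fitted values
e = y - y_hat                        # residuals e_i

SSE = np.sum(e ** 2)                 # left-over variability given x
SST = np.sum((y - y.mean()) ** 2)    # total variability around the mean of y
```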
To find the best linear fit, we minimize the SSE.
\[ SSE = \sum_{i=1}^n (e_i)^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2 \]
Plugging in the linear equation \(\hat{y}_i = b_0 + b_1 x_i\), we have
\[ SSE = \sum_{i=1}^n (y_i - (b_0 + b_1 x_i))^2. \]
Minimizing the above equation over all possible values of \(b_0\) and \(b_1\) is a calculus problem: take the partial derivatives of SSE with respect to \(b_0\) and \(b_1\), set them equal to zero, and solve for \(b_0\) and \(b_1\).
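Concretely, these two conditions are the normal equations
\[ \begin{aligned} \frac{\partial SSE}{\partial b_0} & = -2 \sum_{i=1}^n (y_i - b_0 - b_1 x_i) = 0 \\ \frac{\partial SSE}{\partial b_1} & = -2 \sum_{i=1}^n x_i (y_i - b_0 - b_1 x_i) = 0. \end{aligned} \]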
Long story short,
\[ \begin{aligned} b_1 = & \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ b_0 = & \bar{y} - b_1 \bar{x} \end{aligned} \]
where \(\bar{x}\) and \(\bar{y}\) are the means of \(x\) and \(y\), respectively.
We can rewrite the slope as follows:
\[ \begin{aligned} b_1 & = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\ & = \frac{\sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}} \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \\ b_1 & = \frac{s_y}{s_x} r \end{aligned} \]
where \(s_y\) and \(s_x\) are the standard deviations of \(y\) and \(x\) respectively, and \(r\) is the Pearson correlation coefficient of \(x\) and \(y\).
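As a quick numerical sanity check (reusing the simulated `x` and `y` from the sketch above), the two forms of the slope agree:

```python
import numpy as np

# Two equivalent forms of the slope, using the simulated x and y above.
b1_direct = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

s_x = x.std(ddof=1)            # sample standard deviation of x
s_y = y.std(ddof=1)            # sample standard deviation of y
r = np.corrcoef(x, y)[0, 1]    # Pearson correlation coefficient
b1_from_r = (s_y / s_x) * r

print(np.isclose(b1_direct, b1_from_r))   # True: both forms give the same slope
```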
Least-Squares Example Visualization: Shown here is some data (orange dots) and the best-fit linear model (red line) \(y = 5.37 + 0.62x\). You can try this least-squares regression interactive demo to visualize how it works.
To find the best fit linear model to data, we compute the slope and intercept by using the correlation and standard deviations.
\[ \begin{aligned} \text{mean of x} \longrightarrow & \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \\ \text{mean of y} \longrightarrow & \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \\ \text{standard deviation of x} \longrightarrow & s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} \\ \text{standard deviation of y} \longrightarrow & s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2} \\ \text{correlation of x and y} \longrightarrow & r = \frac{\sum_{i=1}^n{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} \\ \text{best fit slope} \longrightarrow & b_1 = \frac{s_y}{s_x} r \\ \text{best fit intercept} \longrightarrow & b_0 = \bar{y} - b_1 \bar{x} \end{aligned} \]
Note that the \(n-1\) term, known as Bessel's correction, makes the sample variance \(s^2\) an unbiased estimator of the population variance \(\sigma^2\).
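As a sketch of this recipe end-to-end (again assuming the simulated `x` and `y` from above), we can cross-check the hand-computed fit against NumPy's built-in `np.polyfit`:

```python
import numpy as np

# Best-fit slope and intercept following the recipe above (simulated x, y),
# cross-checked against NumPy's built-in least-squares polynomial fit.
r = np.corrcoef(x, y)[0, 1]
b1 = (y.std(ddof=1) / x.std(ddof=1)) * r
b0 = y.mean() - b1 * x.mean()

slope, intercept = np.polyfit(x, y, deg=1)   # degree-1 fit: slope first
print(np.allclose([b1, b0], [slope, intercept]))   # True
```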
The coefficient of determination can then be calculated as
\[ R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \]
where
\[ SSE = \sum_{i=1}^n (e_i)^2 \hspace{10px} \text{ and } \hspace{10px} SST = \sum_{i=1}^n (y_i - \bar{y})^2. \]
The range of \(R^2\) is from 0 to 1, and \(R^2\) is a measure of how well the linear regression fits the data.
Interpretation:
In the case of a linear model with one predictor and one outcome, the relationship between the correlation and the coefficient of determination is \(R^2 = r^2\).
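Continuing the earlier simulation sketch, we can verify this relationship numerically:

```python
import numpy as np

# R^2 from its definition, then checked against r^2 (simulated x, y above).
r = np.corrcoef(x, y)[0, 1]
b1 = (y.std(ddof=1) / x.std(ddof=1)) * r
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)                 # residuals of the best fit
SSE = np.sum(e ** 2)
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSE / SST

print(np.isclose(R2, r ** 2))         # True for simple linear regression
```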
Three plots, each with a least squares line and corresponding residual plot. Each dataset has at least one outlier.
Types of outliers.
We must be cautious when removing outliers from our models. Sometimes outliers are interesting cases that might be worth investigating, and they might even make a model much better.
Try out this least-squares regression interactive demo to play around with the effect of outliers on the fit.