Linear Regression

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

March 31, 2025

Objectives

Introduce simple linear regression
Develop an understanding of the residuals
Know how to inspect the linearity and correlation of two numerical variables
Activity: Examine a Linear Model

Previously…

Relationship Between Variables

\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]

Associated vs Independent Variables

When two variables show some connection with one another, they are called associated or dependent variables.
In general, association does not imply causation, and causation can only be inferred from a randomized experiment.

High School Graduation and Poverty

The scatterplot on the right shows the relationship between the high school graduation rate in all 50 US states and DC and the % of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).

Response Variable (outcome): % in poverty
Explanatory Variable (predictor): % HS grad
Relationship: linear, negative, moderately strong

The Linear Model

A linear model is written as

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where $y$ is the outcome, $x$ is the predictor, $\beta_0$ is the intercept, and $\beta_1$ is the slope. The notation $\epsilon$ is the model’s error.

Notation:

Population Parameters: $\beta_0$ and $\beta_1$
Sample statistics (point estimates for the parameters): $b_0$ and $b_1$
Estimated/Predicted outcome: $\hat{y} = b_0 + b_1 x$

We can use the sample statistics $b_0$ and $b_1$ as point estimates to infer the true value of the population parameters $\beta_0$ and $\beta_1$.
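As a rough illustration of the difference between parameters and point estimates, the sketch below simulates a sample from a linear model and then estimates the intercept and slope from that sample. It assumes NumPy is available and that the line is fit by least squares (one common fitting method). The values `beta0 = 2.0` and `beta1 = 0.5`, the noise level, and the sample size are made up for illustration and have nothing to do with the poverty data.

```python
import numpy as np

rng = np.random.default_rng(361)  # fixed seed so the run is repeatable

# Hypothetical population parameters (assumed for illustration only)
beta0, beta1 = 2.0, 0.5

# Draw a sample from y = beta0 + beta1 * x + epsilon
x = rng.uniform(0, 10, size=200)
epsilon = rng.normal(0, 1, size=200)
y = beta0 + beta1 * x + epsilon

# Least-squares point estimates; polyfit returns the slope first for deg=1
b1, b0 = np.polyfit(x, y, deg=1)

print(f"b0 = {b0:.3f} (point estimate of beta0 = {beta0})")
print(f"b1 = {b1:.3f} (point estimate of beta1 = {beta1})")
```

The estimates should land near, but not exactly on, the parameter values, because they are computed from one random sample.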

Using a Linear Regression to Predict Poverty

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{poverty} = 64.78 - 0.62 \times HS_{grad} \]

where the sample statistics are the slope $b_1 = -0.62$ and the intercept $b_0 = 64.78$.

The “hat” in the $\widehat{poverty}$ indicates an estimated/predicted outcome.

The high school graduation rate in Georgia is 85.1%.

What poverty level does the model predict for this state?

The predicted poverty rate for Georgia, with a graduation rate of 85.1%, is \[ \widehat{poverty} = 64.78 - 0.62 \times 85.1 = 12.018 \] or about 12.0%.
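The same arithmetic can be checked with a few lines of Python. This is only a sketch: the coefficients come from the model above, while the function name `predict_poverty` is a hypothetical helper introduced here.

```python
# Coefficients of the fitted model from the slide
b0 = 64.78   # intercept
b1 = -0.62   # slope

def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS grad (hypothetical helper)."""
    return b0 + b1 * hs_grad

georgia = predict_poverty(85.1)
print(round(georgia, 3))  # 12.018
```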

Interpreting the Linear Model

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{poverty} = 64.78 - 0.62 \times HS_{grad} \]

where the sample statistics are the slope $b_1 = -0.62$ and the intercept $b_0 = 64.78$.

Interpreting the slope: If the high school graduation rate increases by 1 percentage point, then the model predicts that the poverty rate will decrease by approximately 0.62 percentage points.
Interpreting the intercept: If the high school graduation rate is 0%, then the model predicts that the poverty rate is approximately 64.78%.
It is necessary to understand - at least partially - the units in which the variables are measured in order to correctly interpret the slope and intercept.
It is good to understand data thoroughly and to understand the structure of the linear model.
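One way to check these interpretations is to evaluate the model at two graduation rates that differ by one percentage point, and at a graduation rate of 0. The sketch below uses the coefficients from the model above; the function name `predict_poverty` and the choice of 85.1 and 86.1 as inputs are illustrative assumptions.

```python
# Coefficients of the fitted model from the slide
b0, b1 = 64.78, -0.62

def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS grad (hypothetical helper)."""
    return b0 + b1 * hs_grad

# Slope: change in the prediction when % HS grad rises by one point
change = predict_poverty(86.1) - predict_poverty(85.1)

# Intercept: prediction when % HS grad is 0
at_zero = predict_poverty(0)

print(round(change, 2))  # -0.62
print(at_zero)           # 64.78
```

The change is the same for any pair of inputs one point apart, which is what it means for the model to be linear.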

Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.

(a) because this line appears to be minimizing most of the distances between the data points and the line.
The vertical distances between the data points and the line are called residuals.

Residuals

Residuals are the leftover variation in the data after accounting for the model fit: Data = Fit + Residual

The residual of the $i^{th}$ observation $(x_i,y_i)$ is the difference between the observed outcome ($y_i$) and the estimated/predicted outcome ($\hat{y}_i$).

\[ \epsilon_i = y_i - \hat{y}_i \]
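The definition can be sketched in Python as follows. The five $(x, y)$ pairs are made up for illustration and are not the state-level data, and the line is fit by least squares with NumPy's `polyfit`, which is an assumption about the fitting method.

```python
import numpy as np

# Hypothetical observations (made up; not the state-level data)
x = np.array([80.0, 82.0, 85.0, 88.0, 90.0])   # % HS grad
y = np.array([16.0, 13.5, 12.5, 10.0, 9.5])    # % in poverty

# Least-squares line: polyfit returns the slope first for deg=1
b1, b0 = np.polyfit(x, y, deg=1)

y_hat = b0 + b1 * x        # Fit
residuals = y - y_hat      # Residual = observed - predicted

# Data = Fit + Residual holds for every observation
assert np.allclose(y, y_hat + residuals)

for xi, yi, fi, ri in zip(x, y, y_hat, residuals):
    side = "above" if ri > 0 else "below"
    print(f"x={xi:.0f}: observed {yi:.2f}, predicted {fi:.2f}, "
          f"residual {ri:+.2f} ({side} the line)")
```

A positive residual means the observation lies above the line, and a negative residual means it lies below. For a least-squares line with an intercept, the residuals sum to zero up to rounding.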

Residuals

The % living in poverty in DC is 5.44 percentage points higher than predicted.
The % living in poverty in RI is 4.16 percentage points lower than predicted.

Error/Residuals Terminologies

The error, denoted $\epsilon$ in the general form of the linear model below, refers to the deviation of the observed values (samples) from the true values in the population (often unobserved). \[ y = \beta_0 + \beta_1 x + \epsilon \]
The residual, which is the error of the fitted model, refers to the deviation of the data (samples) from the estimated/predicted value. Data = Fit + Residual
For the $i^{th}$ observation, the residual is the difference between the observed ($y_i$) and estimated/predicted ($\hat{y}_i$) outcomes. \[ \epsilon_i = y_i - \hat{y}_i \]

Quantifying the Relationship with Correlation

The relationship between the two numerical variables shown on the right is a moderately strong, negative, linear relationship.

Correlation (notation: $r$) describes the strength of the linear association between two numerical variables.

It can have values between -1 (perfect negative) and +1 (perfect positive).
A value of 0 indicates no linear association.
Correlation has no units.
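These properties can be checked numerically. The sketch below computes $r$ with NumPy's `corrcoef` on a small made-up data set (not the state-level data) and then recomputes it after converting both variables from percentages to proportions, to illustrate that changing units leaves $r$ unchanged.

```python
import numpy as np

# Hypothetical observations (made up; not the state-level data)
x = np.array([80.0, 82.0, 85.0, 88.0, 90.0])   # % HS grad
y = np.array([16.0, 13.5, 12.5, 10.0, 9.5])    # % in poverty

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]

# Same data expressed as proportions instead of percentages
r_rescaled = np.corrcoef(x / 100, y / 100)[0, 1]

print(f"r = {r:.3f}")
print(f"r after rescaling = {r_rescaled:.3f}")
print("between -1 and +1:", -1 <= r <= 1)
```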

Quantifying the Relationship with Correlation

Example:

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) $r=0.6$
(b) $r=-0.75$
(c) $r=-0.1$
(d) $r=0.02$
(e) $r=-1.5$

(b) $r=-0.75$ because the association appears to be negative and the association seems to be strong.

Assessing the Correlation

Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?

(b) $\rightarrow$ correlation measures linear association and, when fitting a linear model to data, we try to minimize the residuals.
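The link between small residuals and a correlation near $\pm 1$ can be sketched with simulated data. Everything below is an illustrative assumption: the line $y = 1 + x$, the two noise levels, the sample size, and the helper name `fit_summary`. The sketch also uses a standard identity for a least-squares line with an intercept, $r^2 = 1 - \text{SSE}/\text{SST}$, where SSE is the sum of squared residuals and SST is the total sum of squares of $y$ around its mean.

```python
import numpy as np

rng = np.random.default_rng(331)  # fixed seed so the run is repeatable
x = rng.uniform(0, 10, size=100)

def fit_summary(x, y):
    """Return (r, SSE, 1 - SSE/SST) for a least-squares line (hypothetical helper)."""
    b1, b0 = np.polyfit(x, y, deg=1)
    residuals = y - (b0 + b1 * x)
    sse = np.sum(residuals**2)           # sum of squared residuals
    sst = np.sum((y - y.mean())**2)      # total variation in y
    r = np.corrcoef(x, y)[0, 1]
    return r, sse, 1 - sse / sst

# Same underlying line, small versus large scatter around it
y_tight = 1.0 + 1.0 * x + rng.normal(0, 0.5, size=100)
y_loose = 1.0 + 1.0 * x + rng.normal(0, 5.0, size=100)

r_tight, sse_tight, r2_tight = fit_summary(x, y_tight)
r_loose, sse_loose, r2_loose = fit_summary(x, y_loose)

print(f"tight scatter: r = {r_tight:.3f}, SSE = {sse_tight:.1f}")
print(f"loose scatter: r = {r_loose:.3f}, SSE = {sse_loose:.1f}")
```

The data set with the smaller residuals has the correlation closer to 1, and in both cases $r^2$ matches $1 - \text{SSE}/\text{SST}$.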

More Correlation Examples

Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.

More Correlation Examples

Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
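A small numerical example of a strong but nonlinear relationship is sketched below. The choice of $y = x^2$ on a grid symmetric around 0 is an assumption made for illustration and is not taken from the scatterplots. In this case $y$ is completely determined by $x$, yet the correlation is essentially zero.

```python
import numpy as np

# x values symmetric around 0; y depends on x exactly, but not linearly
x = np.linspace(-3, 3, 61)
y = x**2

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.6f}")
```

Because correlation only measures linear association, a value of $r$ near 0 does not by itself mean the variables are unrelated.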

Activity: Examine a Linear Model

Make sure you have a copy of the M 3/31 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
Get together with another student.
Discuss your results.
Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/

Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman & Hall/CRC. https://probstatsdata.com/