Linear Regression

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

March 31, 2025

Objectives

Previously…

Relationship Between Variables

\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]

Associated vs Independent Variables

High School Graduation and Poverty

The scatterplot on the right shows the relationship between the high school graduation rate and the percentage of residents living below the poverty line (income below $23,050 for a family of 4 in 2012) for all 50 US states and DC.

The Linear Model

A linear model is written as

\[ y = \beta_0 + \beta_1 x + \epsilon \]

where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, and \(\beta_1\) is the slope. The notation \(\epsilon\) is the model’s error.

Notation:

We can use the sample statistics \(b_0\) and \(b_1\) as point estimates to infer the true values of the population parameters \(\beta_0\) and \(\beta_1\).
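As a minimal sketch of where \(b_0\) and \(b_1\) could come from, the code below fits a line by ordinary least squares. This is an assumption: the slides do not name the fitting method at this point. The function name `fit_line` and the four data points are hypothetical, chosen only for illustration, and are not the state data used in this lecture.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1

# Hypothetical sample (NOT the state data from the slides).
x = [80.0, 84.0, 86.0, 90.0]
y = [16.0, 13.0, 12.0, 9.0]

b0, b1 = fit_line(x, y)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

A different sample would give different values of \(b_0\) and \(b_1\), which is why they are treated as estimates of \(\beta_0\) and \(\beta_1\) rather than as the parameters themselves.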

Using a Linear Regression to Predict Poverty

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{poverty} = 64.78 - 0.62 \times HS_{grad} \]

where the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\) are sample statistics.

The “hat” in the \(\widehat{poverty}\) indicates an estimated/predicted outcome.

The high school graduation rate in Georgia is 85.1%.

What poverty level does the model predict for this state?
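The prediction can be checked by plugging Georgia's rate into the fitted equation. The sketch below uses the slope and intercept given on this slide. The function name `predict_poverty` is an illustrative choice, not something from the course materials.

```python
B0 = 64.78   # intercept b_0 from the slide
B1 = -0.62   # slope b_1 from the slide

def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS graduation rate."""
    return B0 + B1 * hs_grad

georgia = predict_poverty(85.1)
print(f"Predicted poverty for Georgia: {georgia:.2f}%")  # about 12.02%
```

By hand, \(64.78 - 0.62 \times 85.1 = 64.78 - 52.762 = 12.018\), so the model predicts a poverty rate of roughly 12%.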

Interpreting the Linear Model

The linear model for predicting poverty from high school graduation rate in the US is

\[ \widehat{poverty} = 64.78 - 0.62 \times HS_{grad} \]

where the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\) are sample statistics.
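One way to read the two numbers is to compare predictions directly. The sketch below assumes the standard interpretation of a fitted line: the slope is the change in the predicted outcome for a one-unit increase in the predictor, and the intercept is the prediction when the predictor is 0. The function name is illustrative.

```python
def predict_poverty(hs_grad):
    """Predicted % in poverty from the fitted line on the slide."""
    return 64.78 - 0.62 * hs_grad

# Slope: difference in predictions when HS grad rate rises by one point.
slope_effect = predict_poverty(81.0) - predict_poverty(80.0)

# Intercept: prediction at a 0% HS graduation rate.
at_zero = predict_poverty(0.0)

print(f"Change per extra point of HS grad: {slope_effect:.2f}")
print(f"Prediction at 0% HS grad: {at_zero:.2f}")
```

Under this reading, each additional percentage point of high school graduation is associated with a predicted poverty rate about 0.62 percentage points lower. The intercept describes a state with a 0% graduation rate, which is an extrapolation and has little practical meaning on its own.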

Eyeballing the line

Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.

Residuals

Residuals are the leftover variation in the data after accounting for the model fit:
Data = Fit + Residual

The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed outcome \(y_i\) and the estimated/predicted outcome \(\hat{y}_i\):

\[ \epsilon_i = y_i - \hat{y}_i \]
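A short sketch of the residual calculation, using the fitted poverty line from earlier in the lecture. The observation below is hypothetical: the graduation rate matches the Georgia example, but the observed poverty value of 14.0 is invented for illustration and is not taken from the data.

```python
def predict_poverty(hs_grad):
    """Predicted % in poverty from the fitted line in the lecture."""
    return 64.78 - 0.62 * hs_grad

def residual(x_i, y_i):
    """Observed outcome minus predicted outcome for one observation."""
    return y_i - predict_poverty(x_i)

# Hypothetical observation (the observed poverty value is made up).
x_i, y_i = 85.1, 14.0
res = residual(x_i, y_i)
print(f"Residual: {res:.3f}")
```

A positive residual means the observed point lies above the line (the model under-predicts), and a negative residual means it lies below the line (the model over-predicts).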

Error/Residuals Terminologies

Quantifying the Relationship with Correlation

The two numerical variables shown on the right have a moderately strong, negative, linear relationship.

Correlation (notation: \(r\)) describes the strength of the linear association between two numerical variables.
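The slides do not give a formula for \(r\) here. A common definition, assumed in the sketch below, is the sample Pearson correlation: the sum of the products of standardized \(x\) and \(y\) values, divided by \(n - 1\). The function name `correlation` and the data are hypothetical and serve only as an illustration.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)   # sample standard deviations
    products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)

# Hypothetical sample (NOT the state data from the slides).
x = [80.0, 84.0, 86.0, 90.0]
y = [16.0, 13.0, 12.0, 9.0]

r = correlation(x, y)
print(f"r = {r:.3f}")
```

Because the values are standardized, \(r\) has no units and always falls between -1 and +1. The sign gives the direction of the linear association, and the magnitude gives its strength.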

Quantifying the Relationship with Correlation

Example:

Which of the following is the best guess for the correlation between % in poverty and % HS grad?

(a) \(r=0.6\)
(b) \(r=-0.75\)
(c) \(r=-0.1\)
(d) \(r=0.02\)
(e) \(r=-1.5\)

Assessing the Correlation

Which of the following has the strongest correlation, i.e., the correlation coefficient closest to +1 or -1?

More Correlation Examples

Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.

More Correlation Examples

Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
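A small sketch of this effect, assuming the sample Pearson correlation as the definition of \(r\). The data are hypothetical: \(y\) is determined exactly by \(x\) through \(y = x^2\), so the relationship is as strong as it can be, yet it is not linear.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (
        ((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
        for xi, yi in zip(x, y)
    )
    return sum(products) / (n - 1)

# Hypothetical data: a perfect U-shaped (quadratic) relationship.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

r_curve = correlation(x, y)
print(f"r for y = x^2: {abs(r_curve):.3f}")
```

The result is essentially zero, because the downward trend on the left cancels the upward trend on the right. A correlation near 0 therefore rules out a linear association but not a relationship in general, so it helps to look at the scatterplot as well.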

Activity: Examine a Linear Model

  1. Make sure you have a copy of the M 3/31 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
  2. Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
  3. Get together with another student.
  4. Discuss your results.
  5. Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/
Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman and Hall/CRC. https://probstatsdata.com/