MTH-361A | Spring 2025 | University of Portland
April 14, 2025
Linear Regression
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Confidence Intervals for Linear Regression
\[ \begin{aligned} \text{slope } \longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } \longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]
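These interval formulas can be checked directly in R. Below is a minimal sketch on simulated data (the parameter values and variable names are hypothetical); `confint()` reproduces the \(b \pm t_{df}^* \text{SE}\) computation.

```r
# Simulate data from y = beta_0 + beta_1 x + epsilon
# (beta_0 = 2, beta_1 = 0.5 are hypothetical values)
set.seed(361)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)

# Manual 95% CI for the slope: b_1 +/- t*_{df} SE_{b_1}
b1     <- coef(summary(fit))["x", "Estimate"]
se_b1  <- coef(summary(fit))["x", "Std. Error"]
t_star <- qt(0.975, df = fit$df.residual)
c(b1 - t_star * se_b1, b1 + t_star * se_b1)

# confint() gives the same intervals for both slope and intercept
confint(fit, level = 0.95)
```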
Hypothesis Testing for Linear Regression
As an example, consider housing sales data from King County, WA, which includes the city of Seattle. The `houses` data set in the `fosdata` package contains a record of every house sold in King County from May 2014 to May 2015. There are 21,613 houses in total with 21 variables. Below is a random sample of 4 houses with select variables.
## # A tibble: 4 × 9
## zipcode price bedrooms bathrooms sqft_living sqft_lot floors condition grade
## <int> <dbl> <int> <dbl> <int> <int> <dbl> <int> <int>
## 1 98042 208000 3 2 1250 7995 1 4 7
## 2 98027 588000 4 2.25 2580 7344 2 3 8
## 3 98010 325000 4 1.5 1470 70800 1 3 7
## 4 98109 740000 4 2.75 2890 4000 1.5 4 9
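For reference, here is a sketch of how the full data set could be loaded and a sample like the one above drawn (assuming the `fosdata` and `dplyr` packages are installed; the selected columns are taken from the display):

```r
library(dplyr)

houses <- fosdata::houses  # King County sales, May 2014 to May 2015
dim(houses)                # 21613 houses, 21 variables

houses |>
  select(zipcode, price, bedrooms, bathrooms, sqft_living,
         sqft_lot, floors, condition, grade) |>
  slice_sample(n = 4)
```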
In this example, we only look at the houses located in the zip code 98115. This reduces the data set to 583 houses. Below, we show a random sample of 4 houses with select variables.
## # A tibble: 4 × 9
## id price bedrooms bathrooms sqft_living sqft_lot floors condition grade
## <dbl> <dbl> <int> <dbl> <int> <int> <dbl> <int> <int>
## 1 1.03e8 645000 3 1.75 2070 5500 1 4 7
## 2 6.39e9 433500 3 1 1230 6000 1 4 7
## 3 5.10e9 450000 2 1 1380 4390 1 4 8
## 4 5.10e9 445000 4 2.5 2170 5257 2 3 7
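A sketch of this filtering step, continuing from the previous chunk:

```r
# Keep only houses in zip code 98115
houses_98115 <- houses |>
  filter(zipcode == 98115)

nrow(houses_98115)  # 583 houses

houses_98115 |>
  select(id, price, bedrooms, bathrooms, sqft_living,
         sqft_lot, floors, condition, grade) |>
  slice_sample(n = 4)
```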
Exploring the data, we focus on the `price`, `sqft_living`, and `sqft_lot` variables. All three variables appear to be moderately right-skewed.
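One way to check this skewness is to plot histograms of the three variables; a sketch, assuming `houses_98115` from above and the `ggplot2` and `tidyr` packages:

```r
library(ggplot2)
library(tidyr)

# Histograms of the three variables on their original scale
houses_98115 |>
  select(price, sqft_living, sqft_lot) |>
  pivot_longer(everything(), names_to = "variable") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```

Since the variables are right-skewed, we log-transform them and model the log-scale data with a multiple linear regression of the form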
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]
where \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the coefficients (slopes) of the explanatory variables.
Here the explanatory variables \(x_1\) and \(x_2\) are `log_sqft_living` and `log_sqft_lot`, and the response \(y\) is `log_price`. The estimated regression equation is
\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2\]
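The summary output below comes from fitting this model in R. A sketch of the code that would produce it (the data-frame name `log_houses` matches the `Call` line; natural logarithms are assumed):

```r
# Log-transform the skewed variables (natural log)
log_houses <- houses_98115 |>
  mutate(log_price       = log(price),
         log_sqft_living = log(sqft_living),
         log_sqft_lot    = log(sqft_lot))

fit <- lm(log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
summary(fit)
```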
##
## Call:
## lm(formula = log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62790 -0.12085 -0.00189 0.12904 0.75567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.14199 0.19922 40.87 <2e-16 ***
## log_sqft_living 0.65130 0.02303 28.29 <2e-16 ***
## log_sqft_lot 0.03422 0.01755 1.95 0.0517 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.211 on 580 degrees of freedom
## Multiple R-squared: 0.6021, Adjusted R-squared: 0.6007
## F-statistic: 438.8 on 2 and 580 DF, p-value: < 2.2e-16
We ignore most of the output for now and focus on the estimates of the coefficients. Namely, \(b_0 = 8.14199\), \(b_1 = 0.65130\), and \(b_2 = 0.03422\).
The p-values for the intercept and `log_sqft_living` are extremely small, so those estimates are statistically significant. The p-value for `log_sqft_lot` (0.0517) is just above the conventional 0.05 threshold, so its coefficient is not significant at the 5% level.
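Both the hypothesis tests reported in the summary and the confidence intervals from the earlier section can be recovered by hand. A sketch, assuming `fit` is the model object above:

```r
# t statistic and two-sided p-value for log_sqft_living by hand:
# t = b_1 / SE_{b_1}, with df = n - k - 1 = 583 - 2 - 1 = 580
est   <- coef(summary(fit))["log_sqft_living", ]
t_val <- est["Estimate"] / est["Std. Error"]
t_val                                      # ~28.29, as in the summary
2 * pt(-abs(t_val), df = fit$df.residual)  # two-sided p-value

# 95% confidence intervals for all three coefficients
confint(fit)
```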
The R-squared is a measure of how well the linear regression fits the data, but it is a biased estimate of the amount of variability explained by the model when the model has more than one predictor.
\[R^2 = 1 - \frac{SSE}{SST}\]
The adjusted R-squared makes it possible to compare the explanatory power of regression models that contain different numbers of predictors. It is computed as
\[ \begin{aligned} R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} \\ &= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned} \]
where \(n\) is the number of observations used to fit the model and \(k\) is the number of predictor variables in the model.
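Both quantities can be verified from the fitted model. A sketch, assuming `fit` and `log_houses` from above (here \(n = 583\) and \(k = 2\)):

```r
n <- nrow(log_houses)  # 583 observations
k <- 2                 # predictors: log_sqft_living, log_sqft_lot

SSE <- sum(residuals(fit)^2)
SST <- sum((log_houses$log_price - mean(log_houses$log_price))^2)

r2     <- 1 - SSE / SST                              # multiple R-squared
r2_adj <- 1 - (SSE / (n - k - 1)) / (SST / (n - 1))  # adjusted R-squared

c(r2 = r2, r2_adj = r2_adj)  # compare with 0.6021 and 0.6007 above
```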
Rename the `.Rmd` file by replacing `[name]` with your name using the format `[First name][Last initial]`. Then, open the `.Rmd` file.