Multiple Linear Regression

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

April 14, 2025

Objectives

Previously… (1/2)

Linear Regression

\[ y = \beta_0 + \beta_1 x + \epsilon \]
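As a refresher, here is a minimal sketch of fitting this model in R with lm() on simulated data (all names and values below are illustrative, not from the case study).

set.seed(361)
x <- runif(50, 0, 10)                  # simulated explanatory variable
y <- 2 + 3 * x + rnorm(50, sd = 2)     # true beta_0 = 2, beta_1 = 3
slr_mod <- lm(y ~ x)                   # least-squares fit
coef(slr_mod)                          # point estimates b_0 and b_1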

Previously… (2/2)

Confidence Intervals for Linear Regression

\[ \begin{aligned} \text{slope } &\longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } &\longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]
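In R, these t-based intervals come from confint(); a sketch using the simulated fit from the previous slide:

confint(slr_mod, level = 0.95)   # intervals for the intercept and slope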

Hypothesis testing for Linear Regression
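The summary() output reports, for each coefficient, the t statistic and p-value for testing \(H_0: \beta_i = 0\). A sketch, again using the simulated fit above:

summary(slr_mod)$coefficients    # estimate, SE, t value, Pr(>|t|) per row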

Case Study I

This case study uses housing sales data from King County, WA, which includes the city of Seattle. The houses data set in the fosdata package contains a record of every house sold in King County from May 2014 to May 2015.

There are 21,613 houses in total with 21 variables. Below is a random sample of 4 houses with select variables.

## # A tibble: 4 × 9
##   zipcode  price bedrooms bathrooms sqft_living sqft_lot floors condition grade
##     <int>  <dbl>    <int>     <dbl>       <int>    <int>  <dbl>     <int> <int>
## 1   98042 208000        3      2           1250     7995    1           4     7
## 2   98027 588000        4      2.25        2580     7344    2           3     8
## 3   98010 325000        4      1.5         1470    70800    1           3     7
## 4   98109 740000        4      2.75        2890     4000    1.5         4     9
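A sketch of how such a sample could be drawn, assuming the fosdata and dplyr packages are installed (slice_sample() returns different rows on each run):

library(fosdata)
library(dplyr)

houses <- fosdata::houses              # 21,613 houses, 21 variables
houses |>
  select(zipcode, price, bedrooms, bathrooms, sqft_living,
         sqft_lot, floors, condition, grade) |>
  slice_sample(n = 4)                  # random sample of 4 houses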

Case Study I: Subset

In this example, we only look at houses located in zip code 98115. This reduces the data set to 583 houses. Below, we show a random sample of 4 houses with select variables.

## # A tibble: 4 × 9
##         id  price bedrooms bathrooms sqft_living sqft_lot floors condition grade
##      <dbl>  <dbl>    <int>     <dbl>       <int>    <int>  <dbl>     <int> <int>
## 1   1.03e8 645000        3      1.75        2070     5500      1         4     7
## 2   6.39e9 433500        3      1           1230     6000      1         4     7
## 3   5.10e9 450000        2      1           1380     4390      1         4     8
## 4   5.10e9 445000        4      2.5         2170     5257      2         3     7
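A sketch of the subsetting step (the object name houses_98115 is illustrative):

houses_98115 <- houses |>
  filter(zipcode == 98115)             # keep only zip code 98115
nrow(houses_98115)                     # 583 houses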

Case Study I: Data Exploration (1/3)

Exploring the data, we focus on the price, sqft_living, and sqft_lot variables.

The variables appear to be moderately right-skewed.
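Because of this skew, a log transformation is natural. A sketch of how the log_houses data frame used in the model below might be constructed (the name is taken from the lm() call on a later slide):

log_houses <- houses_98115 |>
  mutate(log_price       = log(price),
         log_sqft_living = log(sqft_living),
         log_sqft_lot    = log(sqft_lot))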

Case Study I: Data Exploration (2/3)

Case Study I: Data Exploration (3/3)

Case Study I: The Multiple Linear Regression Model (1/2)

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]

where \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the coefficients (slopes) of the explanatory variables.

Case Study I: The Multiple Linear Regression Model (2/2)

\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2\]

mlr_mod <- lm(log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
summary(mlr_mod)
## 
## Call:
## lm(formula = log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62790 -0.12085 -0.00189  0.12904  0.75567 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.14199    0.19922   40.87   <2e-16 ***
## log_sqft_living  0.65130    0.02303   28.29   <2e-16 ***
## log_sqft_lot     0.03422    0.01755    1.95   0.0517 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.211 on 580 degrees of freedom
## Multiple R-squared:  0.6021, Adjusted R-squared:  0.6007 
## F-statistic: 438.8 on 2 and 580 DF,  p-value: < 2.2e-16

We ignore most of the output for now and focus on the estimates of the coefficients. Namely, \(b_0 = 8.14199\), \(b_1 = 0.65130\), and \(b_2 = 0.03422\).

The p-values for the intercept and the log_sqft_living coefficient are very small, indicating statistical significance. The p-value for log_sqft_lot (0.0517) is marginal, falling just above the 0.05 threshold.
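The estimates can be pulled directly from the fitted model, and the model can be used for prediction. In the sketch below, the house is hypothetical, and exp() back-transforms the prediction from the log scale:

coef(mlr_mod)                                          # b_0, b_1, b_2
new_house <- data.frame(log_sqft_living = log(2000),   # 2,000 sq ft living
                        log_sqft_lot    = log(5000))   # 5,000 sq ft lot
exp(predict(mlr_mod, newdata = new_house))             # predicted price in dollars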

Assessing the Multiple Linear Regression Model

The R-squared is a measure of how well the regression fits the data, but it is a biased estimate of the amount of variability explained by the model when the model contains more than one predictor.

\[R^2 = 1 - \frac{SSE}{SST}\]

The adjusted R-squared penalizes for the number of predictors, which makes it appropriate for comparing regression models that contain different numbers of predictors.

This is computed as

\[ \begin{aligned} R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} \\ &= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned} \]

where \(n\) is the number of observations used to fit the model and \(k\) is the number of predictor variables in the model.
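A sketch verifying both quantities by hand from the fitted model above, using sums of squares in place of the variances (the ratios are identical):

n   <- nobs(mlr_mod)                       # 583 observations
k   <- 2                                   # predictors in the model
sse <- sum(residuals(mlr_mod)^2)           # sum of squared residuals
sst <- sum((log_houses$log_price - mean(log_houses$log_price))^2)

1 - sse / sst                              # R-squared: ~0.6021
1 - (sse / (n - k - 1)) / (sst / (n - 1))  # adjusted R-squared: ~0.6007

summary(mlr_mod)$r.squared                 # matches the summary output
summary(mlr_mod)$adj.r.squared             # matches the summary output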

Activity: Selecting Variables for Multiple Linear Regression

  1. Log in to Posit Cloud and open the RStudio assignment M 4/14 - Selecting Variables for Multiple Linear Regression.
  2. Make sure you are in the correct working directory. Rename the .Rmd file by replacing [name] with your name using the format [First name][Last initial]. Then, open the .Rmd file.
  3. Change the author in the YAML header.
  4. Read the provided instructions.
  5. Answer all exercise problems in the designated sections.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2019). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/
Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman & Hall/CRC. https://probstatsdata.com/