MTH-361A | Spring 2025 | University of Portland
April 14, 2025
Linear Regression
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Confidence Intervals for Linear Regression
\[ \begin{aligned} \text{slope } \longrightarrow b_1 \pm t_{df}^* \text{SE}_{b_1} \\ \text{intercept } \longrightarrow b_0 \pm t_{df}^* \text{SE}_{b_0} \end{aligned} \]
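These interval formulas can be checked directly in R. Below is a minimal sketch on simulated data (the parameter values and variable names are hypothetical); `confint()` reproduces the \(b \pm t_{df}^* \text{SE}\) computation.

```r
# Simulate data from y = beta_0 + beta_1 x + epsilon
# (beta_0 = 2, beta_1 = 0.5 are hypothetical values)
set.seed(361)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rnorm(100, sd = 1)

fit <- lm(y ~ x)

# Manual 95% CI for the slope: b_1 +/- t*_{df} SE_{b_1}
b1     <- coef(summary(fit))["x", "Estimate"]
se_b1  <- coef(summary(fit))["x", "Std. Error"]
t_star <- qt(0.975, df = fit$df.residual)
c(b1 - t_star * se_b1, b1 + t_star * se_b1)

# confint() gives the same intervals for both slope and intercept
confint(fit, level = 0.95)
```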
Hypothesis Testing for Linear Regression
As an example, consider housing sales data from King County, WA, which includes the city of Seattle. The `houses` data set in the `fosdata` package contains a record of every house sold in King County from May 2014 to May 2015. There are 21,613 houses in total with 21 variables. Below is a random sample of 4 houses with select variables.
## # A tibble: 4 × 9
## zipcode price bedrooms bathrooms sqft_living sqft_lot floors condition grade
## <int> <dbl> <int> <dbl> <int> <int> <dbl> <int> <int>
## 1 98042 208000 3 2 1250 7995 1 4 7
## 2 98027 588000 4 2.25 2580 7344 2 3 8
## 3 98010 325000 4 1.5 1470 70800 1 3 7
## 4 98109 740000 4 2.75 2890 4000 1.5 4 9
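For reference, here is a sketch of how the full data set could be loaded and a sample like the one above drawn (assuming the `fosdata` and `dplyr` packages are installed; the selected columns are taken from the display):

```r
library(dplyr)

houses <- fosdata::houses  # King County sales, May 2014 to May 2015
dim(houses)                # 21613 houses, 21 variables

houses |>
  select(zipcode, price, bedrooms, bathrooms, sqft_living,
         sqft_lot, floors, condition, grade) |>
  slice_sample(n = 4)
```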
In this example, we only look at the houses located in the zip code 98115. This reduces the data set to 583 houses. Below, we show a random sample of 4 houses with select variables.
## # A tibble: 4 × 9
## id price bedrooms bathrooms sqft_living sqft_lot floors condition grade
## <dbl> <dbl> <int> <dbl> <int> <int> <dbl> <int> <int>
## 1 1.03e8 645000 3 1.75 2070 5500 1 4 7
## 2 6.39e9 433500 3 1 1230 6000 1 4 7
## 3 5.10e9 450000 2 1 1380 4390 1 4 8
## 4 5.10e9 445000 4 2.5 2170 5257 2 3 7
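A sketch of this filtering step, continuing from the previous chunk:

```r
# Keep only houses in zip code 98115
houses_98115 <- houses |>
  filter(zipcode == 98115)

nrow(houses_98115)  # 583 houses

houses_98115 |>
  select(id, price, bedrooms, bathrooms, sqft_living,
         sqft_lot, floors, condition, grade) |>
  slice_sample(n = 4)
```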
Exploring the data, we focus on the `price`, `sqft_living`, and `sqft_lot` variables. All three variables appear to be moderately right-skewed.
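One way to check this skewness is to plot histograms of the three variables; a sketch, assuming `houses_98115` from above and the `ggplot2` and `tidyr` packages:

```r
library(ggplot2)
library(tidyr)

# Histograms of the three variables on their original scale
houses_98115 |>
  select(price, sqft_living, sqft_lot) |>
  pivot_longer(everything(), names_to = "variable") |>
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```

Since the variables are right-skewed, we log-transform them and model the log-scale data with a multiple linear regression of the form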
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]
where \(\beta_0\) is the intercept, \(\beta_1\) and \(\beta_2\) are the coefficients (slopes) of the explanatory variables.
Here the explanatory variables \(x_1\) and \(x_2\) are `log_sqft_living` and `log_sqft_lot`, and the response \(y\) is `log_price`. The estimated regression equation is
\[\hat{y} = b_0 + b_1 x_1 + b_2 x_2\]
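The summary output below comes from fitting this model in R. A sketch of the code that would produce it (the data-frame name `log_houses` matches the `Call` line; natural logarithms are assumed):

```r
# Log-transform the skewed variables (natural log)
log_houses <- houses_98115 |>
  mutate(log_price       = log(price),
         log_sqft_living = log(sqft_living),
         log_sqft_lot    = log(sqft_lot))

fit <- lm(log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
summary(fit)
```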
##
## Call:
## lm(formula = log_price ~ log_sqft_living + log_sqft_lot, data = log_houses)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62790 -0.12085 -0.00189 0.12904 0.75567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.14199 0.19922 40.87 <2e-16 ***
## log_sqft_living 0.65130 0.02303 28.29 <2e-16 ***
## log_sqft_lot 0.03422 0.01755 1.95 0.0517 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.211 on 580 degrees of freedom
## Multiple R-squared: 0.6021, Adjusted R-squared: 0.6007
## F-statistic: 438.8 on 2 and 580 DF, p-value: < 2.2e-16
We ignore most of the output for now and focus on the estimates of the coefficients. Namely, \(b_0 = 8.14199\), \(b_1 = 0.65130\), and \(b_2 = 0.03422\).
The p-values for the intercept and `log_sqft_living` are extremely small, so those estimates are statistically significant. The p-value for `log_sqft_lot` (0.0517) is just above the conventional 0.05 threshold, so its coefficient is not significant at the 5% level.
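Both the hypothesis tests reported in the summary and the confidence intervals from the earlier section can be recovered by hand. A sketch, assuming `fit` is the model object above:

```r
# t statistic and two-sided p-value for log_sqft_living by hand:
# t = b_1 / SE_{b_1}, with df = n - k - 1 = 583 - 2 - 1 = 580
est   <- coef(summary(fit))["log_sqft_living", ]
t_val <- est["Estimate"] / est["Std. Error"]
t_val                                      # ~28.29, as in the summary
2 * pt(-abs(t_val), df = fit$df.residual)  # two-sided p-value

# 95% confidence intervals for all three coefficients
confint(fit)
```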
The R-squared is a measure of how well the linear regression fits the data, but it is a biased estimate of the amount of variability explained by the model when the model has more than one predictor.
\[R^2 = 1 - \frac{SSE}{SST}\]
The adjusted R-squared makes it possible to compare the explanatory power of regression models that contain different numbers of predictors. It is computed as
\[ \begin{aligned} R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)} {s_{\text{outcome}}^2 / (n-1)} \\ &= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1} \end{aligned} \]
where \(n\) is the number of observations used to fit the model and \(k\) is the number of predictor variables in the model.
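Both quantities can be verified from the fitted model. A sketch, assuming `fit` and `log_houses` from above (here \(n = 583\) and \(k = 2\)):

```r
n <- nrow(log_houses)  # 583 observations
k <- 2                 # predictors: log_sqft_living, log_sqft_lot

SSE <- sum(residuals(fit)^2)
SST <- sum((log_houses$log_price - mean(log_houses$log_price))^2)

r2     <- 1 - SSE / SST                              # multiple R-squared
r2_adj <- 1 - (SSE / (n - k - 1)) / (SST / (n - 1))  # adjusted R-squared

c(r2 = r2, r2_adj = r2_adj)  # compare with 0.6021 and 0.6007 above
```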
Rename the `.Rmd` file by replacing `[name]` with your name using the format `[First name][Last initial]`. Then, open the `.Rmd` file.