MTH-361A | Spring 2025 | University of Portland
March 31, 2025
Relationship Between Variables
\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]
Associated vs Independent Variables
When two variables show some connection with one another, they are called associated (or dependent) variables. Two variables that are not associated are called independent.
In general, association does not imply causation, and causation can only be inferred from a randomized experiment.
The scatterplot on the right shows the relationship between the high school (HS) graduation rate in all 50 US states and DC and the percentage of residents who live below the poverty line (income below $23,050 for a family of 4 in 2012).
A linear model is written as
\[ y = \beta_0 + \beta_1 x + \epsilon \]
where \(y\) is the outcome, \(x\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon\) is the model’s error.
Notation:
We can use the sample statistics \(b_0\) and \(b_1\) as point estimates to infer the true value of the population parameters \(\beta_0\) and \(\beta_1\).
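One common way to obtain the point estimates \(b_0\) and \(b_1\) from a sample is least squares. The notes do not name the fitting method at this point, so treat the following as a minimal sketch under that assumption. The function name `fit_line` and the data values are hypothetical and are not taken from the poverty data set.

```python
def fit_line(xs, ys):
    """Return (b0, b1): least-squares intercept and slope for y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations in x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope estimate
    b0 = y_bar - b1 * x_bar   # intercept estimate
    return b0, b1


# Hypothetical sample that lies exactly on the line y = 2 + 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```

Because the made-up points have no scatter, the estimates recover the intercept 2 and slope 3 of the line that generated them. With real data, \(b_0\) and \(b_1\) would only approximate \(\beta_0\) and \(\beta_1\).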
The linear model for predicting poverty from high school graduation rate in the US is
\[ \widehat{poverty} = 64.78 - 0.62 \times HS_{grad} \]
where the sample statistics are the slope \(b_1 = -0.62\) and the intercept \(b_0 = 64.78\).
The “hat” in \(\widehat{poverty}\) indicates an estimated/predicted outcome.
The high school graduation rate in Georgia is 85.1%.
What poverty level does the model predict for this state?
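The prediction can be worked out by plugging 85.1 into the fitted line. A minimal sketch follows; the coefficients come from the model above, while the names `B0`, `B1`, and `predict_poverty` are chosen here for illustration only.

```python
B0 = 64.78   # intercept b_0 of the fitted model
B1 = -0.62   # slope b_1 of the fitted model


def predict_poverty(hs_grad):
    """Predicted % in poverty for a given % HS graduation rate."""
    return B0 + B1 * hs_grad


georgia = predict_poverty(85.1)
print(round(georgia, 2))
```

The arithmetic is \(64.78 - 0.62 \times 85.1 = 12.018\), so the model predicts a poverty rate of about 12.02% for Georgia.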
Which of the following appears to be the line that best fits the linear relationship between % in poverty and % HS grad? Choose one.
The residual of the \(i^{th}\) observation \((x_i,y_i)\) is the difference between the observed value \(y_i\) and the estimated/predicted value \(\hat{y}_i\).
\[ \epsilon_i = y_i - \hat{y}_i \]
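As a quick illustration of the residual formula, the sketch below uses the fitted poverty model together with a hypothetical observation. The observed poverty value of 14.0% is made up for the example and is not taken from the data set.

```python
B0 = 64.78   # intercept of the fitted model
B1 = -0.62   # slope of the fitted model

# Hypothetical observation (x_i, y_i): the 14.0 is an assumed value
hs_grad_i = 85.1
observed_poverty_i = 14.0

predicted_poverty_i = B0 + B1 * hs_grad_i             # y-hat_i
residual_i = observed_poverty_i - predicted_poverty_i  # y_i - y-hat_i
print(round(residual_i, 3))
```

A positive residual means the observed point lies above the fitted line, and a negative residual means it lies below.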
The relationship between the two numerical variables shown on the right is a moderately strong, negative, linear relationship.
Correlation (notation: \(r\)) describes the strength of the linear association between two numerical variables.
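To make the idea concrete, here is a minimal sketch that computes the sample correlation from its usual definition (sum of cross-deviations divided by the square root of the product of the sums of squared deviations). The function name `correlation` and both data sets are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation r between two equal-length numeric lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
r_neg = correlation(xs, [10, 8, 6, 4, 2])  # points exactly on a falling line
r_pos = correlation(xs, [2, 4, 5, 4, 5])   # upward trend with some scatter
print(round(r_neg, 3), round(r_pos, 3))
```

The first data set lies exactly on a decreasing line, so its correlation is \(-1\). The second has a positive but imperfect trend, so its correlation is between 0 and 1. The value of \(r\) always lies between \(-1\) and \(+1\).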
Example:
Which of the following is the best guess for the correlation between % in poverty and % HS grad?
(a) \(r=0.6\)
(b) \(r=-0.75\)
(c) \(r=-0.1\)
(d) \(r=0.02\)
(e) \(r=-1.5\)
Which of the following has the strongest correlation, i.e. correlation coefficient closest to +1 or -1?
Sample scatterplots and their correlations. The first row shows variables with a positive relationship, represented by the trend up and to the right. The second row shows variables with a negative trend, where a large value in one variable is associated with a lower value in the other.
Sample scatterplots and their correlations. In each case, there is a strong relationship between the variables. However, because the relationship is not linear, the correlation is relatively weak.
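A small numerical sketch of the same point: the hypothetical data below follow the curve \(y = x^2\) exactly, so the relationship is perfectly predictable, yet the correlation comes out as zero because the relationship is not linear. The function name `correlation` and the data are assumptions made for illustration.

```python
import math


def correlation(xs, ys):
    """Sample correlation r between two equal-length numeric lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data on a parabola, symmetric around x = 0
xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]
r_curve = correlation(xs, ys)
print(r_curve)
```

The positive and negative cross-deviations cancel, so \(r = 0\) even though \(y\) is completely determined by \(x\).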