MTH 361 A/B SP23 (University of Portland): Lab 4 R Studio Cloud
Content Source: OpenIntro Statistics (4th edition) by David Diez, Mine Cetinkaya-Rundel, and Christopher Barr.
In this lab, we will explore and visualize the data using the
tidyverse
suite of packages. The data can be found in the
companion package for OpenIntro labs, openintro. We
will also use the infer package for resampling. In
addition, we will be using some other packages such as the
skimr, statsr, and
broom as tools to aid us on data processing.
Let’s load the packages.
library(tidyverse)
library(openintro)
library(infer)
library(skimr)
library(statsr)
library(broom)
To create your new lab report, in RStudio, go to New File -> R
Markdown… Then, choose From Template and then choose
Lab Report for OpenIntro Statistics Labs
from the list of
templates.
Setting a seed: We will take some random samples and build sampling distributions in this lab, which means you should set a seed at the start of your lab. If this concept is new to you, review the lab on “Probability and the Normal Distribution”.
You will be analyzing the dataset yrbss
, where you will
delv into a sample from the Youth Risk Behavior Surveillance System
(YRBSS) survey, which uses data from high schoolers to help discover
health patterns.
Remember that you can use filter
to limit the dataset to
just non-helmet wearers. Here, we will name the dataset
no_helmet
.
<- yrbss %>%
no_helmet filter(helmet_12m == "never")
Also, it may be easier to calculate the proportion if you create a
new variable that specifies whether the individual has texted every day
while driving over the past 30 days or not. We will call this variable
text_ind
.
<- no_helmet %>%
no_helmet %>%
drop_na mutate(text_ind = ifelse(text_while_driving_30d == "30", "yes", "no"))
When summarizing the YRBSS, the Centers for Disease Control and Prevention seeks insight into the population parameters. To do this, you can answer the question, “What proportion of people in your sample reported that they have texted while driving each day for the past 30 days?” with a statistic; while the question “What proportion of people on earth have texted while driving each day for the past 30 days?” is answered with an estimate of the parameter.
The inferential tools for estimating population proportion are analogous to those used for means in the last chapter: the confidence interval and the hypothesis test.
%>%
no_helmet specify(response = text_ind, success = "yes") %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "prop") %>%
get_ci(level = 0.95)
Note that since the goal is to construct an interval estimate for a
proportion, it’s necessary to both include the success
argument within specify
, which accounts for the proportion
of non-helmet wearers than have consistently texted while driving the
past 30 days, in this example, and that stat
within
calculate
is here “prop”, signaling that you are trying to
do some sort of inference on a proportion.
What is the margin of error for the estimate of the proportion of non-helmet wearers that have texted while driving each day for the past 30 days based on this survey?
Using the infer
package, calculate confidence
intervals for two other categorical variables (you’ll need to decide
which level to call “success”, and report the associated margins of
error. Interpret the interval in context of the data. It may be helpful
to create new dataframes for each of the two categorical variables
first, and then use these data sets to construct the confidence
intervals.
Imagine you’ve set out to survey 1000 people on two questions: are you at least 6-feet tall? and are you left-handed? Since both of these sample proportions were calculated from the same sample size, they should have the same margin of error, right? Wrong! While the margin of error does change with sample size, it is also affected by the proportion.
Think back to the formula for the standard error: \(SE = \sqrt{p(1-p)/n}\). This is then used in the formula for the margin of error for a 95% confidence interval: \[ ME = 1.96\times SE = 1.96\times\sqrt{p(1-p)/n} \,. \] Since the population proportion \(p\) is in this \(ME\) formula, it should make sense that the margin of error is in some way dependent on the population proportion. We can visualize this relationship by creating a plot of \(ME\) vs. \(p\).
Since sample size is irrelevant to this discussion, let’s just set it to some value (\(n = 1000\)) and use this value in the following calculations:
<- 1000 n
The first step is to make a variable p
that is a
sequence from 0 to 1 with each number incremented by 0.01. You can then
create a variable of the margin of error (me
) associated
with each of these values of p
using the familiar
approximate formula (\(ME = 2 \times
SE\)).
<- seq(from = 0, to = 1, by = 0.01)
p <- 2 * sqrt(p * (1 - p)/n) me
Lastly, you can plot the two variables against each other to reveal
their relationship. To do so, we need to first put these variables in a
data frame that you can call in the ggplot
function.
<- data.frame(p = p, me = me)
dd ggplot(data = dd, aes(x = p, y = me)) +
geom_line() +
labs(x = "Population Proportion", y = "Margin of Error")
p
and
me
. Include the margin of error vs. population proportion
plot you constructed in your answer. For a given sample size, for which
value of p
is margin of error maximized?For some of the exercises below, you will conduct inference comparing
two proportions. In such cases, you have a response variable that is
categorical, and an explanatory variable that is also categorical, and
you are comparing the proportions of success of the response variable
across the levels of the explanatory variable. This means that when
using infer
, you need to include both variables within
specify
.
Is there convincing evidence that those who sleep 10+ hours per day are more likely to strength train every day of the week? As always, write out the hypotheses for any tests you conduct and outline the status of the conditions for inference. If you find a significant difference, also quantify this difference with a confidence interval.
Let’s say there has been no difference in likeliness to strength train every day of the week for those who sleep 10+ hours. What is the probability that you could detect a change (at a significance level of 0.05) simply by chance? Hint: Review the definition of the Type 1 error.
Suppose you’re hired by the local government to estimate the
proportion of residents that attend a religious service on a weekly
basis. According to the guidelines, the estimate must have a margin of
error no greater than 1% with 95% confidence. You have no idea what to
expect for \(p\). How many people would
you have to sample to ensure that you are within the guidelines?
Hint: Refer to your plot of the relationship between \(p\) and margin of error. This question does
not require using a dataset.
Every two years, the Centers for Disease Control and Prevention conduct the Youth Risk Behavior Surveillance System (YRBSS) survey, where it takes data from high schoolers (9th through 12th grade), to analyze health patterns. You will work with a selected group of variables from a random sample of observations during one of the years the YRBSS was conducted.
Load the yrbss
data set into your workspace.
data(yrbss)
There are observations on 13 different variables, some categorical and some numerical. The meaning of each variable can be found by bringing up the help file:
?yrbss
Remember that you can answer this question by viewing the data in the data viewer or by using the following command:
glimpse(yrbss)
You will first start with analyzing the weight of the participants in
kilograms: weight
.
Using visualization and summary statistics, describe the distribution
of weights. The skim()
function from the
skimr package produces nice summaries of the variables
in the dataset, separating categorical (character) variables from
quantitative variables.
%>%
yrbss skim()
Next, consider the possible relationship between a high schooler’s weight and their physical activity. Plotting the data is a useful first step because it helps us quickly visualize trends, identify strong associations, and develop research questions.
First, let’s create a new variable physical_3plus
, which
will be coded as either “yes” if the student is physically active for
at least 3 days a week, and “no” if not.
<- yrbss %>%
yrbss drop_na() %>%
mutate(physical_3plus = if_else(physically_active_7d > 2, "yes", "no"))
physical_3plus
and
weight
. Is there a relationship between these two
variables? What did you expect and why?The box plots show how the medians of the two distributions compare,
but we can also compare the means of the distributions using the
following to first group the data by the physical_3plus
variable, and then calculate the mean weight
in these
groups using the mean
function while ignoring missing
values by setting the na.rm
argument to
TRUE
.
%>%
yrbss group_by(physical_3plus) %>%
summarise(mean_weight = mean(weight, na.rm = TRUE))
There is an observed difference, but is this difference large enough to deem it “statistically significant”? In order to answer this question we will conduct a hypothesis test.
Are all conditions necessary for inference satisfied? Comment on
each. You can compute the group sizes with the summarize
command above by defining a new variable with the definition
n()
.
Write the hypotheses for testing if the average weights are different for those who exercise at least times a week and those who don’t.
Next, we will work through creating a permutation distribution using tools from the infer package.
But first, we need to initialize the test, which we will save as
obs_diff
.
<- yrbss %>%
obs_diff specify(weight ~ physical_3plus) %>%
calculate(stat = "diff in means", order = c("yes", "no"))
Recall that the specify()
function is used to specify
the variables you are considering (notated y ~x
), and you
can use the calculate()
function to specify the
stat
istic you want to calculate and the order
of subtraction you want to use. For this hypothesis, the statistic you
are searching for is the difference in means, with the order being
yes - no
.
After you have calculated your observed statistic, you need to create
a permutation distribution. This is the distribution that is created by
shuffling the observed weights into new physical_3plus
groups, labeled “yes” and “no”
We will save the permutation distribution as
null_dist
.
<- yrbss %>%
null_dist specify(weight ~ physical_3plus) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("yes", "no"))
The hypothesize()
function is used to declare what the
null hypothesis is. Here, we are assuming that student’s weight is
independent of whether they exercise at least 3 days or not.
We should also note that the type
argument within
generate()
is set to "permute"
. This ensures
that the statistics calculated by the calculate()
function
come from a reshuffling of the data (not a resampling of the data)!
Finally, the specify()
and calculate()
steps
should look familiar, since they are the same as what we used to find
the observed difference in means!
We can visualize this null distribution with the following code:
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()
Add a vertical red line to the plot above, demonstrating where
the observed difference in means (obs_diff
) falls on the
distribution.
How many of these null_dist
permutations have a
difference at least as large (or larger) as
obs_diff
?
Now that you have calculated the observed statistic and generated a
permutation distribution, you can calculate the p-value for your
hypothesis test using the function get_p_value()
from the
infer package.
%>%
null_dist get_p_value(obs_stat = obs_diff, direction = "two_sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step. See
## `?get_p_value()` for more information.
What warning message do you get? Why do you think you get this warning message?
Construct and record a confidence interval for the difference between the weights of those who exercise at least three times a week and those who don’t, and interpret this interval in context of the data.
Calculate a 95% confidence interval for the average height in
meters (height
) and interpret it in context.
Conduct a hypothesis test evaluating whether the average height is different for those who exercise at least three times a week and those who don’t.
The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through a bunch of different variables for many countries around the globe. It serves as a rough objective measure for the relationships between the different types of freedom - whether it’s political, religious, economical or personal freedom - and other social and economic circumstances. The Human Freedom Index is an annually co-published report by the Cato Institute, the Fraser Institute, and the Liberales Institut at the Friedrich Naumann Foundation for Freedom.
You’ll be analysing data from the Human Freedom Index reports. Your aim will be to summarize a few of the relationships within the data both graphically and numerically in order to find which variables can help tell a story about freedom.
The data we’re working with is in the openintro package and it’s
called hfi
, short for Human Freedom Index.
What are the dimensions of the dataset? What does each row
represent? The dataset spans a lot of years, but we are only interested
in data from year 2016. Filter the data hfi
data frame for
year 2016, select the six variables, and assign the result to a data
frame named hfi_2016
.
What type of plot would you use to display the relationship
between the personal freedom score, pf_score
, and
pf_expression_control
? Plot this relationship using the
variable pf_expression_control
as the predictor. Does the
relationship look linear? If you knew a country’s
pf_expression_control
, or its score out of 10, with 0 being
the most, of political pressures and controls on media content, would
you be comfortable using a linear model to predict the personal freedom
score?
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
%>%
hfi_2016 summarise(cor(pf_expression_control, pf_score))
In this section, you will use an interactive function to investigate
what we mean by “sum of squared residuals”. You will need to run this
function in your console, not in your markdown document. Running the
function also requires that the hfi
dataset is loaded in
your environment. You will also need to make sure the Plots tab in the
lower right-hand corner has enough space to make a plot.
Think back to the way that we described the distribution of a single
variable. Recall that we discussed characteristics such as center,
spread, and shape. It’s also useful to be able to describe the
relationship of two numerical variables, such as
pf_expression_control
and pf_score
above.
Just as you’ve used the mean and standard deviation to summarize a single variable, you can summarize the relationship between these two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016)
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue.
If your plot is appearing below your code chunk and won’t let you select points to make a line, take the following steps:
Recall that the residuals are the difference between the observed values and the values predicted by the line:
\[ e_i = y_i - \hat{y}_i \]
The most common way to do linear regression is to select the line
that minimizes the sum of squared residuals. To visualize the squared
residuals, you can rerun the plot command and add the argument
showSquares = TRUE
.
plot_ss(x = pf_expression_control, y = pf_score, data = hfi_2016, showSquares = TRUE)
Note that the output from the plot_ss
function provides
you with the slope and intercept of your line as well as the sum of
squares.
plot_ss
, choose a line that does a good job of
minimizing the sum of squares. Run the function several times. What was
the smallest sum of squares that you got? How does it compare to your
neighbours?It is rather cumbersome to try to get the correct least squares line,
i.e. the line that minimizes the sum of squared residuals, through trial
and error. Instead, you can use the lm
function in R to fit
the linear model (a.k.a. regression line).
<- lm(pf_score ~ pf_expression_control, data = hfi_2016) m1
The first argument in the function lm()
is a formula
that takes the form y ~ x
. Here it can be read that we want
to make a linear model of pf_score
as a function of
pf_expression_control
. The second argument specifies that R
should look in the hfi
data frame to find the two
variables.
Note: Piping will not work here, as the data frame is not the first argument!
The output of lm()
is an object that contains all of the
information we need about the linear model that was just fit. We can
access this information using the tidy()
function.
tidy(m1)
Let’s consider this output piece by piece. First, the formula used to
describe the model is shown at the top, in what’s displayed as the
“Call”. After the formula you find the five-number summary of the
residuals. The “Coefficients” table shown next is key; its first column
displays the linear model’s y-intercept and the coefficient of
pf_expression_control
. With this table, we can write down
the least squares regression line for the linear model:
\[ \hat{y} = 4.28 + 0.542 \times pf\_expression\_control \]
This equation tells us two things:
pf_expression_control
of 0 (those
with the largest amount of political pressure on media content), we
expect their mean personal freedom score to be 4.28.pf_expression_control
, we
expect a country’s mean personal freedom score to increase 0.542
units.We can assess model fit using \(R^2\), the proportion of variability in the
response variable that is explained by the explanatory variable. We use
the glance()
function to access this information.
glance(m1)
For this model, 71.4% of the variability in pf_score
is
explained by pf_expression_control
.
pf_expression_control
to
predict hf_score
, or the total human freedom score. Using
the estimates from the R output, write the equation of the regression
line. What does the slope tell us in the context of the relationship
between human freedom and the amount of political pressure on media
content?Let’s create a scatterplot with the least squares line for
m1
laid on top.
ggplot(data = hfi_2016, aes(x = pf_expression_control, y = pf_score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
Here, we are literally adding a layer on top of our plot.
geom_smooth
creates the line by fitting a linear model. It
can also show us the standard error se
associated with our
line, but we’ll suppress that for now.
This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
pf_expression_control
? Is this an
overestimate or an underestimate, and by how much? In other words, what
is the residual for this prediction?To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
In order to do these checks we need access to the fitted (predicted)
values and the residuals. We can use the augment()
function
to calculate these.
<- augment(m1) m1_aug
Linearity: You already checked if the relationship
between pf_score
and pf_expression_control
is
linear using a scatterplot. We should also verify this condition with a
plot of the residuals vs. fitted (predicted) values.
ggplot(data = m1_aug, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
xlab("Fitted values") +
ylab("Residuals")
Notice here that m1
can also serve as a data set because
stored within it are the fitted values (\(\hat{y}\)) and the residuals. Also note
that we’re getting fancy with the code here. After creating the
scatterplot on the first layer (first line of code), we overlay a red
horizontal dashed line at \(y = 0\) (to
help us check whether the residuals are distributed around 0), and we
also rename the axis labels to be more informative.
Nearly normal residuals: To check this condition, we can look at a histogram of the residuals.
ggplot(data = m1_aug, aes(x = .resid)) +
geom_histogram(binwidth = 0.25) +
xlab("Residuals")
Constant variability:
Choose another variable that you think would strongly correlate
with pf_score
. Produce a scatterplot of the two variables
and fit a linear model. At a glance, does there seem to be a linear
relationship?
How does this relationship compare to the relationship between
pf_score
and pf_expression_control
? Use the
\(R^2\) values from the two model
summaries to compare. Does your independent variable seem to predict
pf_score
better? Why or why not?
Check the model diagnostics using appropriate visualisations and evaluate if the model conditions have been met.
This
work is licensed under a
Creative
Commons Attribution-ShareAlike 4.0 International License.