Study Design and Inference

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

February 3, 2025

Objectives

Previously… (1/2)

The guiding principle of statistics is statistical thinking.

Statistical Thinking in the Data Science Life Cycle

Previously… (2/2)

Types of Variables

Relationship Among Variables

Does there appear to be a relationship between the hours of study per week and the GPA of a student?

\(\star\) As hours of study increase, GPA tends to increase as well, but among students who study roughly 0 to 30 hours per week there is a lot of variation. One student has a GPA > 4.0, which is likely a data error.

Explanatory vs Response Variables

\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]

Associated vs Independent Variables

Case Study I: Population vs Sample

Research Question: Can people become better, more efficient runners on their own, merely by running?

\(\star\) Results from the random sample can be generalized only to adult women, not to all people.

Case Study II: Anecdotal Evidence

Research Question: Can smoking contribute to negative health outcomes?

\(\star\) Anecdotal evidence refers to information or conclusions drawn from personal experiences, individual stories, or isolated examples rather than systematic data collection or rigorous scientific analysis.

Case Study II: Early Smoking Research

Research Question: Can smoking contribute to negative health outcomes?

\(\star\) Overgeneralization is a logical fallacy where a conclusion is drawn from insufficient or unrepresentative evidence, applying it too broadly.

Census

Wouldn’t it be better to just include everyone and “sample” the entire population?

This is called a Census.

Problems with taking a census:

Sampling Bias

What is sampling bias? It occurs when the sample collected for a study or survey is not representative of the larger population that the study aims to analyze.
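
The effect of an unrepresentative sample can be illustrated with a small simulation. The sketch below is a minimal Python example under invented assumptions: the population of 1,000 students, the study-hour ranges, and the "survey at the library" design are all hypothetical and chosen only to make the bias visible.

```python
import random
import statistics

rng = random.Random(361)  # fixed seed so the example is reproducible

# Hypothetical population of 1,000 students and their weekly study hours.
# Assumption for illustration only: 300 regular library users study
# 20-40 hours per week; the other 700 students study 0-10 hours.
library_users = [rng.randint(20, 40) for _ in range(300)]
other_students = [rng.randint(0, 10) for _ in range(700)]
population = library_users + other_students

# Biased design: survey 50 students at the library, so students who
# never go to the library have no chance of being selected.
biased_sample = rng.sample(library_users, 50)

# Representative design: simple random sample of 50 from everyone.
random_sample = rng.sample(population, 50)

population_mean = statistics.mean(population)
biased_mean = statistics.mean(biased_sample)
random_mean = statistics.mean(random_sample)

print(f"population mean:           {population_mean:.1f}")
print(f"library-only sample mean:  {biased_mean:.1f}")
print(f"simple random sample mean: {random_mean:.1f}")
```

The library-only sample overstates the typical study time because part of the population could never be selected, no matter how many students are surveyed there.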

There are several ways sampling bias can occur, such as:

Case Study III: Sampling Bias (1/2)

NPR: Illegal Immigrants Reluctant to Fill Out Census Form

NPR: See 200 Years Of Twists And Turns Of Census Citizenship Questions

Case Study III: Sampling Bias (2/2)

US Census: Citizenship Question Effects on Household Survey Response

Exploratory Analysis to Inference

What is Exploratory Analysis? It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.

Inference

What is inference? It is the process of drawing conclusions about a population based on sample data. This involves using data from a sample to make generalizations, predictions, or decisions about a larger group.

Types of Inference

| | Parameter Estimation | Hypothesis Testing |
|---|---|---|
| Goal | Estimate an unknown population value | Assess claims about a population value |
| Methods | Point Estimation: A single value estimate (e.g., sample mean). Interval Estimation: A range of plausible values (e.g., confidence interval). | State a null and an alternative hypothesis. Compute a test statistic and compare it to a critical value, or compare its p-value to a significance level. |
| Key Concept | Focuses on precision in estimation (confidence intervals) | Focuses on decision-making based on evidence (reject or fail to reject the null hypothesis) |

\(\star\) Key Idea: Parameter estimation focuses on finding the best estimate of an unknown population value, while hypothesis testing determines whether there is enough evidence to support or reject a claim about the population.
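
Both kinds of inference can be sketched on one small data set. The following is a minimal Python illustration, not a prescribed method: the twelve study-hour values and the null value of 12 hours are hypothetical, and the interval and test use the normal approximation for simplicity.

```python
import math
import statistics

# Hypothetical sample: weekly study hours reported by 12 students.
hours = [12, 15, 9, 20, 14, 18, 11, 16, 13, 17, 10, 19]
n = len(hours)

# --- Parameter estimation ---
point_estimate = statistics.mean(hours)             # point estimate of the mean
std_error = statistics.stdev(hours) / math.sqrt(n)  # standard error of the mean

# 95% interval via the normal approximation (z is about 1.96). For a sample
# this small, a t critical value would give a somewhat wider interval.
z = statistics.NormalDist().inv_cdf(0.975)
interval = (point_estimate - z * std_error, point_estimate + z * std_error)

# --- Hypothesis testing ---
# Assumed null hypothesis: the population mean is 12 hours per week.
null_mean = 12
z_stat = (point_estimate - null_mean) / std_error
p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z_stat)))  # two-sided

print(f"point estimate: {point_estimate:.2f}")
print(f"95% interval:   ({interval[0]:.2f}, {interval[1]:.2f})")
print(f"z statistic:    {z_stat:.2f}, p-value: {p_value:.4f}")
```

The first half reports an estimate together with its precision; the second half turns the same numbers into a decision about a specific claim.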

Obtaining Good Samples

Simple Random Sample

What is simple random sampling? Randomly select cases from the population, where there is no implied connection between the points that are selected.
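
As a minimal Python sketch of this idea, the example below draws a simple random sample from a hypothetical sampling frame of 1,000 ID numbers; the population size, sample size, and seed are assumptions made for illustration.

```python
import random

# Hypothetical sampling frame: ID numbers for a population of 1,000 cases.
population = list(range(1, 1001))

rng = random.Random(361)  # fixed seed so the draw is reproducible

# Draw 25 cases without replacement; every case is equally likely to be
# chosen, and selecting one case says nothing about which others are chosen.
srs = rng.sample(population, k=25)

print(sorted(srs))
```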

Stratified Sample

What is stratified sampling? Strata are made up of similar observations. We take a simple random sample from each stratum.
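
A stratified design can be sketched in Python as one simple random sample per stratum. The helper name `stratified_sample`, the class-year strata, and their sizes are hypothetical and used only for illustration.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Take a simple random sample of n_per_stratum cases from every stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(cases, n_per_stratum)
            for name, cases in strata.items()}

# Hypothetical strata: students grouped by class year, so that cases
# within a stratum are similar to one another.
strata = {
    "first-year": [f"FY{i}" for i in range(200)],
    "sophomore": [f"SO{i}" for i in range(180)],
    "junior": [f"JR{i}" for i in range(160)],
    "senior": [f"SR{i}" for i in range(150)],
}

sample = stratified_sample(strata, n_per_stratum=5, seed=361)
for year, students in sample.items():
    print(year, students)
```

Every stratum is guaranteed to appear in the sample, which a single simple random sample of 20 students would not ensure.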

Cluster Sample

What is cluster sampling? Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then sample all observations in the selected clusters.

Multistage Sample

What is multistage sampling? Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then take a simple random sample of observations from the sampled clusters.
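
Cluster and multistage sampling share the first step (randomly choosing clusters) and differ only in what happens inside the chosen clusters. The Python sketch below contrasts the two; the function names and the village and household data are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every observation in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then take a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical population: 6 villages (clusters) of 30 households each.
villages = {f"village_{v}": [f"v{v}_h{h}" for h in range(30)]
            for v in range(1, 7)}

rng = random.Random(361)
one_stage = cluster_sample(villages, n_clusters=2, rng=rng)
two_stage = multistage_sample(villages, n_clusters=2, n_per_cluster=5, rng=rng)

print({name: len(households) for name, households in one_stage.items()})
print({name: len(households) for name, households in two_stage.items()})
```

In the cluster sample each selected village contributes all 30 households, while in the multistage sample each selected village contributes only 5.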

Case Study IV: Stratified vs Clustered Sampling

Scenario: A hospital wants to survey nurse job satisfaction across departments.

| Sampling Method | Process | Example |
|---|---|---|
| Stratified | Divide nurses into departments (strata) and sample proportionally from each group. | 8 emergency, 6 ICU, 10 pediatrics, 16 general medicine |
| Clustered | Divide hospital into floors (clusters) and randomly select entire floors, surveying all nurses there. | Select 2 random floors and survey all nurses on those floors. |
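
Sampling "proportionally" means each department's share of the sample matches its share of the nursing staff. The Python sketch below works through that arithmetic. The department head counts are assumptions (the scenario does not state them), chosen so that a sample of 40 reproduces the 8, 6, 10, 16 allocation in the table.

```python
# Hypothetical head counts per department; not given in the scenario.
department_sizes = {
    "emergency": 80,
    "ICU": 60,
    "pediatrics": 100,
    "general medicine": 160,
}
total_sample = 40

total_nurses = sum(department_sizes.values())

# Proportional allocation: sample size for a stratum =
# total sample size * (stratum size / population size).
allocation = {
    dept: round(total_sample * size / total_nurses)
    for dept, size in department_sizes.items()
}

for dept, n in allocation.items():
    print(f"{dept}: survey {n} of {department_sizes[dept]} nurses")
```

With less convenient head counts the rounded allocations may not sum exactly to the target, so one or two strata may need a manual adjustment.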

Key Difference:

Types of Studies

| Observational | Experimental |
|---|---|
| Researchers observe subjects without interference. | Researchers intervene by applying treatments to subjects. |
| No treatment or manipulation is imposed. | Includes control and treatment groups, ideally with random assignment. |
| Used to find associations, not causation. | Can determine causal relationships. |

\(\star\) Key Difference: Observational studies find patterns, while experimental studies test cause-and-effect.

Case Study V

Is there a relationship between smoking and lung cancer?

Study Design:

Findings:

\(\star\) Since this study is observational, it cannot prove that smoking causes lung cancer; other factors (e.g., genetics, pollution) may also contribute. However, consistently strong associations across multiple studies strengthen the case for a causal link.

Types of Observational Studies

| Aspect | Case-Control | Cohort (Longitudinal) | Cross-Sectional |
|---|---|---|---|
| Study Design | Compares individuals with a condition (cases) to those without (controls). | Follows groups of individuals over time, observing exposures and outcomes. | Measures a population at a single point in time, observing various variables. |
| Main Focus | Identifying exposures or risk factors associated with an outcome. | Observing how exposures lead to outcomes over time. | Examining the prevalence of variables or conditions at a given time. |
| Temporal Sequence | Retrospective: looks back in time to find past exposures. | Prospective: follows participants forward in time. | No temporal sequence: snapshot of a population at a single time point. |
| Data Collection | Collects past data (often using medical records or interviews). | Collects data over time, often requiring repeated observations or surveys. | Collects data at one point in time. |

\(\star\) Key Differences: Case-control studies look back at past data, cohort studies follow subjects over time, and cross-sectional studies look at data from a single point in time.

Strengths and Limitations of Observational Studies

| Aspect | Case-Control | Cohort (Longitudinal) | Cross-Sectional |
|---|---|---|---|
| Strengths | Good for studying rare diseases, cost-effective, relatively quick. | Can establish temporal relationships, good for studying causes and effects. | Quick, inexpensive, good for identifying associations. |
| Limitations | Cannot establish causality, prone to recall bias. | Expensive, time-consuming, and prone to participant attrition. | Cannot determine causality, only associations. |

\(\star\) Key Similarity: All observational studies share the same limitation: they cannot determine causality, only associations.

Prospective vs Retrospective Observational Studies

| Study Type | Description | Strengths | Limitations |
|---|---|---|---|
| Prospective Study | Researchers follow subjects forward in time, starting with an exposure and observing future outcomes. | Can establish a temporal relationship between exposure and outcome, reduces recall bias. | Expensive, time-consuming, potential participant dropout. |
| Retrospective Study | Researchers analyze past data, identifying subjects with an outcome and looking back to determine exposure. | Quick, cost-effective, useful for rare diseases or long-term effects. | Prone to recall bias, missing or incomplete data, cannot establish causality. |

\(\star\) Key Differences: Prospective studies collect data from the present onward, while retrospective studies use data that were collected in the past.

Case Study VI

Is there a relationship between hypertension and stroke incidence in an older population?

Study Design:

Findings:

\(\star\) This is an example of a retrospective cohort study because the data were collected in the past and the design compares groups.

Case Study VII

Do energy gels make a person run faster?

Study Design:

Findings:

\(\star\) This is an example of an experimental study because the design involves an intervention: a treatment group (with the intervention) is compared to a control group (without it).

Case Study VII: Blocking

Do energy gels make a person run faster? Since it is suspected that energy gels might affect pro and amateur athletes differently, we block for pro status.

Study Design:

Findings:

\(\dagger\) Why is blocking important? Can you think of other variables to block for?

\(\star\) Since this is an experimental study with random assignment, a difference in running times between the groups can support a causal conclusion about the use of energy gels and faster running.

Principles of Experimental Design

| Principle | Description |
|---|---|
| Control | Compare treatment of interest to a control group. |
| Randomize | Randomly assign subjects to treatments, and randomly sample from the population whenever possible. |
| Replicate | Within a study, replicate by collecting a sufficiently large sample. Or replicate the entire study. |
| Block | If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups. |

\(\star\) Key Idea: Experimental studies establish a cause-and-effect relationship by manipulating independent variables and observing their impact on dependent variables while controlling for confounding factors.
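
The control, randomize, and block principles can be combined in a few lines. The Python sketch below uses the energy gel scenario with pro status as the blocking variable; the function name `blocked_assignment`, the 20 runners, and the 8/12 split between pros and amateurs are hypothetical.

```python
import random

def blocked_assignment(subjects, block_of, seed=None):
    """Within each block, randomly assign half to treatment, the rest to control."""
    rng = random.Random(seed)

    # Step 1 (block): group subjects by the blocking variable.
    blocks = {}
    for subject in subjects:
        blocks.setdefault(block_of[subject], []).append(subject)

    # Step 2 (randomize): shuffle within each block, then split in half.
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = "treatment"  # receives energy gels
        for subject in shuffled[half:]:
            assignment[subject] = "control"    # no energy gels
    return assignment

# Hypothetical runners: 8 pros and 12 amateurs.
status = {f"runner_{i}": ("pro" if i < 8 else "amateur") for i in range(20)}
groups = blocked_assignment(list(status), status, seed=361)

for block in ("pro", "amateur"):
    treated = sum(1 for r, g in groups.items()
                  if status[r] == block and g == "treatment")
    print(f"{block}: {treated} treatment")
```

Because randomization happens inside each block, pros and amateurs are split evenly between treatment and control, so pro status cannot explain a difference between the two groups.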

Blocking vs Explanatory Variables

| Aspect | Blocking | Explanatory |
|---|---|---|
| Definition | Characteristics that experimental units come with and that we want to control for. | Variables that we manipulate or observe to explain the outcome of the experiment. |
| Purpose | Used to reduce variability by grouping experimental units with similar traits. | Used to explore or test the effect of a treatment or intervention on outcomes. |
| Role in Experiment | Serve as a way to control for potential confounders and reduce bias. | Act as the independent variable(s) whose effect on the dependent variable is tested. |
| Timing in Experiment | Applied before random assignment to ensure balanced groups. | Manipulated or measured during the experiment to observe their effect. |

\(\star\) Key Idea: Explanatory variables are the factors tested for their impact, while blocking variables are used to group subjects and reduce confounding effects.

More Experimental Design Terminology

Random Assignment vs Random Sampling

Activity: Project Phase 1 - Group Formation and Data Selection

  1. Log in to Posit Cloud and open your Calculator [First name][Last initial].
  2. Create a new RMarkdown file and add your name and the names of your group members. Remove the default text.
  3. Choose your top 3 data sets from the Project Data Sets page. Install and load the necessary packages, then look up information about each data set you have chosen by using the ? command. For example, ?iris outputs information about the iris data frame.
  4. In your RMarkdown file, create a section for each data set chosen and start exploring it. For this phase of the project, consider the following when exploring the data sets:
    1. Install and load the necessary packages, then load the data and, if necessary, convert it to a tibble.
    2. Examine the variables (columns) of the data set and determine the variable types.
    3. Determine if there are missing values (NA values).
  5. When finished, knit your .Rmd to .html, then submit your .Rmd and the newly knitted .html to Moodle.

\(\star\) Your report for this phase will be due on the next phase of the project. Your group and final data sets will be announced.
