MTH-361A | Spring 2025 | University of Portland
February 3, 2025
The guiding principle of statistics is statistical thinking.
Statistical Thinking in the Data Science Life Cycle
Types of Variables
Does there appear to be a relationship between the hours of study per week and the GPA of a student?
\(\star\) As hours of study increase, GPA also tends to increase, but for study hours between roughly 0 and 30 there is a lot of variation. One student has a GPA > 4.0; this is likely a data error.
\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]
When two variables show some connection with one another, they are called associated variables.
Associated variables can also be called dependent variables and vice-versa.
If two variables are not associated, i.e. there is no evident connection between the two, then they are said to be independent.
In general, association does not imply causation, and causation can only be inferred from a randomized experiment.
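One common way to quantify the strength of a linear association between two numerical variables, such as study hours and GPA, is the Pearson correlation coefficient. The sketch below is illustrative only: the `hours` and `gpa` values are made up for the example and are not the data from the plot discussed above, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only
hours = [5, 10, 15, 20, 25, 30, 40, 50]
gpa = [2.8, 3.4, 2.9, 3.6, 3.1, 3.5, 3.7, 3.9]

r = pearson_r(hours, gpa)
print(f"correlation between study hours and GPA: {r:.2f}")
```

A value of r near 0 suggests little linear association, while values near 1 or -1 suggest a strong one. Even a strong correlation describes association only; it does not by itself show that studying more causes a higher GPA.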
Research Question: Can people become better, more efficient runners on their own, merely by running?
\(\star\) Results from the random sample can only be generalized to adult women, not to all people.
Research Question: Can smoking contribute to negative health outcomes?
\(\star\) Anecdotal evidence refers to information or conclusions drawn from personal experiences, individual stories, or isolated examples rather than systematic data collection or rigorous scientific analysis.
Research Question: Can smoking contribute to negative health outcomes?
\(\star\) Overgeneralization is a logical fallacy where a conclusion is drawn from insufficient or unrepresentative evidence, applying it too broadly.
Wouldn’t it be better to just include everyone and “sample” the entire population?
This is called a Census.
Problems with taking a census:
What is sampling bias? It occurs when the sample collected for a study or survey is not representative of the larger population that the study aims to analyze.
There are several ways sampling bias can occur, such as:
What is Exploratory Analysis? It is the process of analyzing and summarizing datasets to uncover patterns, trends, relationships, and anomalies before inference.
What is inference? It is the process of drawing conclusions about a population based on sample data. This involves using data from a sample to make generalizations, predictions, or decisions about a larger group.
Aspect | Parameter Estimation | Hypothesis Testing |
---|---|---|
Goal | Estimate an unknown population value | Assess claims about a population value |
Methods | Point estimation: a single value estimate (e.g., sample mean). Interval estimation: a range of plausible values (e.g., confidence interval). | State a null and an alternative hypothesis. Compute a test statistic and compare it to a critical value, or compare its p-value to a significance level. |
Key Concept | Focuses on precision in estimation (confidence intervals) | Focuses on decision-making based on evidence (reject or fail to reject the null hypothesis) |
\(\star\) Key Idea: Parameter estimation focuses on finding the best estimate of an unknown population value, while hypothesis testing determines whether there is enough evidence to support or reject a claim about the population.
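To make the contrast concrete, here is a minimal sketch that runs both procedures on the same sample. It assumes a normal approximation for the sampling distribution of the mean (a t-based procedure would be more appropriate for a sample this small), and the GPA values, the null value `mu0 = 3.0`, and the function name `estimate_and_test` are all hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def estimate_and_test(sample, mu0, alpha=0.05):
    """Return (point estimate, confidence interval, z statistic, p-value).

    Uses a normal approximation; this is an assumption of the sketch.
    """
    n = len(sample)
    xbar = mean(sample)                       # point estimate of the population mean
    se = stdev(sample) / math.sqrt(n)         # estimated standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (xbar - z_crit * se, xbar + z_crit * se)   # interval estimate
    z = (xbar - mu0) / se                     # test statistic for H0: mu = mu0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided p-value
    return xbar, ci, z, p_value

# Hypothetical sample of GPAs, invented for illustration only
gpas = [3.1, 3.4, 2.9, 3.7, 3.3, 3.0, 3.6, 3.2, 3.5, 3.3]
xbar, ci, z, p = estimate_and_test(gpas, mu0=3.0)

print(f"point estimate: {xbar:.2f}")
print(f"95% confidence interval: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"z statistic: {z:.2f}, p-value: {p:.4f}")
print("reject H0" if p < 0.05 else "fail to reject H0")
```

The first two printed lines answer the estimation question (what is a plausible value of the population mean?), while the last two answer the testing question (is the sample consistent with the claimed value `mu0`?).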
Almost all statistical methods are based on the notion of implied randomness.
The most commonly used random sampling techniques are simple random, stratified, and cluster sampling.
What is simple random sampling? Randomly select cases from the population, where there is no implied connection between the points that are selected.
What is stratified sampling? Strata are made up of similar observations. We take a simple random sample from each stratum.
What is cluster sampling? Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then sample all observations in the selected clusters.
What is multistage sampling? Clusters are usually not made up of homogeneous observations. We take a simple random sample of clusters, and then take a simple random sample of observations from the sampled clusters.
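The four sampling schemes above can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the population of four groups with ten labelled units each is hypothetical, the same groups are reused as strata and as clusters purely for brevity, and the function names are made up for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n units from the whole population without replacement."""
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    """Take a simple random sample within every stratum."""
    return {name: rng.sample(units, n_per_stratum)
            for name, units in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick clusters, then keep every unit in the chosen clusters."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def multistage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Randomly pick clusters, then take a simple random sample within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit
            for name in chosen
            for unit in rng.sample(clusters[name], n_per_cluster)]

rng = random.Random(361)  # fixed seed so the sketch is reproducible

# Hypothetical population: groups A-D with 10 labelled units each
groups = {g: [f"{g}{i}" for i in range(10)] for g in "ABCD"}
everyone = [unit for units in groups.values() for unit in units]

srs = simple_random_sample(everyone, 8, rng)
strat = stratified_sample(groups, 2, rng)
clus = cluster_sample(groups, 2, rng)
multi = multistage_sample(groups, 2, 3, rng)

print("simple random:", srs)
print("stratified:   ", strat)
print("cluster:      ", clus)
print("multistage:   ", multi)
```

Note how the stratified sample always contains units from every group, while the cluster and multistage samples contain units from only the selected groups.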
Scenario: A hospital wants to survey nurse job satisfaction across departments.
Sampling Method | Process | Example |
---|---|---|
Stratified | Divide nurses into departments (strata) and sample proportionally from each group. | 8 emergency, 6 ICU, 10 pediatrics, 16 general medicine |
Clustered | Divide hospital into floors (clusters) and randomly select entire floors, surveying all nurses there. | Select 2 random floors and survey all nurses on those floors. |
Key Difference:
Observational | Experimental |
---|---|
Researchers observe subjects without interference. | Researchers intervene by applying treatments to subjects. |
No treatment or manipulation is imposed. | Includes control and treatment groups, ideally with random assignment. |
Used to find associations, not causation. | Can determine causal relationships. |
\(\star\) Key Difference: Observational studies find patterns, while experimental studies test cause-and-effect.
Is there a relationship between smoking and lung cancer?
Study Design:
Findings:
\(\star\) Since this is observational, it cannot prove that smoking causes lung cancer: other factors (e.g., genetics, pollution) may also contribute. However, strong correlations from multiple studies can strengthen this conclusion.
Aspect | Case-Control | Cohort (Longitudinal) | Cross-Sectional |
---|---|---|---|
Study Design | Compares individuals with a condition (cases) to those without (controls). | Follows groups of individuals over time, observing exposures and outcomes. | Measures a population at a single point in time, observing various variables. |
Main Focus | Identifying exposures or risk factors associated with an outcome. | Observing how exposures lead to outcomes over time. | Examining the prevalence of variables or conditions at a given time. |
Temporal Sequence | Retrospective: looks back in time to find past exposures. | Prospective: follows participants forward in time. | No temporal sequence: snapshot of a population at a single time point. |
Data Collection | Collects past data (often using medical records or interviews). | Collects data over time, often requiring repeated observations or surveys. | Collects data at one point in time. |
\(\star\) Key Differences: Case-control studies look back at past data, cohort studies follow subjects over time, and cross-sectional studies look at data at one point in time.
Aspect | Case-Control | Cohort (Longitudinal) | Cross-Sectional |
---|---|---|---|
Strengths | Good for studying rare diseases, cost-effective, relatively quick. | Can establish temporal relationships, good for studying causes and effects. | Quick, inexpensive, good for identifying associations. |
Limitations | Cannot establish causality, prone to recall bias. | Expensive, time-consuming, and prone to participant attrition. | Cannot determine causality, only associations. |
\(\star\) Key Similarity: A shared limitation of observational studies is that they cannot determine causality, only associations.
Study Type | Description | Strengths | Limitations |
---|---|---|---|
Prospective Study | Researchers follow subjects forward in time, starting with an exposure and observing future outcomes. | Can establish a temporal relationship between exposure and outcome, reduces recall bias. | Expensive, time-consuming, potential participant dropout. |
Retrospective Study | Researchers analyze past data, identifying subjects with an outcome and looking back to determine exposure. | Quick, cost-effective, useful for rare diseases or long-term effects. | Prone to recall bias, missing or incomplete data, cannot establish causality. |
\(\star\) Key Differences: Prospective studies use present and future data, while retrospective studies use past data.
Is there a relationship between hypertension and stroke incidence in an older population?
Study Design:
Findings:
\(\star\) This is an example of a retrospective cohort study because the data were collected in the past and the design compares groups.
Do energy gels make a person run faster?
Study Design:
Findings:
\(\star\) This is an example of an experimental study because the design involves an intervention: a treatment group (with the intervention) is compared to a control group (without it).
Do energy gels make a person run faster? Since it is suspected that energy gels might affect pro and amateur athletes differently, we block for pro status.
Study Design:
Findings:
\(\dagger\) Why is blocking important? Can you think of other variables to block for?
\(\star\) Since this is an experimental study, we can conclude a causal relationship between use of energy gels and faster running.
Principle | Description |
---|---|
Control | Compare treatment of interest to a control group. |
Randomize | Randomly assign subjects to treatments, and randomly sample from the population whenever possible. |
Replicate | Within a study, replicate by collecting a sufficiently large sample, or replicate the entire study. |
Block | If there are variables that are known or suspected to affect the response variable, first group subjects into blocks based on these variables, and then randomize cases within each block to treatment groups. |
\(\star\) Key Idea: Experimental studies establish a cause-and-effect relationship by manipulating independent variables and observing their impact on dependent variables while controlling for confounding factors.
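The Block and Randomize principles can be combined as in the following minimal sketch, which mirrors the energy gel example by blocking on pro status. The runner IDs, group sizes, treatment labels, and the function name `blocked_assignment` are all hypothetical, chosen only to illustrate the idea.

```python
import random

def blocked_assignment(subjects, block_of, treatments, rng):
    """Randomly assign treatments to subjects separately within each block.

    subjects:   list of subject IDs
    block_of:   dict mapping subject ID -> block label
    treatments: list of treatment labels
    """
    # Step 1 (Block): group subjects by their blocking variable
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of[s], []).append(s)

    # Step 2 (Randomize): shuffle within each block, then deal out
    # treatments in turn so each block is split as evenly as possible
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)
        for i, s in enumerate(shuffled):
            assignment[s] = treatments[i % len(treatments)]
    return assignment

rng = random.Random(2025)  # fixed seed so the sketch is reproducible

# Hypothetical runners: 6 pros and 6 amateurs
status = {f"pro{i}": "pro" for i in range(6)}
status.update({f"am{i}": "amateur" for i in range(6)})

assignment = blocked_assignment(list(status), status, ["gel", "no gel"], rng)
for runner, treatment in sorted(assignment.items()):
    print(runner, status[runner], treatment)
```

Because randomization happens inside each block, both the treatment and control groups end up with the same number of pros and amateurs, so pro status cannot be confounded with the treatment.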
Aspect | Blocking | Explanatory |
---|---|---|
Definition | Characteristics that experimental units come with and that we want to control for. | Variables that we manipulate or observe to explain the outcome of the experiment. |
Purpose | Used to reduce variability by grouping experimental units with similar traits. | Used to explore or test the effect of a treatment or intervention on outcomes. |
Role in Experiment | Serve as a way to control for potential confounders and reduce bias. | Act as the independent variable(s) whose effect on the dependent variable is tested. |
Timing in Experiment | Applied before random assignment to ensure balanced groups. | Manipulated or measured during the experiment to observe their effect. |
\(\star\) Key Idea: Explanatory variables are factors tested for their impact, while blocking groups subjects to reduce confounding effects.
[First name][Last initial]
... the ? command. For example, ?iris outputs information about the iris data frame.
... NA values).
... .Rmd to .html, then submit your .Rmd and the recently knitted .html to Moodle.
\(\star\) Your report for this phase will be due on the next phase of the project. Your group and final data sets will be announced.