Chi-Squared Tests

Applied Statistics

MTH-361A | Spring 2025 | University of Portland

March 21, 2025

Objectives

Develop an understanding of inference for contingency tables
Know how to compute the chi-square test statistic
Practice applying the chi-square test
Activity: Test for Independence

These slides are derived from Diez et al. (2012).

Previously… (1/3)

Confidence Interval for One Proportion

\[\hat{p} \pm z^{\star} \text{SE}_{\hat{p}}\]

\[ \begin{aligned} \hat{p} & \longrightarrow \text{sample proportion (or the point estimate)} \\ z^{\star} & \longrightarrow \text{critical z-score at a given confidence level} \\ \text{SE}_{\hat{p}} & \longrightarrow \text{standard error of the sampling distribution} \\ \end{aligned} \]
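As a rough sketch, the interval can be computed in R as follows. The counts `x` and `n` are hypothetical values chosen only for illustration, and the standard error uses the usual formula \(\sqrt{\hat{p}(1-\hat{p})/n}\), which is assumed here rather than stated on this slide.

```r
# Hypothetical data (not from the slides): 60 successes in 100 trials
x <- 60
n <- 100

p_hat  <- x / n                          # sample proportion
se     <- sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
z_star <- qnorm(0.975)                   # critical z-score for 95% confidence

ci <- p_hat + c(-1, 1) * z_star * se     # lower and upper bounds
ci
```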

Previously… (2/3)

Hypothesis Testing for One Proportion

\[ \begin{aligned} p & \longrightarrow \text{population proportion} \\ \hat{p} & \longrightarrow \text{sample proportion (or the point estimate)} \\ H_0: p = p_0 & \longrightarrow \text{null hypothesis} \\ H_A: p \ne p_0 & \longrightarrow \text{alternative hypothesis (can be } < \text{ or } > \text{)} \\ z & \longrightarrow \text{test statistic} \\ \text{SE}_{p} & \longrightarrow \text{standard error of the null distribution} \\ \end{aligned} \]
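A minimal sketch of this test in R, assuming the usual formulas \(\text{SE}_{p} = \sqrt{p_0(1-p_0)/n}\) and \(z = (\hat{p} - p_0)/\text{SE}_{p}\), which are not written out on this slide. The counts and the null value `p0` are hypothetical.

```r
# Hypothetical data (not from the slides): 60 successes in 100 trials
x  <- 60
n  <- 100
p0 <- 0.5                             # null value of the proportion

p_hat <- x / n                        # sample proportion
se0   <- sqrt(p0 * (1 - p0) / n)      # standard error under H0
z     <- (p_hat - p0) / se0           # test statistic

p_value <- 2 * pnorm(-abs(z))         # two-sided p-value
c(z = z, p_value = p_value)
```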

Previously… (3/3)

Relationship Between Variables

\[\text{explanatory variable} \xrightarrow{\text{might affect}} \text{response variable}\]

Associated vs Independent Variables

When two variables show some connection with one another, they are called associated or dependent variables.
In general, association does not imply causation, and causation can only be inferred from a randomized experiment.

Example 1

The contingency table below shows the distribution of survival and different classes of passengers on the Titanic.

	1st	2nd	3rd	crew	Sum
no	123	166	528	679	1496
yes	201	118	181	211	711
Sum	324	284	709	890	2207
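One way to enter this table in R is as a matrix of observed counts. The object name `titanic` and the dimension labels are illustrative choices. `addmargins()` then reproduces the Sum row and column.

```r
# Observed counts from the table: rows = survived, columns = class
titanic <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

addmargins(titanic)  # appends the row and column sums
```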

Goals:

To analyze whether the observed survival frequencies within a passenger class differ significantly from the expected frequencies.
To test whether class and survived are independent by comparing observed and expected frequencies in a two-way table.

The Chi-Squared Test

The chi-square statistic is used in hypothesis testing to determine whether observed data differ significantly from expected data. It is commonly used for categorical data analysis.

\[\chi^2 = \sum \frac{(O-E)^2}{E}\]

\(O\) is the observed frequency
\(E\) is the expected frequency

The degrees of freedom (df) in a chi-square test depend on the type of test being conducted:

For a One-Way Table: \(df = r - 1\) where \(r\) is the number of categories.
For a Two-Way Table: \(df = (r - 1) \times (c - 1)\) where \(r\) is the number of rows and \(c\) is the number of columns.
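The formulas above translate directly into R. This is only a sketch: the helper names `chisq_stat`, `df_one_way`, and `df_two_way` are hypothetical, and the counts in `O` and `E` are made up to keep the arithmetic easy to check by hand.

```r
# Hypothetical helper: chi-square statistic from observed and expected counts
chisq_stat <- function(O, E) sum((O - E)^2 / E)

# Hypothetical helpers for the degrees of freedom
df_one_way <- function(r) r - 1
df_two_way <- function(r, c) (r - 1) * (c - 1)

# Made-up one-way table: 3 categories, equal expected counts
O <- c(18, 22, 20)
E <- c(20, 20, 20)

chisq_stat(O, E)   # (4 + 4 + 0) / 20 = 0.4
df_one_way(3)      # 2
df_two_way(2, 4)   # 3, e.g. a table with 2 rows and 4 columns
```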

Example 1

1st Class Distribution

\[ \begin{aligned} H_0: & \text{The observed distribution of survived among 1st class passengers follows the expectation.} \\ H_A: & \text{The observed distribution of survived among 1st class passengers differs from the expectation.} \end{aligned} \]

	1st
no	123
yes	201

The expected proportions are \(\frac{1}{2}\) assuming the passengers are equally likely to survive.
The expected frequency based on the data is \(\left( \frac{1}{2} \right) 324 = 162\) for each survived category.

Set \(\alpha = 0.05\) and compute the \(\chi^2\) statistic.

\[ \begin{aligned} \chi^2 & = \frac{(123 - 162)^2}{162} + \frac{(201 - 162)^2}{162} \\ & = 18.78 \end{aligned} \]

Determine the degrees of freedom: with two categories, \(df = 2 - 1 = 1\).
Compute the p-value using R.

df <- 1 # define degrees of freedom
chisq <- 18.78 # set chi-square statistic
1 - pchisq(chisq, df) # compute the p-value (upper-tail probability)

## [1] 1.466974e-05

Since the p-value is less than \(\alpha\), we reject \(H_0\). We have enough evidence to support \(H_A\): the distribution of survived among 1st class passengers differs from the expected equal split.
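The hand computation can be cross-checked with base R's `chisq.test()`, which performs a goodness-of-fit test when given a vector of counts and expected proportions `p`. The object names below are illustrative. Because R uses the unrounded statistic, the p-value may differ slightly from the one obtained with 18.78.

```r
observed <- c(no = 123, yes = 201)   # 1st class counts from the table

# Goodness-of-fit test against equal expected proportions
result <- chisq.test(observed, p = c(0.5, 0.5))

result$expected    # expected counts: 162 and 162
result$statistic   # chi-square statistic, about 18.78
result$parameter   # degrees of freedom: 1
result$p.value     # p-value, on the order of 1e-05
```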

Example 1: The Chi-Square Sampling Distribution
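A picture of this distribution can be sketched in base R. The snippet below is only one possible rendering: it draws the chi-square density with \(df = 1\), marks the statistic 18.78 from Example 1, and marks the 5% critical value for comparison. The plotting range and colors are arbitrary choices.

```r
df       <- 1
observed <- 18.78                       # statistic from Example 1
critical <- qchisq(0.95, df)            # cutoff for alpha = 0.05, about 3.84

# Density of the chi-square distribution with 1 degree of freedom
curve(dchisq(x, df), from = 0.05, to = 25,
      xlab = "chi-square", ylab = "density")
abline(v = critical, lty = 2)           # dashed line: critical value
abline(v = observed, col = "red")       # red line: observed statistic

observed > critical                     # TRUE: the statistic is in the rejection region
```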

Example 2: Inference for One-Way Table

2nd Class Distribution

\[ \begin{aligned} H_0: & \text{The observed distribution of survived among 2nd class passengers follows the expectation.} \\ H_A: & \text{The observed distribution of survived among 2nd class passengers differs from the expectation.} \end{aligned} \]

	2nd
no	166
yes	118

The expected proportions are \(\frac{1}{2}\) assuming the passengers are equally likely to survive.
The expected frequency based on the data is \(\left( \frac{1}{2} \right) 284 = 142\) for each survived category.

\(\dagger\) Determine the \(\chi^2\) test statistic and the p-value. What is your conclusion?

Example 3

	1st	2nd	3rd	crew	Sum
no	123	166	528	679	1496
yes	201	118	181	211	711
Sum	324	284	709	890	2207

\[ \begin{aligned} H_0: & \text{There is no association between class and survived. The variables are independent.} \\ H_A: & \text{There is an association between class and survived. The variables are dependent.} \end{aligned} \]
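One possible way to run this test in R is to pass the table of observed counts to `chisq.test()`, which carries out a test of independence when given a matrix. The object names are illustrative, and the result still has to be compared with a chosen \(\alpha\).

```r
# Observed counts from the table: rows = survived, columns = class
titanic <- matrix(
  c(123, 166, 528, 679,
    201, 118, 181, 211),
  nrow = 2, byrow = TRUE,
  dimnames = list(survived = c("no", "yes"),
                  class = c("1st", "2nd", "3rd", "crew"))
)

test <- chisq.test(titanic)   # test of independence for a two-way table

round(test$expected, 1)       # expected counts under independence
test$statistic                # chi-square statistic
test$parameter                # degrees of freedom: (2 - 1) * (4 - 1) = 3
test$p.value                  # compare with the chosen alpha
```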

Example 3

Consider the following problem description.

Students in grades 4-6 were asked whether good grades, athletic ability, or popularity was most important to them. A two-way table separating the students by grade and by choice of most important factor is shown below. Do these data provide evidence to suggest that goals vary by grade?

	Grade	Popular	Sports
4th	63	31	23
5th	88	55	33
6th	96	55	32

Source: Popular Kids Dataset. This is from a 1992 study and was revisited 30 years later.

Example 3: The Chi-Squared Test for Independence (1/2)

The null and alternative hypotheses: \[H_0: \text{Grade and goals are independent. Goals do not vary by grade.}\] \[H_A: \text{Grade and goals are dependent. Goals vary by grade.}\]

Example 3: The Chi-Squared Test for Independence (2/2)

The Chi-Squared test statistic \[\chi^2_{k} = \sum_{i=1}^n \frac{(O_i - E_i)^2}{E_i}\]
- \(O_i\) is the number of observations of type \(i\)
- \(E_i\) is the expected frequency of type \(i\)
- \(n\) is the number of cells in the table
- \(k = (R-1)(C-1)\) is the degrees of freedom, where \(R\) is the number of rows and \(C\) is the number of columns.

Example 3: Computing the \(\chi^2\) statistic - Expected Frequency (1/3)

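Start with the expected frequency of type \(i\), computed from the table margins: \[E_i = \frac{(\text{row total})(\text{column total})}{\text{table total}}\]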

	Grade	Popular	Sports	Total
4th	\(\color{blue}{63}\)	\(\color{orange}{31}\)	23	119
5th	88	55	33	176
6th	96	55	\(\color{red}{32}\)	183
Total	247	141	90	478

Note: Each color corresponds to a cell, and the expected frequencies are rounded to the nearest integer.

\[\color{blue}{E_{4th,Grade} = \frac{(119)(247)}{478} = 61}\] \[\color{orange}{E_{4th,Popular} = \frac{(119)(141)}{478} = 35}\] \[\vdots\] \[\color{red}{E_{6th,Sports} = \frac{(183)(90)}{478} = 34}\]

Example 3: Computing the \(\chi^2\) statistic - Expected Frequency (2/3)

Question - What is the expected count for the highlighted cell?

	Grade	Popular	Sports	Total
4th	63	31	23	119
5th	88	\(\color{green}{55}\)	33	176
6th	96	55	32	183
Total	247	141	90	478

\[\color{green}{E_{5th,Popular} = \frac{(176)(141)}{478} = 52}\]

Example 3: Computing the \(\chi^2\) statistic - Expected Frequency (3/3)

The expected frequency for each \(\color{blue}{[cell]}\).

	Grade	Popular	Sports	Total
4th	63 \(\color{blue}{[61]}\)	31 \(\color{blue}{[35]}\)	23 \(\color{blue}{[22]}\)	119
5th	88 \(\color{blue}{[91]}\)	55 \(\color{blue}{[52]}\)	33 \(\color{blue}{[33]}\)	176
6th	96 \(\color{blue}{[95]}\)	55 \(\color{blue}{[54]}\)	32 \(\color{blue}{[34]}\)	183
Total	247	141	90	478
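All nine expected counts can be computed in one step from the row and column totals. This is a sketch with illustrative object names. `outer()` forms every row total times column total product, and dividing by the grand total gives the expected count for each cell.

```r
row_totals <- c("4th" = 119, "5th" = 176, "6th" = 183)
col_totals <- c(Grade = 247, Popular = 141, Sports = 90)
n          <- sum(row_totals)                 # grand total, 478

# Expected count for each cell: (row total * column total) / grand total
expected <- outer(row_totals, col_totals) / n

round(expected, 2)   # unrounded values, to two decimals
round(expected)      # rounded to the nearest integer
```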

Example 3: Computing the \(\chi^2\) statistic

The \(\chi^2\) statistic. \[\chi^2_{k} = \frac{(63-61)^2}{61} + \frac{(31-35)^2}{35} + \cdots + \frac{(32-34)^2}{34} = 0.967\]
Degrees of freedom. \[k = (3-1) \times (3-1) = 2(2) = 4\]

Example 3: Computing the p-value

\(\chi^2_{k} = 0.967\) and \(k = 4\)
We can use the pchisq function in R.

df <- 4
1-pchisq(0.967,df)

## [1] 0.9147579

The p-value is 0.9148.

Note that in a chi-squared analysis, the p-value is the probability of obtaining a chi-square as large or larger than that in the current experiment.
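In R, "as large or larger" corresponds to the upper tail of the chi-square distribution. As a small illustration, `pchisq()` with `lower.tail = FALSE` returns this tail probability directly and agrees with the `1 - pchisq()` form. The variable names are illustrative.

```r
chisq <- 0.967   # chi-square statistic from the slide
df    <- 4       # degrees of freedom

# Upper-tail probability P(X >= chisq), computed two equivalent ways
p_upper      <- pchisq(chisq, df, lower.tail = FALSE)
p_complement <- 1 - pchisq(chisq, df)

c(p_upper, p_complement)   # both about 0.9148
```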

Example 3: Conclusion

Do these data provide evidence to suggest that goals vary by grade? \[H_0: \text{Grade and goals are independent. Goals do not vary by grade.}\] \[H_A: \text{Grade and goals are dependent. Goals vary by grade}\]
Since the p-value is large, we fail to reject \(H_0\). The data do not provide convincing evidence that grade and goals are dependent. It doesn’t appear that goals vary by grade.

Summary: Steps for \(\chi^2\) Tests

Compute the expected values.
Set the significance value \(\alpha\).
Compute the \(\chi^2\) test statistic and the degrees of freedom \(df\).
Determine the p-value using the \(\chi^2\) test statistic.
Make a conclusion.

\(\star\) Key Idea: The chi-square test assumes independent categorical data with sufficiently large expected counts and compares observed vs. expected frequencies to assess whether deviations are due to chance.
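The steps above can be sketched as a single R function for a two-way table. The function name `chisq_independence` is hypothetical. The check that every expected count is at least 5 is a common rule of thumb assumed here, not a threshold stated in these slides.

```r
# Hypothetical helper that follows the steps above for a two-way table
chisq_independence <- function(observed, alpha = 0.05) {
  # Step 1: expected counts under independence
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # Assumption check: rule of thumb that every expected count is at least 5
  if (any(expected < 5)) warning("some expected counts are below 5")
  # Step 3: test statistic and degrees of freedom
  chisq <- sum((observed - expected)^2 / expected)
  df    <- (nrow(observed) - 1) * (ncol(observed) - 1)
  # Step 4: p-value from the upper tail
  p_value <- pchisq(chisq, df, lower.tail = FALSE)
  # Step 5: decision at the chosen significance level (step 2 is `alpha`)
  list(chisq = chisq, df = df, p_value = p_value,
       reject_H0 = p_value < alpha)
}

# Titanic counts from Example 1: rows = survived (no, yes), columns = class
titanic <- matrix(c(123, 166, 528, 679,
                    201, 118, 181, 211), nrow = 2, byrow = TRUE)
chisq_independence(titanic)
```

The significance level is passed as an argument so that it is fixed before the statistic is computed, matching the order of the steps.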

Activity: Test Independence for Two-Way Tables

Make sure you have a copy of the F 3/21 Worksheet. This will be handed out physically and it is also digitally available on Moodle.
Work on your worksheet by yourself for 10 minutes. Please read the instructions carefully. Ask questions if anything needs clarification.
Get together with another student.
Discuss your results.
Submit your worksheet on Moodle as a .pdf file.

References

Diez, D. M., Barr, C. D., & Çetinkaya-Rundel, M. (2012). OpenIntro statistics (4th ed.). OpenIntro. https://www.openintro.org/book/os/

Speegle, D., & Clair, B. (2021). Probability, statistics, and data: A fresh approach using R. Chapman and Hall/CRC. https://probstatsdata.com/