The Central Limit Theorem (CLT)

Definition:

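Let \(X_1, X_2, \ldots, X_n\) be i.i.d. r.v.s, which denote a statistical sample of size \(n\) from a population distribution with mean \(\mu\) and finite positive variance \(\sigma^2\).
Let \(\overline{X}_n\) be the sample mean, a statistical measurement that is itself an r.v.
As \(n \to \infty\), the distribution of the standardized sample mean \(\frac{\overline{X}_n - \mu}{\sigma / \sqrt{n}}\) converges to the standard normal distribution, so for large \(n\) the distribution of \(\overline{X}_n\) is approximately normal.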

\(\star\) The probability distribution of \(\overline{X}_n\) is called the sampling distribution.

The Sampling Distribution

Definition:

A probability distribution of a statistic.
A statistic can be a proportion, mean, median, standard deviation, and so on.

CLT Conditions:

Independence: Sample values must be independent
Identical distribution: Sample values must come from the same distribution with the same fixed parameters
Finite variance: The population must have a finite variance
Large sample size: A larger sample size improves approximation

The Normal Distribution:

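The sampling distribution of the sample mean is approximately a normal distribution as long as the CLT conditions hold.

\(\star\) The mean of the sampling distribution of the sample mean equals the corresponding parameter of the underlying population (the population mean) for any sample size. A sufficiently large sample size makes the sampling distribution more concentrated around that parameter.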

The CLT Formula

The CLT is very useful because it tells us that, for large \(n\), the sampling distribution of the sample mean is approximately normal, regardless of the shape of the population’s distribution (provided its variance is finite).

Sampling distribution of the sample mean:

For large sample size \(n\), the sampling distribution of the sample mean \(\overline{X}_n\) is approximately normal.
Its expected value is the population mean: \(\displaystyle \text{E}(\overline{X}_n) = \mu\).
The variance is the population variance scaled by a factor of \(\frac{1}{n}\): \(\displaystyle \text{Var}(\overline{X}_n) = \frac{\sigma^2}{n}\).
As we increase \(n\), the variance \(\text{Var}(\overline{X}_n)\) approaches zero, meaning the sample mean \(\overline{X}_n\) concentrates around \(\mu\).
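
The expected value and variance above follow from linearity of expectation and from independence. A short derivation, using only the i.i.d. assumption and writing \(\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\):

\[ \begin{aligned} \text{E}(\overline{X}_n) & = \frac{1}{n}\sum_{i=1}^{n}\text{E}(X_i) = \frac{1}{n} \cdot n\mu = \mu \\ \text{Var}(\overline{X}_n) & = \frac{1}{n^2}\sum_{i=1}^{n}\text{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n} \end{aligned} \]

The second line uses independence: the variance of a sum of independent r.v.s is the sum of their variances.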

The CLT z-score:

Under CLT, the sampling distribution of the sample mean has mean \(\mu\) and variance \(\frac{\sigma^2}{n}\).

We standardize the sample mean using the z-score formula: \[z = \frac{\overline{x} - \mu}{\frac{\sigma}{\sqrt{n}}}\]

\(\star\) The sample standard deviation \(s\) is used when the population standard deviation \(\sigma\) is unknown.

The Sampling Distribution Under CLT

\(\star\) The sampling distribution stays centered at \(\mu\), but its standard deviation is the population standard deviation multiplied by a factor of \(\frac{1}{\sqrt{n}}\).
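
The \(\frac{1}{\sqrt{n}}\) scaling can be checked with a small simulation. The sketch below is only an illustration under assumed settings: the population is a hypothetical Exponential distribution with rate 1 (so \(\mu = 1\) and \(\sigma = 1\)), and the sample size and number of repetitions are arbitrary choices.

```r
# Simulation sketch: sampling distribution of the mean.
# Assumed (hypothetical) population: Exponential(rate = 1), so mu = 1, sigma = 1.
set.seed(1)
n    <- 50      # sample size (arbitrary choice)
reps <- 10000   # number of repeated samples (arbitrary choice)

# Draw `reps` samples of size n and record the mean of each one.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

mean_of_means  <- mean(sample_means)  # should be close to mu = 1
sd_of_means    <- sd(sample_means)    # should be close to sigma / sqrt(n)
theoretical_sd <- 1 / sqrt(n)

print(c(mean = mean_of_means, sd = sd_of_means, theory = theoretical_sd))
```

Calling `hist(sample_means)` afterwards should show a roughly bell-shaped histogram even though the exponential population itself is strongly right-skewed.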

Mean Annual Household Income (1/2)

Suppose that we want to estimate the mean annual income of the city of Fancyland.

Data:

Here, we show a distribution of \(140\) annual household incomes. The numbers are in thousands.

Let \(\overline{x} = 245\) be the estimate.

\(\dagger\) The sample mean \(\overline{x}\) is known as an unbiased estimator of the population mean \(\mu\) because \(\text{E}(\overline{X}) = \mu\) over repeated sampling.

Sampling distribution:

Suppose that we sample \(140\) annual incomes again from the same city and record the mean annual income and repeat this multiple times.
The sample means will vary but we can assume that the sampling distribution of these means is approximately normal.
Even though the annual income data are right-skewed, the sampling distribution of the sample mean is approximately normal under the CLT.

\(\star\) Practical constraints usually prevent us from collecting the data multiple times. That is why we rely on the CLT, as long as its conditions hold, to describe the sampling distribution from a single sample.
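
Repeated sampling is rarely possible in practice, but it is easy to imitate on a computer. The sketch below assumes a hypothetical log-normal population as a stand-in for right-skewed incomes (the Fancyland data are not used, and `meanlog` and `sdlog` are made-up values). It compares the skewness of individual incomes with the skewness of means of samples of size \(140\).

```r
# Sketch: skewed data, nearly symmetric sample means.
# The log-normal population below is an assumption for illustration only.
set.seed(2)

# Simple moment-based skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Hypothetical right-skewed "income" generator.
draw_incomes <- function(n) rlnorm(n, meanlog = 5, sdlog = 0.8)

n <- 140  # same sample size as in the example

skew_raw     <- skewness(draw_incomes(100000))            # individual incomes
sample_means <- replicate(5000, mean(draw_incomes(n)))    # repeated sample means
skew_means   <- skewness(sample_means)

print(c(raw = skew_raw, means = skew_means))
```

Under these assumptions the skewness of the sample means should be a small fraction of the skewness of the individual incomes, which is the behaviour the CLT describes.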

Mean Annual Household Income (2/2)

Suppose we want to estimate the variance of the sampling distribution of the mean annual income.

Information from the data:

The sample mean is \(\overline{x} = 245\).
The sample standard deviation is \(s = 226\).
The sample size is \(n = 140\).

The unknowns:

We don’t know the population mean \(\mu\) (true mean annual income).
We don’t know the population variance \(\sigma^2\) (true variance of the annual incomes).

Best Estimate:

Since \(\sigma^2\) is unknown, our best estimate for the population variance is \(s^2 = 226^2\), the sample variance.
So, under CLT, \(\displaystyle \frac{s^2}{n} = \frac{226^2}{140} \approx 364.8\) is the variance estimate of the sampling distribution of the sample mean.
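
As a check on the arithmetic, the sketch below uses the sample values given above: \(s^2\) estimates the population variance \(\sigma^2\), and dividing by \(n\) gives the estimated variance of the sampling distribution of the sample mean. The variable names are illustrative.

```r
# Estimated variance of the sampling distribution of the sample mean.
s <- 226   # sample standard deviation (in thousands)
n <- 140   # sample size

var_hat_population <- s^2          # estimate of sigma^2
var_hat_mean       <- s^2 / n      # estimate of Var(X-bar) = sigma^2 / n
se_mean            <- s / sqrt(n)  # its square root, the standard error

print(c(var_hat_population, var_hat_mean, se_mean))
```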

\(\dagger\) Similar to the sample mean, the sample variance \(s^2\) is also an unbiased estimator of the population variance \(\sigma^2\) because \(\text{E}(S^2) = \sigma^2\) over repeated sampling.

Parameter Estimation

The mean annual income \(\overline{x} = 245\) is an estimate for an unknown parameter, but how good is this estimate?

Why CLT matters for parameter estimation:

Assuming CLT, the observed mean annual income \(\overline{x} = 245\) is one draw from an approximately normal sampling distribution.
The standard deviation of this sampling distribution (the standard error) is estimated by \(\displaystyle \frac{s}{\sqrt{n}} = \frac{226}{\sqrt{140}} \approx 19.1\).
A confidence interval uses the sample mean and the standard error to estimate a range: \(245 \pm 1.96 \times 19.1\), or about \(208\) to \(282\), for a \(0.95\) confidence level.
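
The interval can be computed directly from the sample values. The sketch below uses a normal critical value, in keeping with the z-based approach here; a t critical value would give a slightly wider interval. Variable names are illustrative.

```r
# 0.95 confidence interval for the mean, using the normal approximation.
x_bar <- 245   # sample mean (in thousands)
s     <- 226   # sample standard deviation
n     <- 140   # sample size

conf_level <- 0.95
z_star <- qnorm(1 - (1 - conf_level) / 2)  # two-sided critical value, about 1.96
se     <- s / sqrt(n)                      # standard error of the mean

ci <- c(lower = x_bar - z_star * se, upper = x_bar + z_star * se)
print(round(ci, 1))
```

The result is roughly \(208\) to \(282\), an interval centered on the observed mean \(\overline{x} = 245\).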

Sampling distribution of the observation:

\(\star\) The \(0.95\) confidence level is the long-run probability that intervals constructed from repeated samples will contain the true population mean \(\mu\).
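
The long-run interpretation can be illustrated by simulation. The sketch below assumes a hypothetical Exponential population with known mean \(\mu = 1\), which is only possible in a simulation. It builds an interval from each of many samples and records how often the interval contains \(\mu\).

```r
# Sketch: long-run coverage of 0.95 confidence intervals.
# Assumed (hypothetical) population: Exponential(rate = 1), so mu = 1.
set.seed(3)
mu     <- 1
n      <- 140    # sample size
reps   <- 5000   # number of repeated samples (arbitrary choice)
z_star <- qnorm(0.975)

covers <- replicate(reps, {
  x  <- rexp(n, rate = 1)
  se <- sd(x) / sqrt(n)
  # TRUE when the interval around this sample's mean contains mu
  (mean(x) - z_star * se <= mu) && (mu <= mean(x) + z_star * se)
})

coverage <- mean(covers)  # proportion of intervals that contain mu
print(coverage)
```

The proportion should land near \(0.95\). It is typically a little below for skewed populations, because the normal approximation is not exact at finite \(n\).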

Hypothesis Testing (1/3)

The mean annual income \(\overline{x} = 245\) is an observation from the data, but how far is this observation from a null value?

Why it matters for hypothesis testing:

The null hypothesis is a statement about a population parameter that represents a default assumption, which we assume to be true.
Using CLT, we can compute the probability of observing a sample mean at least as extreme as \(\overline{x} = 245\) if the null hypothesis is true.
This probability is called the p-value.

Sampling distribution of the null value:

\(\star\) The p-value answers the question: If the true mean were \(\mu = 225\), how likely is it to observe a sample mean this far (or farther) from \(\mu\) just by random chance?

Hypothesis Testing (2/3)

The null hypothesis:

It states that the population mean is equal to a specified value \(\mu = \mu_0\).
Suppose that the true mean annual income is \(\mu_0 = 225\).
So, \(H_0: \mu = 225\).
In context: Note that we actually don’t know the true mean annual income. We are just assuming prior knowledge that Fancyland is that fancy.

The alternative hypothesis:

Since our observation is \(\overline{x} = 245\), we offer an alternative hypothesis (or claim) that the true mean is greater than the null value \(\mu_0 = 225\).
So, \(H_A: \mu > 225\).
In context: We claim that Fancyland is fancier than originally thought.

The p-value:

To compute the p-value you need to assume CLT, meaning all of its conditions are assumed to hold.
We first compute the z-score, which we call the test statistic, and then the p-value: \[ \begin{aligned} z & = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}\\ & = \frac{245 - 225}{226 / \sqrt{140}} \\ z & \approx 1.047 \end{aligned} \]
The p-value is \(P(Z \ge z \mid H_0) \approx 0.148\), where \(Z\) is an r.v. with the standard normal distribution.

Using R:

# Upper-tail probability P(Z >= 1.047) under the standard normal distribution
1 - pnorm(1.047, 0, 1)

## [1] 0.1475498
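
The same calculation can be done without rounding \(z\) by hand. The sketch below recomputes the test statistic from the sample values and uses the `lower.tail = FALSE` argument of `pnorm` for the upper-tail probability; the variable names are illustrative.

```r
# One-sided z-test for H0: mu = 225 against HA: mu > 225.
x_bar <- 245   # sample mean (in thousands)
mu_0  <- 225   # null value
s     <- 226   # sample standard deviation
n     <- 140   # sample size

z       <- (x_bar - mu_0) / (s / sqrt(n))  # test statistic
p_value <- pnorm(z, lower.tail = FALSE)    # P(Z >= z) under H0

print(c(z = z, p_value = p_value))
```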

\(\star\) Note the conditional probability notation \(P(Z \ge z | H_0)\) for the p-value, which is not the same as \(P(H_0 | Z \ge z)\).

Hypothesis Testing (3/3)

Interpreting the p-value:

There is a \(0.148\) probability of observing a sample mean at least as large as \(\overline{x} = 245\), given that \(H_0\) is true.
Is this enough evidence? There is not sufficient evidence to support \(H_A: \mu > 225\). The p-value is “relatively large”, indicating that the observed result is not unusual under \(H_0: \mu = 225\). The observation we obtained is consistent with random chance (or sampling variability).
In context: There is not sufficient evidence to support our claim that Fancyland is fancier than originally thought. The city of Fancyland remains fancy but not too fancy.

\(\star\) Deciding whether we have sufficient evidence (to “reject” or “fail to reject” the null hypothesis) requires a significance level, a probability that we choose in advance and use to compare with the p-value.
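
The comparison can be written as a short decision rule. The sketch below assumes a significance level of \(0.05\), a common convention that is not specified here, and uses the p-value reported above.

```r
# Decision rule: compare the p-value with a significance level chosen in advance.
alpha   <- 0.05    # significance level (assumed for illustration)
p_value <- 0.148   # p-value from the one-sided z-test above

decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"
print(decision)
```

With these values the p-value exceeds the significance level, so the rule returns "fail to reject H0", matching the interpretation above.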
