Pesticide Effects

Researchers want to know whether exposure to a neonicotinoid pesticide reduces honeybee survival rates.

They take two groups of bees raised under identical conditions:

Control Group \(B\): \(200\) bees with no pesticide exposure, where \(172\) survived; \[\hat{p}_B = \frac{172}{200} = 0.86\]
Treatment Group \(A\): \(200\) bees exposed to the pesticide, where \(140\) survived; \[\hat{p}_A = \frac{140}{200} = 0.70\]

Research questions:

Is there a real difference of proportions between the groups or is it just by chance?
Is the lower survival rate in the treatment group a real biological effect of the pesticide, or just due to sampling variability?

\(\star\) These two questions address the same goal of the experiment with different approaches. The first question relates to confidence intervals, while the second relates to hypothesis testing.

Parameter and Point Estimate

We would like to estimate the difference of proportions between the control group and treatment group.

What are the parameter of interest and the point estimate?

Parameter of interest:

The difference of the true proportions of the treatment and control groups. \[p_B - p_A\]

Point estimate:

Difference of proportion of the sampled treatment and control groups. \[\hat{p}_B - \hat{p}_A\]

Inference of Two Proportions

What is the true difference in proportions between the treatment and control groups?

Confidence Interval:

We can answer this research question using a confidence interval.
In general, confidence intervals are written as \[\text{point estimate} \pm z^* \cdot \text{SE}.\]

Sampling distribution:

Assuming the CLT applies, the sampling distribution of the difference of two sample proportions is approximately normal, centered at the true difference \(p_B - p_A\), which we estimate with the point estimate \(\hat{p}_B - \hat{p}_A\).
The standard error (SE) of a difference of two proportions is determined by approximating each binomial with a normal distribution, which is reasonable as long as the CLT conditions hold.

CLT for Two Proportions

We can use the normal approximation of the Binomial to simplify the sampling distribution of the difference of two sample proportions.

CLT Conditions:

Each sample is independent
Identically distributed observations with fixed population parameters \(p_A\) and \(p_B\)
Both population distributions have finite variances \(p_A(1-p_A)\) and \(p_B(1-p_B)\)
At least \(10\) “success” and \(10\) “failure” for both groups \(A\) and \(B\)

\(\star\) Having \(10\) as the minimum number of “successes” and “failures” is a rule of thumb; the normal approximation improves as the sample sizes grow.
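The success-failure rule of thumb can be checked directly from the counts. The sketch below is illustrative only: the helper name `check_success_failure` and its `min_count` argument are assumptions, not part of the original notes. It uses the survival counts from the bee data (140 and 172 survivors out of 200).

```r
# Hypothetical helper (not from the original notes): returns TRUE when every
# group has at least `min_count` successes and at least `min_count` failures.
check_success_failure <- function(successes, n, min_count = 10) {
  failures <- n - successes
  all(successes >= min_count, failures >= min_count)
}

survived <- c(140, 172)  # survivors in each group
n_bees   <- c(200, 200)  # bees in each group
ok <- check_success_failure(survived, n_bees)
ok
```

The check only covers the success-failure condition; independence still has to be argued from the study design.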

Normal approximation:

The difference of two sample proportions will be nearly normally distributed with: \[ \begin{aligned} \text{mean} & \approx \hat{p}_B - \hat{p}_A \\ \text{variance} & \approx \frac{\hat{p}_A(1-\hat{p}_A)}{n_A} + \frac{\hat{p}_B(1-\hat{p}_B)}{n_B} \end{aligned} \]

Standard error:

The general formula for the standard error of a single sample mean (assuming CLT) is \[SE = \frac{s}{\sqrt{n}},\] so each sample proportion has \(SE_{\hat{p}} = \sqrt{\hat{p}(1-\hat{p})/n}\).
Because the two samples are independent, their variances (not their standard errors) add. So, for two proportions, the standard error is \[ \begin{aligned} SE_{\hat{p}_B - \hat{p}_A}^2 & = SE_{\hat{p}_A}^2 + SE_{\hat{p}_B}^2 \\ SE_{\hat{p}_B - \hat{p}_A} & = \sqrt{\frac{\hat{p}_A \left( 1-\hat{p}_A \right)}{n_A} + \frac{\hat{p}_B \left( 1-\hat{p}_B \right)}{n_B}} \end{aligned} \]
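As a small numerical illustration of the standard error formula, the sketch below computes each group's standard error and combines them by adding variances. The variable names are chosen here for illustration, and the counts (140 of 200 in group \(A\), 172 of 200 in group \(B\)) follow the R code used elsewhere in these notes.

```r
n_A <- 200
n_B <- 200
p_hat_A <- 140 / n_A  # group A sample proportion
p_hat_B <- 172 / n_B  # group B sample proportion

# Standard error of each sample proportion on its own
se_A <- sqrt(p_hat_A * (1 - p_hat_A) / n_A)
se_B <- sqrt(p_hat_B * (1 - p_hat_B) / n_B)

# For independent samples the variances add, so the SE of the difference
# is the square root of the summed variances.
se_diff <- sqrt(se_A^2 + se_B^2)

# Adding the standard errors themselves overstates the uncertainty.
se_sum <- se_A + se_B

c(se_diff = se_diff, se_sum = se_sum)
```

Here `se_diff` is about \(0.041\), the value used for the confidence interval.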

Inferring the True Difference of Two Proportions

The results of the experiment yielded a difference of sample proportions \(\hat{p}_B - \hat{p}_A = \frac{172}{200} - \frac{140}{200} = 0.86 - 0.70 = 0.16\).

Information given:

Estimate using a \(90\)% confidence interval. This is a confidence level of \(0.90\).
Given: \(n_A = 200\), \(\hat{p}_A = \frac{140}{200} = 0.70\), and \(n_B = 200\), \(\hat{p}_B = \frac{172}{200} = 0.86\). First check conditions:
- Independence: The sample is random, with \(200\) bees in each group. We can assume independence here since the population of interest is bees, which is assumed to number in the millions.
- Success-failure: We have \(140\) “successes” and \(60\) “failures” in group \(A\), and \(172\) “successes” and \(28\) “failures” in group \(B\).
The CLT conditions hold. So, we can use a normal approximation of the sampling distribution of \(\hat{p}_B - \hat{p}_A\).

Confidence interval:

For a \(0.90\) confidence level, \(z^* = 1.645\).
The interval is calculated as \[ \begin{aligned} \hat{p}_B - \hat{p}_A & \pm z^* \cdot SE_{\hat{p}_B - \hat{p}_A} \\ \hat{p}_B - \hat{p}_A & \pm z^* \cdot \sqrt{\frac{\hat{p}_A \left( 1-\hat{p}_A \right)}{n_A} + \frac{\hat{p}_B \left( 1-\hat{p}_B \right)}{n_B}} \\ 0.16 & \pm 1.645 \cdot \sqrt{\frac{0.70 \left( 1-0.70 \right)}{200} + \frac{0.86 \left( 1-0.86 \right)}{200}} \end{aligned} \]
The \(90\)% confidence interval is \((0.093,0.227)\).

Using R:

z_star <- qnorm(0.90+((1-0.90)/2),0,1) # critical value
n_A <- 200 # group A sample size
n_B <- 200 # group B sample size
p_hat_A <- 140/n_A # group A sample proportion
p_hat_B <- 172/n_B # group B sample proportion
p_diff <- p_hat_B-p_hat_A # difference in sample proportions (point estimate)
SE_diff <- sqrt(((p_hat_A*(1-p_hat_A))/n_A) + ((p_hat_B*(1-p_hat_B))/n_B)) # standard error
cl_lb <- p_diff - z_star*SE_diff # lower bound
cl_ub <- p_diff + z_star*SE_diff # upper bound
c(cl_lb,cl_ub) # interval as a numeric vector (lower, upper)

## [1] 0.09314525 0.22685475
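As a cross-check, base R's `prop.test()` can produce the same interval. This is only a sketch of one way to verify the hand computation. The argument `correct = FALSE` switches off the Yates continuity correction, which the manual calculation above does not use. The counts are listed as 172 first and 140 second so that the reported difference is \(0.86 - 0.70\).

```r
# Two-sided 90% interval for the difference of two proportions.
# correct = FALSE: no continuity correction, to match the manual formula.
res <- prop.test(x = c(172, 140), n = c(200, 200),
                 conf.level = 0.90, correct = FALSE)

ci <- as.numeric(res$conf.int)  # drop the attributes, keep (lower, upper)
ci
```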

Interpretation of the Confidence Interval

The point estimate for the true difference in proportions of the treatment and control groups is \(\hat{p}_B - \hat{p}_A = 0.86 - 0.70 = 0.16\) with standard error \(SE_{\hat{p}_B - \hat{p}_A} \approx 0.041\). For a \(0.90\) confidence level, \(z^* \approx 1.645\).

\(\star\) This interval answers “Is there a real difference of proportions between the groups or is it just by chance?” because it estimates the difference with some level of uncertainty.

Confidence interval:

We are \(90\)% confident that the true difference in proportions of the treatment and control groups is between \(0.093\) and \(0.227\).
If there were no actual difference, then \(p_B - p_A = 0\), the null value. Since \(0\) is not within the \(90\)% confidence interval, we can say that the difference is statistically significant.

Interpretation:

If we repeated this experiment many times and computed the point estimate and the confidence interval each time, about \(90\)% of the intervals would contain the true difference in proportions \(p_B - p_A\).
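The repeated-experiment interpretation can be illustrated with a small simulation. This is a sketch under stated assumptions: the "true" proportions `p_A_true` and `p_B_true` are hypothetical values chosen for illustration (the real population values are unknown), and the number of simulated experiments and the seed are arbitrary.

```r
set.seed(1)

# Hypothetical true proportions, assumed only for this simulation
p_A_true <- 0.70
p_B_true <- 0.86
true_diff <- p_B_true - p_A_true

n_A <- 200
n_B <- 200
n_sims <- 5000
z_star <- qnorm(0.95)  # critical value for a 90% interval

# Simulate many experiments at once
p_hat_A <- rbinom(n_sims, n_A, p_A_true) / n_A
p_hat_B <- rbinom(n_sims, n_B, p_B_true) / n_B
p_diff  <- p_hat_B - p_hat_A
se_diff <- sqrt(p_hat_A * (1 - p_hat_A) / n_A + p_hat_B * (1 - p_hat_B) / n_B)

# Fraction of intervals that contain the true difference
covered  <- (p_diff - z_star * se_diff <= true_diff) &
            (true_diff <= p_diff + z_star * se_diff)
coverage <- mean(covered)
coverage
```

The resulting coverage should land close to \(0.90\), varying slightly with the seed.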

Summary of Parameter Estimation for Two Proportions

Let there be two independent groups \(A\) and \(B\).

CLT conditions:

Each sample is independent
Identically distributed observations with fixed population parameters \(p_A\) and \(p_B\)
Both population distributions have finite variances \(p_A(1-p_A)\) and \(p_B(1-p_B)\)
The success-failure condition is \(n_Ap_A \ge 10\), \(n_A(1-p_A) \ge 10\), \(n_Bp_B \ge 10\), and \(n_B(1-p_B) \ge 10\)

Sampling distribution of the point estimate:

Confidence interval:

\[\hat{p}_{diff} \pm z^* \cdot SE_{\hat{p}_{diff}}\]

The difference in sample proportions (point estimate) is \[\hat{p}_{diff} = \hat{p}_B - \hat{p}_A\] where \(\hat{p}_A\) and \(\hat{p}_B\) are the sample proportions for groups \(A\) and \(B\), respectively.
The sample sizes for groups \(A\) and \(B\) are \(n_A\) and \(n_B\), respectively.
The standard error is \[SE_{\hat{p}_{diff}} = \sqrt{\frac{\hat{p}_A \left( 1-\hat{p}_A \right)}{n_A} + \frac{\hat{p}_B \left( 1-\hat{p}_B \right)}{n_B}}.\]
The critical z-score for a given confidence level is \(z^*\).
The margin of error is \(ME = z^* \cdot SE_{\hat{p}_{diff}}\).

\(\dagger\) Use the qnorm function in R to compute \(z^*\).
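The summary above can be collected into one reusable function. The sketch below is not from the original notes: the function name `two_prop_ci` and its argument names are assumptions made for illustration, and it does not check the CLT conditions.

```r
# Hypothetical helper: confidence interval for p_B - p_A from success counts.
two_prop_ci <- function(x_A, n_A, x_B, n_B, conf_level = 0.90) {
  p_hat_A <- x_A / n_A
  p_hat_B <- x_B / n_B
  p_diff  <- p_hat_B - p_hat_A                    # point estimate
  z_star  <- qnorm(1 - (1 - conf_level) / 2)      # critical z-score
  se_diff <- sqrt(p_hat_A * (1 - p_hat_A) / n_A +
                  p_hat_B * (1 - p_hat_B) / n_B)  # standard error
  me <- z_star * se_diff                          # margin of error
  c(lower = p_diff - me, upper = p_diff + me)
}

# Bee data: 140 of 200 survived in group A, 172 of 200 in group B
ci_90 <- two_prop_ci(140, 200, 172, 200, conf_level = 0.90)
ci_90
```

Changing `conf_level` (for example to `0.95`) widens or narrows the interval through \(z^*\) only.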

Bee Survival

Now, we explore the answer to the research question “Is the lower survival rate in the treatment group a real biological effect of the pesticide, or just due to sampling variability?”

Data:

This study is an experiment where we try to determine whether pesticides reduce the chance of survival of honeybees.
The point estimate is the difference in sample proportions \(\hat{p}_B - \hat{p}_A = \frac{172}{200} - \frac{140}{200} = 0.16\) with sample sizes of \(n_A=200\) and \(n_B = 200\).
Note that group \(A\) is the treatment group (with pesticide) and group \(B\) is the control group (no pesticide).

Objective:

The sample difference is positive, meaning for a difference \(\hat{p}_B - \hat{p}_A > 0\), group \(B\) has the larger value.
We need to use hypothesis testing to determine whether there is a significant difference in proportions between the treatment and control groups.

Define Hypothesis

Let \(p_B - p_A\) represent the true difference in proportions between the treatment and control groups.

Null hypothesis \(H_0\): There is no difference in proportions between the treatment and control groups (the pesticide has no effect on the survival of the honeybees).

\[p_B - p_A = 0\]

Significance level: A significance level of \(\alpha = 0.10\) is chosen.

\(\dagger\) The significance level \(\alpha=0.10\) is consistent with our earlier analysis with confidence level \(0.90\) because confidence level is \(1-\alpha\).

Alternative hypothesis \(H_A\): There is a positive difference in proportions between the treatment and control groups (the pesticide reduces the survival of the honeybees).

\[p_B - p_A > 0\]

\(\star\) This is a one-tailed test because the \(H_A\) is using the \(>\) sign.

Compute the Test Statistic

The point estimate is the difference in sample proportions \(\hat{p}_B - \hat{p}_A = \frac{172}{200} - \frac{140}{200} = 0.16\).

Test statistic for two proportions:

\[z = \frac{(\hat{p}_B - \hat{p}_A) - 0}{SE_{\hat{p}_B - \hat{p}_A}}\]

The \(0\) term in the test statistic is the null value \(p_0\).
\(SE_{\hat{p}_B - \hat{p}_A} = \sqrt{\hat{p}_{pool}\left( 1-\hat{p}_{pool} \right)\left( \frac{1}{n_A} + \frac{1}{n_B} \right)}\) is the pooled standard error.
\(\hat{p}_{pool} = \frac{n_A \hat{p}_A + n_B \hat{p}_B}{n_A + n_B}\) is the pooled proportion.
\(n_A\) and \(n_B\) are the sample sizes for group \(A\) and \(B\) respectively.

Computing the test statistic:

\[ \begin{aligned} \hat{p}_{pool} & = \frac{n_A \hat{p}_A + n_B \hat{p}_B}{n_A + n_B} \\ & = \frac{140 + 172}{200 + 200} \\ \hat{p}_{pool} & = 0.78 \end{aligned} \]

\[ \begin{aligned} z & = \frac{0.16}{\sqrt{0.78\left( 1-0.78 \right)\left( \frac{1}{200} + \frac{1}{200} \right)}} \\ z & \approx 3.862 \end{aligned} \]

\(\star\) Under the null hypothesis, both groups share a single common proportion. The pooled proportion estimates that common value using all observations from both groups, so the pooled standard error keeps the test statistic consistent with the assumption that the two populations are identical.
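To see how the pooled standard error differs from the unpooled one used for the confidence interval, the following sketch computes both for the bee data. The variable names are chosen here for illustration.

```r
n_A <- 200
n_B <- 200
p_hat_A <- 140 / n_A
p_hat_B <- 172 / n_B

# Unpooled SE: each group keeps its own proportion (used for the interval)
se_unpooled <- sqrt(p_hat_A * (1 - p_hat_A) / n_A +
                    p_hat_B * (1 - p_hat_B) / n_B)

# Pooled SE: one common proportion, as assumed under H0 (used for the test)
p_pool    <- (140 + 172) / (n_A + n_B)
se_pooled <- sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))

c(unpooled = se_unpooled, pooled = se_pooled)
```

For these data the two values are close (roughly \(0.0406\) versus \(0.0414\)), so the choice has little effect on the conclusion here.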

Determine the P-value

Determine the probability associated with the computed test statistic. Remember that this is the probability \(P(Z \ge z|H_0)\), where \(Z\) is a random variable with the standard normal distribution.

Sampling distribution of the null value (normalized):

Using R:

n_A <- 200 # group A sample size
n_B <- 200 # group B sample size
p_A <- 140/n_A # group A sample proportion
p_B <- 172/n_B # group B sample proportion
p_pool <- (140+172)/(n_A+n_B) # pooled proportion
p_diff <- p_B - p_A # sample difference (point estimate)
p_0 <- 0 # null value
SE_pool <- sqrt(p_pool*(1-p_pool)*(1/n_A + 1/n_B)) # pooled standard error
z <- (p_diff-p_0)/SE_pool # test statistic

# p-value
1-pnorm(z,0,1)

## [1] 5.61309e-05

\(\star\) The p-value is the probability \(P(Z \ge z|H_0) \approx 0.000056\) (practically \(0\)). Since this is a one-tailed test, we only use the right tail probability.
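The same right-tailed p-value can be cross-checked with base R's `prop.test()`. This is a sketch of one possible verification, not the method used in these notes. With `correct = FALSE` (no Yates continuity correction), the reported chi-squared statistic equals \(z^2\) for the pooled test, and `alternative = "greater"` tests whether the first proportion listed (172 of 200) exceeds the second (140 of 200).

```r
res <- prop.test(x = c(172, 140), n = c(200, 200),
                 alternative = "greater", correct = FALSE)

z_check <- sqrt(unname(res$statistic))  # |z|; positive here since 0.86 > 0.70
p_check <- res$p.value                  # right-tail p-value

c(z = z_check, p_value = p_check)
```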

Make a Decision and Conclusion

We compare the p-value to our chosen significance level of \(\alpha = 0.10\).

Choices:

If \(\text{p-value} < \alpha\), reject \(H_0\); there is enough evidence to support that there is an actual difference in proportions.
If \(\text{p-value} \ge \alpha\), do not reject \(H_0\); there is not enough evidence to support that there is an actual difference in proportions.

Conclusions:

We have a p-value of \(\approx 0.000056\).
Since \(0.000056 < 0.10\), we reject \(H_0\).

Interpretation of the Hypothesis Test

The hypothesis test concluded that we reject \(H_0\).

Context:

We conducted an experiment to see if pesticides have an effect on the survival of the bees.
We grouped the bees into control (no pesticide) and treatment (with pesticide) groups and computed the proportion that survived.

Interpretation:

Since we rejected the null hypothesis, the observed difference in survival is statistically significant.
There is enough evidence to support our claim that the pesticide reduces honeybee survival.

\(\star\) Note that this is an experiment, albeit a very simple one. So, we can draw a causal conclusion: exposure to the pesticide can cause lower survival rates in honeybees.

Confidence Interval in Relation to Hypothesis Testing

Earlier, we computed a \(90\)% confidence interval around the difference of sample proportions (point estimate) \(\hat{p}_B - \hat{p}_A = \frac{172}{200} - \frac{140}{200} = 0.16\).

Confidence Level:

If we set a significance level \(\alpha = 0.10\), then the corresponding confidence level is \(1-\alpha = 1 - 0.10 = 0.90\).
The critical z-value of a \(0.90\) confidence level is \(z^* = 1.645\).

Confidence Interval:

The \(90\)% confidence interval is \(0.16 \pm 1.645 \cdot 0.041\) or \((0.093,0.227)\).

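\(\star\) The null value of \(0\) is not within the \(90\)% confidence interval, so a two-sided test would reject the null hypothesis at the \(10\)% significance level. Because the whole interval lies above \(0\), the one-sided test with \(H_A: p_B - p_A > 0\) rejects as well.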

Summary of Hypothesis Testing for Two Proportions (1/3)

Let \(p_A\) and \(p_B\) be the population parameters for groups \(A\) and \(B\) respectively and \(p_0\) (difference of two proportions) the null value.

State the Hypotheses:

Null Hypothesis \(H_0\): The difference in population proportions equals the null value. \[p_B - p_A = p_0\]
Alternative Hypothesis \(H_A\): The difference in population proportions differs from the null value. \[p_B - p_A \ne p_0\]

\(\dagger\) The alternative hypothesis can be \(\ne\) (two-sided) and \(<\) or \(>\) (one-sided) depending on context.

\(\dagger\) Usually the null value is \(p_0 = 0\) for the null hypothesis of “no difference” or “no effect”.

Set Significance Level \(\alpha\):

Common values are \(\alpha = 0.10, 0.05, 0.01\).
Note that \(\alpha\) is the Type I error rate.

\(\star\) The significance level has to be set before looking at the p-value.
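The statement that \(\alpha\) is the Type I error rate can be illustrated by simulating experiments in which the null hypothesis is true. This is a sketch under stated assumptions: the shared proportion `p_common`, the number of simulations, and the seed are hypothetical choices made for illustration, and the test is the right-tailed pooled z-test for two proportions.

```r
set.seed(42)

n_A <- 200
n_B <- 200
p_common <- 0.78  # hypothetical shared proportion, so H0 (no difference) is true
alpha <- 0.10
n_sims <- 5000

# Simulate success counts for both groups under H0
x_A <- rbinom(n_sims, n_A, p_common)
x_B <- rbinom(n_sims, n_B, p_common)

# Pooled z-test for each simulated experiment
p_pool  <- (x_A + x_B) / (n_A + n_B)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_A + 1 / n_B))
z <- (x_B / n_B - x_A / n_A) / se_pool
p_values <- pnorm(z, lower.tail = FALSE)  # right-tail p-values

# Fraction of experiments that (wrongly) reject H0
type1_rate <- mean(p_values < alpha)
type1_rate
```

The rejection rate should come out near \(0.10\), up to simulation noise.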

Summary of Hypothesis Testing for Two Proportions (2/3)

Compute the test statistic:

\[z = \frac{\left(\hat{p}_B-\hat{p}_A\right)-p_0}{SE_{\hat{p}_B - \hat{p}_A}}\]

\(\hat{p}_A\) is the point estimate for group \(A\)
\(\hat{p}_B\) is the point estimate for group \(B\)
\(SE_{\hat{p}_B - \hat{p}_A} = \sqrt{\hat{p}_{pool}\left( 1-\hat{p}_{pool} \right)\left( \frac{1}{n_A} + \frac{1}{n_B} \right)}\) is the pooled standard error (appropriate when the null value is \(p_0 = 0\))
\(\hat{p}_{pool} = \frac{n_A \hat{p}_A + n_B \hat{p}_B}{n_A + n_B}\)

Determine the p-value:

If one-sided test:
- Find \(P(Z \le z | H_0)\) for left tail
- Find \(P(Z \ge z | H_0) = 1-P(Z \le z | H_0)\) for right tail
If two-sided test:
- Find \(2 \cdot P(Z \ge |z| \, | H_0) = 2 \cdot (1-P(Z \le |z| \, | H_0))\)
Note that \(Z \sim N(0,1)\) is a random variable with the standard normal distribution.

\(\dagger\) Use the pnorm function in R to compute the p-value.
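The three cases can be written as one small function around `pnorm`. The sketch below is illustrative only: the function name `p_value_z` and the labels for `alternative` are assumptions, not part of the original notes.

```r
# Hypothetical helper: p-value of a z test statistic for each kind of alternative.
p_value_z <- function(z, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  switch(alternative,
         less      = pnorm(z),                               # P(Z <= z), left tail
         greater   = pnorm(z, lower.tail = FALSE),           # P(Z >= z), right tail
         two.sided = 2 * pnorm(abs(z), lower.tail = FALSE))  # both tails
}

# Bee data: z is approximately 3.862 with a right-tailed alternative
p_right <- p_value_z(3.862, "greater")
p_right
```

Using `lower.tail = FALSE` instead of `1 - pnorm(z)` avoids a loss of precision when the p-value is very small.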

Sampling distribution of the null value (left one-tail):

Sampling distribution of the null value (right one-tail):

Sampling distribution of the null value (two-tail):

Summary of Hypothesis Testing for Two Proportions (3/3)

Make a decision and conclusion:

Reject \(H_0\) if the \(\text{p-value} < \alpha\): There is enough evidence to support \(H_A\); the observed difference in sample proportions would be unlikely if the null hypothesis were true.
Fail to reject \(H_0\) if the \(\text{p-value} \ge \alpha\): There is not enough evidence to support \(H_A\); the observed difference in sample proportions is plausible under the null hypothesis.

Important Notes:

\(\star\) If you rejected \(H_0\), it does not mean that \(H_0\) is necessarily false. It means that the observation would be a rare occurrence if it came from the null value’s sampling distribution.

\(\star\) If you failed to reject \(H_0\), it does not mean that \(H_0\) is “accepted”. It means that the observed difference is consistent with sampling variability alone, so the data do not provide enough evidence against \(H_0\).

Inference for Two Proportions

Applied Statistics
