graph TD
A[Population 1] --> B[Sample 1]
A --> C[Sample 2]
B --> D[Mean 1]
C --> E[Mean 2]
D --> F[Difference]
E --> F
\[ \bar{X}_t - \bar{X}_c \]
\[ E(\bar{X}_t - \bar{X}_c) = E(\bar{X}_t) - E(\bar{X}_c) = \mu_t - \mu_c \]
\[ Var(\bar{X}_t - \bar{X}_c) = Var(\bar{X}_t) + (-1)^2Var(\bar{X}_c) = \frac{\sigma_t^2}{n_t} + \frac{\sigma_c^2}{n_c} \]
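These two identities can be checked with a quick simulation (the parameter values below are arbitrary, chosen only for illustration):

```python
import random
import statistics

# Hypothetical parameters -- not from any example in these notes
mu_t, sigma_t, n_t = 5.0, 2.0, 5
mu_c, sigma_c, n_c = 3.0, 1.0, 4

random.seed(42)
reps = 100_000
diffs = []
for _ in range(reps):
    xbar_t = statistics.fmean(random.gauss(mu_t, sigma_t) for _ in range(n_t))
    xbar_c = statistics.fmean(random.gauss(mu_c, sigma_c) for _ in range(n_c))
    diffs.append(xbar_t - xbar_c)

mean_diff = statistics.fmean(diffs)    # ~ mu_t - mu_c = 2.0
var_diff = statistics.variance(diffs)  # ~ sigma_t^2/n_t + sigma_c^2/n_c = 1.05
```

Across many repeated samples, the simulated mean and variance of \(\bar{X}_t - \bar{X}_c\) land close to \(\mu_t - \mu_c\) and \(\sigma_t^2/n_t + \sigma_c^2/n_c\).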
Now we know that:
\[ \bar{X}_t - \bar{X}_c \sim N\left(\mu_t - \mu_c,\ \frac{\sigma_t^2}{n_t} + \frac{\sigma_c^2}{n_c}\right) \]
or that:
\[ \frac{\bar{X}_t - \bar{X}_c - (\mu_t - \mu_c)}{\sqrt{\frac{\sigma_t^2}{n_t} + \frac{\sigma_c^2}{n_c}}} \sim N(0, 1) \]
\[ (\bar{X}_t - \bar{X}_c) \pm z_{\alpha/2}\sqrt{\frac{\sigma_t^2}{n_t} + \frac{\sigma_c^2}{n_c}} \]
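As a sketch, this interval can be computed directly; the summary statistics below are made up for illustration (and assume the population standard deviations are known):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical summary statistics with known population sigmas
xbar_t, sigma_t, n_t = 52.0, 4.0, 40
xbar_c, sigma_c, n_c = 50.0, 3.0, 50

se = sqrt(sigma_t**2 / n_t + sigma_c**2 / n_c)
z = norm.ppf(0.975)              # z_{alpha/2} for a 95% interval
lo = (xbar_t - xbar_c) - z * se
hi = (xbar_t - xbar_c) + z * se  # interval ~ (0.51, 3.49)
```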
The null hypothesis for a two-tailed test is:
\[ H_0: \mu_t - \mu_c = D_0 \]
for some hypothesized difference \(D_0\) (usually 0).
The alternative hypothesis is:
\[ H_1: \mu_t - \mu_c \neq D_0 \]
The null hypothesis for a right-tailed test is:
\[ H_0: \mu_t - \mu_c \leq D_0 \]
The alternative hypothesis is:
\[ H_1: \mu_t - \mu_c > D_0 \]
For a left-tailed test:
\[ H_0: \mu_t - \mu_c \geq D_0 \]
\[ H_1: \mu_t - \mu_c < D_0 \]
The degrees of freedom for a difference in means are:
\[ df = \frac{(\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c})^2}{\frac{(\frac{s_t^2}{n_t})^2}{n_t - 1} + \frac{(\frac{s_c^2}{n_c})^2}{n_c - 1}} \]
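This formula is tedious by hand, so a small helper is handy (the sample values in the call are illustrative):

```python
def welch_df(s1, n1, s2, n2):
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    a = s1**2 / n1
    b = s2**2 / n2
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

df = welch_df(15, 20, 10, 22)  # ~ 32.6
```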
Why? Previously we just had \(n-1\).
The issue comes from adding the variances of the two sampling distributions.
Previously, the difficulty with unknown \(\sigma\) was that the test statistic (the z-score transformation) was the ratio of a normal random variable to (the square root of) a chi-squared random variable.
Now the test statistic is the ratio of a normal random variable to a linear combination of chi-squared random variables (\(k_1 s_1^2 + \dots + k_n s_n^2\)).
In this case, we need an approximation called the Welch-Satterthwaite approximation.
It approximates this kind of weighted sum of sample variances with a scaled chi-squared distribution, which yields the degrees of freedom above.
Note that if the sample sizes and sample variances are equal, the degrees of freedom reduce to \(n_t + n_c - 2\), or \(2n-2\).
\[ H_0: \mu_t - \mu_c = 0 \]
\[ H_1: \mu_t - \mu_c \neq 0 \]
\[ \bar{x}_t - \bar{x}_c = 110 - 120 = -10 \]
\[ \sqrt{\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}} = \sqrt{\frac{15^2}{20} + \frac{10^2}{22}} = 3.97 \]
The degrees of freedom are:
\[ \frac{(\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c})^2}{\frac{(\frac{s_t^2}{n_t})^2}{n_t - 1} + \frac{(\frac{s_c^2}{n_c})^2}{n_c - 1}} \]
\[ \frac{(\frac{15^2}{20} + \frac{10^2}{22})^2}{\frac{(\frac{15^2}{20})^2}{20 - 1} + \frac{(\frac{10^2}{22})^2}{22 - 1}} = 32.635 \]
So we round down and use df = 32.
The critical value is \(t_{0.025,\,32} \approx 2.037\).
\[ \frac{\bar{x}_t - \bar{x}_c}{\sqrt{\frac{s_t^2}{n_t} + \frac{s_c^2}{n_c}}} = \frac{-10}{3.97} = -2.52 \]
Since \(-2.52 < -2.037\), we reject the null hypothesis.
The p-value is:
\[ 2 \cdot P(T_{32} < -2.52) \approx 0.017 \]
The 95% confidence interval for the difference is:
\[ -10 \pm 2.037 \cdot 3.97 = -10 \pm 8.1 \]
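scipy can reproduce this test from the summary statistics alone; here I assume, consistent with the numbers in the example, \(n_t = 20\) and \(n_c = 22\):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the example; equal_var=False gives Welch's test
res = ttest_ind_from_stats(mean1=110, std1=15, nobs1=20,
                           mean2=120, std2=10, nobs2=22,
                           equal_var=False)
print(res.statistic)  # ~ -2.52
print(res.pvalue)     # ~ 0.017, so we reject H0 at the 5% level
```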
A separate 95% confidence interval for each group mean gives:
\[ 120 \pm 2.080 \cdot \frac{10}{\sqrt{22}} = 120 \pm 4.4 \]
\[ 110 \pm 2.093 \cdot \frac{15}{\sqrt{20}} = 110 \pm 7.0 \]
The individual confidence intervals overlap, yet the difference in means is still significant: overlapping one-sample intervals do not imply that a difference is insignificant.
Why Worm Infections Reduce School Participation
Fatigue & anemia → Kids too weak to attend.
Stomach pain & diarrhea → Frequent absences.
Cognitive effects → Harder to focus & learn.
Community spread → More infections = fewer kids in school.
Long-Term Impacts (Follow-up Studies)
📈 Higher earnings & employment in adulthood.
📈 Treated children moved into higher-paying sectors.
Policy Takeaways
✅ Mass deworming is a highly cost-effective intervention.
✅ Health interventions can improve education outcomes.
✅ Spillover effects justify government & NGO funding for free treatment.
Health & Spillover Effects
✅ Infection rates dropped significantly in treated schools.
✅ Untreated children in treatment schools & nearby schools also benefited.
Education Outcomes
📉 Absenteeism fell by 25% in treated schools.
📉 Younger children saw the biggest gains in attendance.
❌ No significant improvement in test scores.
Cost-Effectiveness
💰 Cost per extra year of school: $3.50
💰 Deworming cheaper than school meals or cash transfers.
{.width=200%}
{.width=200%}
A state government wants to evaluate two different job training programs for unemployed workers. They randomly assigned participants to either Program A or Program B, and then tracked their employment outcomes over 6 months.
To control for individual characteristics, they matched participants across the programs based on factors like age, education level, and prior work experience, with each participant in Program A having a matched counterpart in Program B. The government wants to determine whether there is a difference in mean employment duration between the two programs. Use a 0.05 level of significance to analyze the data shown in the next slide.
Employment success (months):

| Participant Pair | Program A | Program B | Difference |
|---|---|---|---|
| Pair 1 | 32 | 25 | 7 |
| Pair 2 | 30 | 24 | 6 |
| Pair 3 | 19 | 15 | 4 |
| Pair 4 | 16 | 15 | 1 |
| Pair 5 | 15 | 13 | 2 |
| Pair 6 | 18 | 15 | 3 |
| Pair 7 | 14 | 15 | -1 |
| Pair 8 | 10 | 8 | 2 |
| Pair 9 | 7 | 9 | -2 |
| Pair 10 | 16 | 11 | 5 |
In this case, we calculate the mean of the paired differences:
\[ \bar{d} = \frac{\sum d_i}{n} = \frac{7 + 6 + 4 + 1 + 2 + 3 - 1 + 2 - 2 + 5}{10} = 2.7 \]
The standard deviation of the difference is:
\[ s = \sqrt{\frac{\sum{(d_i - \bar{d})^2}}{n-1}} = 2.9 \]
The hypotheses are:
\[ H_0: \mu_d = 0 \]
\[ H_1: \mu_d \neq 0 \]
The level of significance is 0.05.
The test statistic is:
\[ \frac{\bar{d} - \mu_d}{s/\sqrt{n}} = \frac{2.7-0}{2.9/\sqrt{10}} = 2.94 \]
The critical value is \(t_{0.025,\,9} = 2.262\).
Since \(2.94 > 2.262\), we reject the null hypothesis.
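This matches scipy's paired t-test on the raw data (tiny differences come from rounding \(s\) to 2.9 above):

```python
from scipy.stats import ttest_rel

program_a = [32, 30, 19, 16, 15, 18, 14, 10, 7, 16]
program_b = [25, 24, 15, 15, 13, 15, 15, 8, 9, 11]

# Paired (matched-sample) two-sided t-test
res = ttest_rel(program_a, program_b)
print(res.statistic)  # ~ 2.94
print(res.pvalue)     # < 0.05, so we reject H0 at the 5% level
```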
The normal approximation requires:
\[ n_1p_1 \geq 5 \]
\[ n_1(1-p_1) \geq 5 \]
\[ n_2p_2 \geq 5 \]
\[ n_2(1-p_2) \geq 5 \]
\[ (p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
The null hypothesis for a two-tailed test is:
\[ H_0: p_1 - p_2 = D_0 \]
The alternative hypothesis is:
\[ H_1: p_1 - p_2 \neq D_0 \]
For a right-tailed test:
\[ H_0: p_1 - p_2 \leq D_0 \]
\[ H_1: p_1 - p_2 > D_0 \]
For a left-tailed test:
\[ H_0: p_1 - p_2 \geq D_0 \]
\[ H_1: p_1 - p_2 < D_0 \]
Under the null hypothesis that \(p_1 = p_2\) (i.e., \(D_0 = 0\)), the standard error is:
\[ \sqrt{p(1-p)(\frac{1}{n_1} + \frac{1}{n_2})} \]
And the pooled estimator \(\bar{p}\) is a weighted average of the two sample proportions, weighted by sample size:
\[ \bar{p} = \frac{n_1p_1 + n_2p_2}{n_1 + n_2} \]
Giving us a test statistic:
\[ \frac{p_1 - p_2}{\sqrt{\bar{p}(1-\bar{p})(\frac{1}{n_1} + \frac{1}{n_2})}} \]
Market Research Associates is conducting research to evaluate the effectiveness of a client’s new advertising campaign. Before the new campaign began, a telephone survey of 150 households in the test market area showed 60 households “aware” of the client’s product (0.4). The new campaign has been initiated with TV and newspaper advertisements running for three weeks.
A survey conducted immediately after the new campaign showed 120 of 250 households “aware” of the client’s product (0.48). Can we conclude, using a 0.05 level of significance, that the proportion of households aware of the client’s product increased after the new advertising campaign?
The hypotheses are:
\[ H_0: p_1 - p_2 \leq 0 \]
\[ H_1: p_1 - p_2 > 0 \]
The pooled proportion is:
\[ \bar{p}=\frac{250(.48)+150(.40)}{250+150}=\frac{180}{400}=.45 \]
The pooled standard error is:
\[ s_{\bar{p}_1-\bar{p}_2}=\sqrt{.45(.55)\left(\frac{1}{250}+\frac{1}{150}\right)}=.0514 \]
The test statistic is:
\[ z=\frac{(.48-.40)}{.0514}=\frac{.08}{.0514}=1.56 \]
The p-value would be defined by:
\[ P(Z > 1.56) = 1 - P(Z < 1.56) = 1 - .9406 = .0594 \]
Since the p-value is greater than 0.05, we fail to reject the null hypothesis.
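The whole calculation can be scripted with the pooled formulas above, using scipy's normal distribution for the p-value:

```python
from math import sqrt
from scipy.stats import norm

# Sample data from the example
x1, n1 = 120, 250  # after the campaign: 120 of 250 households aware
x2, n2 = 60, 150   # before the campaign: 60 of 150 households aware

p1, p2 = x1 / n1, x2 / n2
p_bar = (x1 + x2) / (n1 + n2)                   # pooled proportion = 0.45
se = sqrt(p_bar * (1 - p_bar) * (1/n1 + 1/n2))  # pooled standard error
z = (p1 - p2) / se                              # ~ 1.56
p_value = norm.sf(z)                            # right-tailed p ~ 0.059
```

Since the p-value exceeds 0.05, the code reaches the same conclusion: fail to reject the null hypothesis.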