Hypothesis Testing

Aleksandr Michuda

Agenda

  • Hypothesis Testing
  • Type-I and Type-II errors
  • P-Values
  • Type-II Error and Power

Definitions

  • Hypothesis testing can be used to determine whether a statement about the value of a population parameter should or should not be rejected.
  • The null hypothesis, denoted by \(H_0\) , is a tentative assumption about a population parameter.
  • The alternative hypothesis, denoted by \(H_a\), is the opposite of what is stated in the null hypothesis.
  • The hypothesis testing procedure uses data from a sample to test the two competing statements indicated by \(H_0\) and \(H_a\).

Developing Hypotheses

  • It is not always obvious how the null and alternative hypotheses should be formulated.
  • Care must be taken to structure the hypotheses appropriately so that the test conclusion provides the information the researcher wants.
  • The context of the situation is very important in determining how the hypotheses should be stated.
  • In some cases it is easier to identify the alternative hypothesis first. In other cases the null is easier.

Developing Hypotheses

Alternative Hypothesis as a Research Hypothesis

  • Many applications of hypothesis testing involve an attempt to gather evidence in support of a research hypothesis.
  • In such cases, it is often best to begin with the alternative hypothesis and make it the conclusion that the researcher hopes to support.
  • The conclusion that the research hypothesis is true is made if the sample data provides sufficient evidence to show that the null hypothesis can be rejected.

Example: A new teaching method is developed that is believed to be better than the current method.

Null Hypothesis: The new method is no better than the old method.

Alternative Hypothesis: The new teaching method is better.

Developing Null and Alternative Hypotheses

Example: A new bonus program in Uber increases the number of rides.

Null Hypothesis: The new bonus program does not increase the number of rides.

Alternative Hypothesis: The new bonus program increases the number of rides.

Example: Coke and Pepsi are substitutes.

Null Hypothesis: ?

Alternative Hypothesis: ?

The Null Hypothesis

Null Hypothesis as an Assumption to be Challenged

  • We might begin with a belief or assumption that a statement about the value of a population parameter is true.
  • We then use a hypothesis test to challenge the assumption and determine if there is statistical evidence to conclude that the assumption is incorrect.
  • In these situations, it is helpful to develop the null hypothesis first.

Example: The label on a soft drink bottle states that it contains 67.6 fluid ounces.

\(H_0\): \(\mu \geq 67.6\)

\(H_a\): \(\mu < 67.6\)

The Null Hypothesis and Different tailed tests

  • Equality is always in the null hypothesis

  • One tailed test, lower tail: \(H_0: \mu \geq \mu_0\) vs \(H_a: \mu < \mu_0\)

  • One tailed test, upper tail: \(H_0: \mu \leq \mu_0\) vs \(H_a: \mu > \mu_0\)

  • Two-tailed Test: \(H_0: \mu = \mu_0\) vs \(H_a: \mu \neq \mu_0\)

Example

A major west coast city provides one of the most comprehensive emergency medical services in the world. Operating in a multiple hospital system with approximately 20 mobile medical units, the service goal is to respond to medical emergencies with a mean time of 12 minutes or less.

The director of medical services wants to formulate a hypothesis test that could use a sample of emergency response times to determine whether or not the service goal of 12 minutes or less is being achieved.

Example

\(H_0\): \(\mu \leq 12\)

\(H_a\): \(\mu > 12\)

for \(\mu\) the population mean response time for all emergencies.

Type-I Error

  • Because hypothesis tests are based on sample data, we must allow for the possibility of errors.
  • A Type I error is rejecting \(H_0\) when it is true.
  • A false positive is another term for a Type I error.
  • The probability of making a Type I error when the null hypothesis is true as an equality is called the level of significance.
  • Applications of hypothesis testing that only control for the Type I error are often called significance tests.

Type-II Error

  • A Type II error is accepting \(H_0\) when it is false.
  • It is difficult to control for the probability of making a Type II error.
  • A false negative is another term for a Type II error.
  • Statisticians avoid the risk of making a Type II error by using “do not reject \(H_0\)” rather than “accept \(H_0\)”.
  • But implicitly, the probability of making a Type II error is still controlled by the sample size and the level of significance, even though we rarely frame it that way.

A Table of Type-I and Type-II Errors

| Decision | \(H_0\) is True | \(H_0\) is False |
|---|---|---|
| Do not reject \(H_0\) | Correct Decision | Type II Error |
| Reject \(H_0\) | Type I Error | Correct Decision |

Type-I and Type-II Errors

  • Both are important, but econometrics often cares about Type-I errors.
  • Much of what we will be talking about for the rest of the lecture will be about controlling for Type-I errors.
  • Type-II errors are important, but are often more difficult to control for and need to be thought through BEFORE the data is collected.
  • It’s also about the cost of making a mistake.
  • Why might committing a Type-I error be more costly than a Type-II error?

P-values for one-tailed tests

  • The p-value is the probability, computed using the test statistic, that measures the support (or lack of support) provided by the sample for the null hypothesis.
  • If the p-value is less than or equal to the level of significance α, the value of the test statistic is in the rejection region.
  • Reject \(H_0\) if the p-value \(\leq \alpha\).
  • Conditional on the null hypothesis being true, the p-value tells us how rare our estimate is. If it is sufficiently rare, it is unlikely to have come from a sampling distribution centered at the null value, so we reject the null hypothesis.
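The decision rule can be sketched in a few lines of Python (not part of the original slides); the standard library's `NormalDist` supplies the standard normal CDF, and the test statistic here is a made-up illustrative value.

```python
from statistics import NormalDist  # standard normal distribution, Python 3.8+

def upper_tail_p_value(z_test: float) -> float:
    """P(Z >= z_test) under a standard normal null distribution."""
    return 1 - NormalDist().cdf(z_test)

alpha = 0.05
z_test = 2.3                    # hypothetical test statistic
p = upper_tail_p_value(z_test)  # about 0.011
reject = p <= alpha             # True: reject H0 at the 5% level
```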

Suggested Guidelines in Social Science for P-values

  • Less than 0.01: Overwhelming evidence to conclude \(H_a\) is true.
  • Between 0.01 and 0.05: Strong evidence to conclude \(H_a\) is true.
  • Between 0.05 and 0.10: Weak evidence to conclude \(H_a\) is true.
  • Greater than 0.10: Insufficient evidence to conclude \(H_a\) is true.

Where did these numbers come from?

  • Convention
  • The statistician R.A. Fisher suggested that 0.05 was a good number to use.
  • In his 1925 book, Statistical Methods for Research Workers, Fisher proposed this threshold as a convenient cutoff for determining statistical significance.
  • He noted that a p-value of 0.05 corresponds to deviations approximately twice the standard deviation in a normal distribution, making it a practical benchmark for researchers.
  • Empirical Rule!

Lower-tailed Tests

Upper-tailed Tests

Hypothesis Tests

We can either calculate the test statistic and check whether it falls in the rejection region, or calculate the p-value and check whether it is less than the level of significance.

Critical Value Approach

The test statistic \(z\) has a standard normal probability distribution.

We can use the standard normal probability distribution table to find the z-value with an area of \(\alpha\) in the lower (or upper) tail of the distribution.

The value of the test statistic that establishes the boundary of the rejection region is called the critical value for the test.

The rejection rule is:

  • Lower tail: Reject \(H_0\) if \(z < -z_{\alpha}\)
  • Upper tail: Reject \(H_0\) if \(z > z_{\alpha}\)

Example

For \(\bar{x} = 12.5\), a known \(\sigma = 2.5\), \(n = 25\), and \(\alpha = 0.05\), calculate the test statistic for the null hypothesis that the population mean is less than or equal to 12.

\[ H_0: \mu \leq 12 \]

\[ H_a: \mu > 12 \]

\[ z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}} \]

\[ z_{test} = \frac{12.5 - 12}{\frac{2.5}{\sqrt{25}}} = 1 \]

Example

For an \(\alpha\) of 0.05, test the null hypothesis that the population mean is less than or equal to 12.

For this you need the critical value. The critical value, \(z_{\alpha}\) solves the equation:

\[ P(Z \geq z_{\alpha}) = \alpha \]

\[ P(Z \geq 1.645) = 0.05 \]

Our test is then:

\[ z_{test} > 1.645 == False \]

Thus we fail to reject the null hypothesis.
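As a quick check of this example (an illustrative sketch using the Python standard library, not part of the slides):

```python
from math import sqrt
from statistics import NormalDist

# Upper-tailed test: H0: mu <= 12 vs Ha: mu > 12
x_bar, mu_0, sigma, n, alpha = 12.5, 12, 2.5, 25, 0.05

z_test = (x_bar - mu_0) / (sigma / sqrt(n))   # 1.0
z_crit = NormalDist().inv_cdf(1 - alpha)      # about 1.645

reject = z_test > z_crit                      # False: fail to reject H0
```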

P-value Approach

For the p-value approach, you state the same null and alternative hypotheses, but instead of finding a critical value, you compute the probability of observing a test statistic at least as extreme as the one you got.

Then you reject if

\[ p \leq \alpha \]

Example

For the previous example, find the p-value and test the null hypothesis that the population mean is less than or equal to 12.

The p-value in this example solves the equation:

\[ P( Z \geq z_{test}) = 1 - P(Z \leq 1) = 1 - 0.8413 = 0.1587 \]

Since \(0.1587 > 0.05\), we fail to reject the null hypothesis.
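The same p-value computation, as a standard-library sketch:

```python
from statistics import NormalDist

z_test = 1.0
p_value = 1 - NormalDist().cdf(z_test)   # about 0.1587
reject = p_value <= 0.05                 # False: fail to reject H0
```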

Lower-tailed Tests

The response times for a random sample of 40 medical emergencies were tabulated. The sample mean is 10 minutes. The population standard deviation is believed to be 3.2 minutes.

The EMS director wants to perform a hypothesis test, with a .05 level of significance, to determine whether the service goal of 12 minutes or less is being achieved.

The Hypotheses are:

\[ H_0: \mu \geq 12 \]

\[ H_a: \mu < 12 \]

Lower-tailed Tests

Critical Value Approach:

\[ z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}} \]

\[ z = \frac{10 - 12}{\frac{3.2}{\sqrt{40}}} = -3.95 \]

The critical value in this case is: -1.645

We reject the null if:

\[ z < -1.645 == True \]

So reject the null hypothesis.

P-value Approach:

\[ P(Z \leq -3.95) = 0.00004 \]

We reject if:

\[ p \leq 0.05 == True \]
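Both approaches to this lower-tailed test can be verified together (an illustrative sketch using the Python standard library):

```python
from math import sqrt
from statistics import NormalDist

# Lower-tailed test: H0: mu >= 12 vs Ha: mu < 12
x_bar, mu_0, sigma, n, alpha = 10, 12, 3.2, 40, 0.05

z_test = (x_bar - mu_0) / (sigma / sqrt(n))   # about -3.95
z_crit = NormalDist().inv_cdf(alpha)          # about -1.645
p_value = NormalDist().cdf(z_test)            # about 0.00004

reject_by_critical_value = z_test < z_crit    # True
reject_by_p_value = p_value <= alpha          # True
```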

Two-tailed Tests

  • The null hypothesis is that the population mean is equal to a specific value
  • The alternative hypothesis is that the population mean is not equal to that value
  • Since it can be on either side of the distribution, we divide alpha by 2 and find the critical values for the test statistic that are symmetric around the mean.

Two-tailed Tests

We reject if:

\[ z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2} \]

Example

A bonus program in Uber is being tested to understand whether it increases the number of rides in NYC. The company wants to test whether it changes the number of rides. After collecting data on 100 drivers, they find that drivers who received the bonus program increased their rides by 30 on average. Assume that Uber knows the population standard deviation and that it is 100. Did this bonus program have an impact (negative or positive) at the 5% level?

The hypotheses are:

\[ H_0: \mu = 0 \]

\[ H_a: \mu \neq 0 \]

Critical Value Approach:

\[ z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}} \]

\[ z = \frac{30 - 0}{\frac{100}{\sqrt{100}}} = 3 \]

To find the critical values, we need to find an alpha of 0.025 in the upper and lower tails.

\[ P(Z \geq z_{\alpha/2}) = 0.025 \]

\[ P(Z \geq 1.96) = 0.025 \]

We reject if:

\[ 3 < -1.96 \text{ or } 3 > 1.96 \]

In this case, we reject the null.

P-value Approach

The p-value in this case is a bit different. Since the test is two-tailed, our p-value will be \(2\cdot P(Z \geq 3)\)

\[ P(Z \geq 3) = 0.0013 \]

So the p-value is: 0.0026

We reject if:

\[ p \leq 0.05 == True \]
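The Uber example can be checked in Python (a standard-library sketch; the slides round \(P(Z \geq 3)\) to 0.0013 first, which gives 0.0026 rather than the unrounded 0.0027):

```python
from math import sqrt
from statistics import NormalDist

# Two-tailed test: H0: mu = 0 vs Ha: mu != 0
x_bar, mu_0, sigma, n, alpha = 30, 0, 100, 100, 0.05

z_test = (x_bar - mu_0) / (sigma / sqrt(n))        # 3.0
z_crit = NormalDist().inv_cdf(1 - alpha / 2)       # about 1.96
p_value = 2 * (1 - NormalDist().cdf(abs(z_test)))  # about 0.0027

reject = abs(z_test) > z_crit                      # True: reject H0
```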

Two-tailed Tests

  • Two-tailed tests are used very often in econometrics.
  • The regression table p-values and test-statistics you see are two-tailed tests.
  • It is useful when you don’t know a priori which direction the effect will go.

Confidence Intervals for a Two-tailed Test

If the confidence interval contains the hypothesized value \(\mu_0\), do not reject \(H_0\). Otherwise, reject \(H_0\).

Example

For the Uber problem above, construct a 95% confidence interval for the impact of the bonus program.

\[ CI = \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

\[ CI = 30 \pm 1.96 \cdot \frac{100}{\sqrt{100}} \]

\[ CI = 30 \pm 19.6 \]

Since 0 (our null hypothesis) is not in the confidence interval, we reject the null hypothesis.
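The confidence-interval check translates directly (an illustrative standard-library sketch):

```python
from math import sqrt
from statistics import NormalDist

x_bar, sigma, n, alpha = 30, 100, 100, 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96
margin = z * sigma / sqrt(n)                   # about 19.6
lower, upper = x_bar - margin, x_bar + margin  # about (10.4, 49.6)

contains_null = lower <= 0 <= upper            # False, so reject H0
```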

Hypothesis Tests with Unknown \(\sigma\)

  • As before, when the population standard deviation is unknown, we estimate it with the sample standard deviation and use the t-distribution.

The test statistic is:

\[ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \]

where \(s\) is the sample standard deviation.

Example

A State Highway Patrol periodically samples vehicle speeds at various locations on a particular roadway. The sample of vehicle speeds is used to test the hypothesis \(H_0: \mu \leq 65\).

The locations where \(H_0\) is rejected are deemed the best locations for radar traps. At Location F, a sample of 64 vehicles shows a mean speed of 66.2 mph with a sample standard deviation of 4.2 mph. Use \(\alpha = 0.05\) to test the hypothesis.

Example

The hypotheses are:

\[ H_0: \mu \leq 65 \]

\[ H_a: \mu > 65 \]

Test Statistic Approach:

\[ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \]

\[ t = \frac{66.2 - 65}{\frac{4.2}{\sqrt{64}}} = 2.286 \]

Example

Since we are using the t-distribution, we need to supply the degrees of freedom!

For \(\alpha=0.05\) and \(df=64-1=63\), we find the critical value:

\[ t_{\alpha} = 1.67 \]

So we reject if:

\[ t > 1.67 == True \]

P-value Approach:

\[ P(T \geq 2.286) = 0.013 \]

We reject if:

\[ p \leq 0.05 == True \]
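The radar-trap example can be checked with SciPy, assuming it is available (the standard library has no t-distribution); this is an illustrative sketch, not part of the slides.

```python
from math import sqrt
from scipy import stats  # assumed available; scipy.stats.t is the t-distribution

# Upper-tailed t-test: H0: mu <= 65 vs Ha: mu > 65
x_bar, mu_0, s, n, alpha = 66.2, 65, 4.2, 64, 0.05
df = n - 1

t_test = (x_bar - mu_0) / (s / sqrt(n))   # about 2.286
t_crit = stats.t.ppf(1 - alpha, df)       # about 1.67
p_value = stats.t.sf(t_test, df)          # about 0.013

reject = t_test > t_crit                  # True: reject H0
```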

Tests with Population Proportions

Since we can use the normal distribution to approximate the binomial distribution, we can use the normal distribution to test hypotheses about population proportions.

The test statistic is:

\[ z = \frac{\bar{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \]

Notice that the standard deviation is \(\sqrt{\frac{p(1-p)}{n}}\).

We use the hypothesized value in the standard deviation! Why?

In this case, \(\sigma\) is known in a sense: under the null hypothesis the population proportion is \(p\), and the variability of the sample proportion depends only on \(p\) and \(n\).

Example

For a Christmas and New Year’s week, the National Safety Council estimated that 500 people would be killed and 25,000 injured on the nation’s roads. The NSC claimed that 50% of the accidents would be caused by drunk driving.

A sample of 120 accidents showed that 67 were caused by drunk driving. Use these data to test the NSC’s claim with \(\alpha\) = 0.05.

Example

The hypotheses are:

\[ H_0: p=.5 \]

\[ H_a: p \neq .5 \]

The test statistic is:

\[ z = \frac{(67/120) - .5}{\sqrt{\frac{.5(1-.5)}{120}}}=1.28 \]

The critical value is: 1.96

Since 1.28 is not less than -1.96 and not more than 1.96, we fail to reject the null hypothesis.
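The NSC example as a quick check (an illustrative standard-library sketch):

```python
from math import sqrt
from statistics import NormalDist

# Two-tailed test on a proportion: H0: p = 0.5 vs Ha: p != 0.5
p_0, x, n, alpha = 0.5, 67, 120, 0.05

p_bar = x / n                                       # about 0.558
z_test = (p_bar - p_0) / sqrt(p_0 * (1 - p_0) / n)  # about 1.28
z_crit = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96

reject = abs(z_test) > z_crit                       # False: fail to reject H0
```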

Multiple Hypothesis Tests

  • Oftentimes in research, you’ll want to test multiple hypotheses to understand the impact of different variables.
  • This can lead to a problem of multiple hypothesis testing.
  • The more hypotheses you test, the more likely you are to find a significant result by chance.

Multiple Hypothesis Tests

  • Even with a small \(\alpha\), say 0.05, you will eventually find a significant result purely by chance.
  • Let’s say you are testing \(M\) hypotheses, with an \(\alpha\) of 0.05.
  • Let’s say that each hypothesis is independent of each other.
  • Remember that \(\alpha\) is the Type-I error rate: the probability of rejecting the null hypothesis when it is true.
  • The probability of finding at least one significant result is:

\[ Pr(\text{At least one significant result with true null hypothesis}) = 1- Pr(\text{No significant results with true null hypothesis}) \]

\[ 1 - (1 - \alpha)^M \]

As we increase M, look at the table to see how this probability changes:

| M | Probability |
|---|---|
| 1 | 0.05 |
| 2 | 0.0975 |
| 5 | 0.226 |
| 10 | 0.401 |
| 20 | 0.641 |
| 50 | 0.923 |

So even with just 5 tests, we have about a 23% chance!
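The table can be reproduced with the formula above (an illustrative sketch, not part of the slides):

```python
alpha = 0.05

def prob_at_least_one_false_positive(m: int, alpha: float = alpha) -> float:
    """Chance of at least one Type-I error across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 2, 5, 10, 20, 50):
    print(m, round(prob_at_least_one_false_positive(m), 3))
```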

Multiple Hypothesis Tests

  • How can we deal with this issue?
  • There is a large literature of dealing with multiple hypothesis tests.
  • The simplest one is known as the Bonferroni correction.

Bonferroni Correction

  • The Bonferroni correction is a simple way to control for multiple hypothesis tests.
  • It is a very conservative method.
  • It divides the \(\alpha\) by the number of tests.

\[ \alpha_{\text{new}} = \frac{\alpha}{M} \]

This is derived from the fact that if we want to set an experiment-wide Type-I error rate, \(\pi\), we can use the approximation:

\[ (1-\alpha)^M \approx 1 - M\alpha \]

so the experiment-wide error rate is \(1 - (1-\alpha)^M \approx M\alpha\). Setting \(\alpha\) to \(\pi/M\) therefore gives an experiment-wide Type-I error of approximately \(\pi\).

So at an \(\alpha\) of 0.05 and 5 tests, we would set \(\alpha\) to 0.01.
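A quick numerical check of the correction (an illustrative sketch):

```python
alpha, m = 0.05, 5
alpha_new = alpha / m                  # 0.01 per test (Bonferroni correction)

# Experiment-wide Type-I error rate at the corrected threshold:
familywise = 1 - (1 - alpha_new) ** m  # about 0.049, just under 0.05
```

The exact family-wise rate comes out slightly below 0.05, which is why the Bonferroni correction is described as conservative.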