Potential Outcomes

Author

Aleksandr Michuda

Agenda

  • Experimental Design
  • Potential Outcomes in Randomized Designs
  • Heavily influenced by Section 4.1 of Causal Inference: The Mixtape by Scott Cunningham.

Experimental Design

  • Statistical studies can be classified as being experimental or observational.
  • In an experiment, the researcher attempts to control one or more factors so that the effect of a treatment can be studied.
  • In an observational study, the researcher simply observes the relationship between two or more variables.
  • Causal relationships are easier to establish in experimental studies.
  • Analysis of Variance (ANOVA) can be used to analyze the data obtained from an experimental or observational study.

Potential Outcomes

  • Why does randomization help?
  • What is the difference between an experimental design and an observational study, really?
  • The potential outcomes framework is a way to think about these questions.

Actual Outcomes

  • For simplicity, we will assume a binary variable that takes on a value of 1 if a particular unit receives the treatment and a 0 if it does not.
  • Each unit will have two potential outcomes, but only one observed outcome. Potential outcomes are defined as \(Y_i^1\) if the unit received the treatment and \(Y_i^0\) if it did not.
    • Notice that both potential outcomes carry the same subscript \(i\): they refer to the same unit.
    • They are two separate states of the world for the exact same unit at the exact same moment in time.
    • The state of the world where the treatment occurred is the treatment state (\(Y^1\)); the state of the world where the treatment did not occur is the control state (\(Y^0\)).
  • An observed outcome is just \(Y_i\); they are the actual, realized data.

The Switching Equation

  • The switching equation is a way to think about the difference between the potential outcomes and the observed outcomes.
  • The switching equation is defined as:

\[ Y_i = D_iY_i^1 + (1-D_i)Y_i^0 \]

where \(D_i\) is a binary variable that takes on a value of 1 if the unit received the treatment and 0 if it did not.
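The switching equation can be made concrete with a minimal Python sketch (the numbers are purely illustrative, not from the slides):

```python
# Hypothetical potential outcomes for three units (illustrative numbers only)
y1 = [7, 5, 5]   # potential outcomes under treatment, Y_i^1
y0 = [1, 6, 1]   # potential outcomes under control,  Y_i^0
d  = [1, 0, 1]   # treatment assignment, D_i

# The switching equation: Y_i = D_i * Y_i^1 + (1 - D_i) * Y_i^0
# D_i "switches on" whichever potential outcome is actually realized.
y = [di * y1i + (1 - di) * y0i for di, y1i, y0i in zip(d, y1, y0)]
print(y)  # [7, 6, 5]
```

Note that the data we ever get to see is just `y` and `d`; the full lists `y1` and `y0` exist only in this simulation.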

The Fundamental Problem of Causal Inference

  • The fundamental problem of causal inference is that we can only ever observe one of the two potential outcomes for each unit.
  • What we would ideally like is the treatment effect for each unit, defined as:

\[ \delta_i = Y_i^1 - Y_i^0 \]

But we only ever observe one of the two terms on the right-hand side, so \(\delta_i\) can never be computed directly.

Average Treatment Effect

From this definition, we get three interesting quantities:

  • The average treatment effect (ATE) is defined as:

\[ \delta = E[Y^1 - Y^0] = E[Y^1] - E[Y^0] \]

  • The average treatment effect on the treated (ATT) is defined as:

\[ \delta_{ATT} = E[Y^1 - Y^0 | D = 1] \]

\[ = E[Y^1 | D = 1] - E[Y^0 | D = 1] \]

  • The average treatment effect on the untreated (ATU) is defined as:

\[ \delta_{ATU} = E[Y^1 - Y^0 | D = 0] \]

\[ = E[Y^1 | D = 0] - E[Y^0 | D = 0] \]

All three of these quantities are population means and are fundamentally unobservable.

The ATT

  • The ATT is the average treatment effect for those who actually received the treatment.
  • The ATT is almost always different from the ATE because people will sort into a treatment based on whether it will help them.
  • Notice that we need two observations for each unit to calculate the ATT.

The ATU

  • Similar to the ATT, the ATU is the average treatment effect for those who did not receive the treatment.
  • We also need two observations on each unit to calculate the ATU.
  • \(ATT \neq ATU\) in general, because there is heterogeneity in treatment effects
    • The treatment effect is different for different people

Potential Outcomes as Difference in Means

  • Suppose we could observe the potential outcomes of five patients choosing between surgery and chemo, measured as post-treatment lifespan in years.
  • In that case we would just have a matched sample of the potential outcomes.

| Patient | Surgery (\(Y^1\)) | No Surgery (\(Y^0\)) | \(\delta\) |
|---------|-------------------|----------------------|------------|
| 1       | 7                 | 1                    | 6          |
| 2       | 5                 | 6                    | -1         |
| 3       | 5                 | 1                    | 4          |
| 4       | 7                 | 8                    | -1         |
| 5       | 4                 | 2                    | 2          |
  • We can calculate the average treatment effect in that case:

\[ \delta = \frac{1}{5} \sum_{i=1}^{5} \delta_i = 2 \]

So the average treatment effect is 2 additional years of life from surgery relative to chemo.

Notice that not everyone benefits: patients 2 and 4 would each have been better off with chemo (\(\delta_i = -1\)).
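A short Python sketch can verify the unit-level effects and the ATE from the table above:

```python
# Potential outcomes from the five-patient table (years of post-treatment life)
y1 = [7, 5, 5, 7, 4]  # surgery, Y^1
y0 = [1, 6, 1, 8, 2]  # no surgery (chemo), Y^0

# Unit-level treatment effects: delta_i = Y_i^1 - Y_i^0
delta = [a - b for a, b in zip(y1, y0)]
print(delta)                     # [6, -1, 4, -1, 2]

# Average treatment effect: mean of the unit-level effects
ate = sum(delta) / len(delta)
print(ate)                       # 2.0
```

This calculation is only possible because we pretended to observe both potential outcomes for every patient.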

Adding the Switching Equation

  • Let’s now assume that we had a perfect doctor who sorted patients based on whether they would benefit from surgery.
| Patient | Surgery (\(Y^1\)) | No Surgery (\(Y^0\)) | \(D\) | \(Y\) |
|---------|-------------------|----------------------|-------|-------|
| 1       | 7                 | 1                    | 1     | 7     |
| 2       | 5                 | 6                    | 0     | 6     |
| 3       | 5                 | 1                    | 1     | 5     |
| 4       | 7                 | 8                    | 0     | 8     |
| 5       | 4                 | 2                    | 1     | 4     |

\(Y\) and \(D\) represent the observed outcome and treatment assignment, respectively.

From this we can see that the ATE can be calculated as a weighted average of the ATT and ATU.

The ATT is:

\[ ATT = \frac{1}{3} (7 + 5 + 4) - \frac{1}{3} (1 + 1 + 2) = 4 \]

and the ATU is:

\[ ATU = \frac{1}{2} (5 + 7) - \frac{1}{2} (6 + 8) = -1 \]

So the ATE, where \(\pi\) is the share of units treated (here \(\pi = 0.6\)), is:

\[ ATE = \pi ATT + (1-\pi)ATU = 0.6(4) + 0.4(-1) = 2 \]
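The ATT, ATU, and weighted ATE can be checked with a brief Python sketch using the table's numbers:

```python
# Potential outcomes and the doctor's assignment from the table
y1 = [7, 5, 5, 7, 4]
y0 = [1, 6, 1, 8, 2]
d  = [1, 0, 1, 0, 1]  # the "perfect doctor" treats whenever Y^1 > Y^0

treated   = [i for i in range(5) if d[i] == 1]
untreated = [i for i in range(5) if d[i] == 0]

# ATT: average effect among the treated; ATU: among the untreated
att = sum(y1[i] - y0[i] for i in treated) / len(treated)      # 4.0
atu = sum(y1[i] - y0[i] for i in untreated) / len(untreated)  # -1.0

# ATE as the treatment-share-weighted average of ATT and ATU
pi  = len(treated) / 5                                        # 0.6
ate = pi * att + (1 - pi) * atu
print(att, atu, round(ate, 6))
```

Again, `att` and `atu` are computable here only because the simulation exposes both potential outcomes.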

Simple Difference in Means

  • A naive approach would estimate the treatment effect as the simple difference in observed means:

\[ SDO = E[Y^1 | D = 1] - E[Y^0 | D = 0] \]

\[ = \frac{1}{3} (7 + 5 + 4) - \frac{1}{2} (6 + 8) \]

\[ = 5.33 - 7 = -1.67 \]

  • The simple difference in means is -1.67 years of life for those who receive surgery over not receiving it.
  • This means that the treatment group has a lower average lifespan than the control group even though the doctor sorted patients based on their outcomes. Why?
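The simple difference in outcomes can be computed from the observed data alone, sketched in Python:

```python
# Observed data only: outcomes Y and assignment D from the table
y = [7, 6, 5, 8, 4]
d = [1, 0, 1, 0, 1]

n = len(y)
# Mean observed outcome among the treated minus mean among the control
treated_mean = sum(y[i] for i in range(n) if d[i] == 1) / sum(d)        # 16/3
control_mean = sum(y[i] for i in range(n) if d[i] == 0) / (n - sum(d))  # 7.0
sdo = treated_mean - control_mean
print(round(sdo, 2))  # -1.67
```

Unlike the ATE above, this quantity needs no counterfactual information, which is exactly why it is tempting and, here, misleading.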

Selection Bias

  • This statistic is biased because the doctor sorted patients based on their outcomes, creating fundamental differences between the treatment and control groups that are directly related to the potential outcomes themselves.

Decomposing the Difference in Means

  • To show where this is coming from, we can decompose the simple difference in means into three parts:

\[ \begin{aligned} E\big[Y^1\mid D=1\big]-E\big[Y^0\mid D=0\big] & =ATE \nonumber \\ & + E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big] \nonumber \\ & + (1-\pi)(ATT-ATU) \end{aligned} \]

  • How do we get there?

Decomposing the Difference in Means

  • In order to get to that equation, we need to start with the \(ATE\) and work our way to the simple difference in means.

\[ ATE = \pi ATT + (1-\pi)ATU \]

\[ = \pi E[Y^1|D=1] - \pi E[Y^0|D=1] \]

\[ + (1-\pi)E[Y^1|D=0] - (1-\pi)E[Y^0|D=0] \]

\[ = \color{red}{\left(\pi E[Y^1|D=1] + (1-\pi)E[Y^1|D=0]\right)} \]

\[ - \color{blue}{\left( \pi E[Y^0|D=1] + (1-\pi)E[Y^0 | D=0] \right)} \]

This goes on for many lines…

Decomposing the Difference in Means

\[ \begin{aligned} E\big[Y^1\mid D=1\big] & = a \\ E\big[Y^1\mid D=0\big] & = b \\ E\big[Y^0\mid D=1\big] & = c \\ E\big[Y^0\mid D=0\big] & = d \\ ATE & = e \end{aligned} \]

\[ \begin{aligned} e & =\big\{\pi{a}+(1-\pi)b\big\}-\big\{\pi{c} + (1-\pi)d\big\} \\ e & =\pi{a}+b-\pi{b}-\pi{c} - d + \pi{d} \\ e & =\pi{a}+ b-\pi{b}-\pi{c} - d + \pi{d} + (\mathbf{a} - \mathbf{a}) + (\mathbf{c} - \mathbf{c}) + (\mathbf{d} - \mathbf{d}) \\ 0 & =e-\pi{a} - b + \pi{b} + \pi{c} + d - \pi{d} - \mathbf{a} + \mathbf{a} - \mathbf{c} + \mathbf{c} - \mathbf{d} + \mathbf{d} \\ \mathbf{a}-\mathbf{d} & =e-\pi{a} - b + \pi{b} + \pi{c} + d - \pi{d} +\mathbf{a} -\mathbf{c} +\mathbf{c} - \mathbf{d} \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + \mathbf{a}-\pi{a} - b + \pi{b} -\mathbf{c} + \pi{c} + d - \pi{d} \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + (1-\pi)a -(1-\pi)b + (1-\pi)d - (1-\pi)c \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + (1-\pi)(a-c) -(1-\pi)(b-d) \end{aligned} \]

Decomposing the Difference in Means

\[ \begin{aligned} \underbrace{\dfrac{1}{N_T} \sum_{i=1}^n \big(y_i\mid d_i=1\big)-\dfrac{1}{N_C} \sum_{i=1}^n \big(y_i\mid d_i=0\big)}_{ \text{Simple Difference in Outcomes}} &= \underbrace{E[Y^1] - E[Y^0]}_{ \text{Average Treatment Effect}} \\ &+ \underbrace{E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]}_{ \text{Selection bias}} \\ & + \underbrace{(1-\pi)(ATT - ATU)}_{ \text{Heterogeneous treatment effect bias}} \end{aligned} \]

  • This decomposition gives us three terms, each of which matters for any policy evaluation.
  • We know the left-hand side is equal to -1.67.
  • We know from above that the ATE is equal to 2.
  • The second term is the selection bias term.
  • It represents the inherent difference between the two groups if neither had received the treatment.
  • The first expectation is counterfactual; the second is observable in the data.
  • In this case, this quantity is: ?

Decomposing the Difference in Means

The quantity is:

\[ \frac{1}{3}(1+1+2) - \frac{1}{2}(6+8) = 4/3 - 7 = -5.667 \]

Decomposing the Difference in Means

  • The last term is the heterogeneous treatment effect bias.
  • It is the difference in returns to surgery between the two groups, multiplied by the share of the population that did not receive surgery.
  • \((4 - (-1)) \times 0.4 = 2\)

If we add these up, we get:

\[ -1.67 = 2 - 5.667 + 2 \]
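The full decomposition can be verified numerically with a short Python sketch, using the quantities computed earlier:

```python
# Terms of the decomposition from the worked surgery example
ate = 2.0

# Selection bias: E[Y^0 | D=1] - E[Y^0 | D=0]
selection_bias = (1 + 1 + 2) / 3 - (6 + 8) / 2     # 4/3 - 7 ≈ -5.667

# Heterogeneous treatment effect bias: (1 - pi) * (ATT - ATU)
att, atu, pi = 4.0, -1.0, 0.6
het_bias = (1 - pi) * (att - atu)                  # 0.4 * 5 = 2.0

# Their sum reproduces the simple difference in outcomes
sdo = ate + selection_bias + het_bias
print(round(sdo, 2))  # -1.67
```

The two bias terms nearly cancel here, but nothing guarantees that in general; either one can dominate.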

Decomposing the Difference in Means

  • In real life, though, we cannot observe the potential outcomes.
  • We cannot calculate each of these terms because we do not have access to what would have happened under the other treatment state.
  • We need strategies that eliminate these extra quantities so that we are left with the ATE.
  • More often than not, we need to figure out what must be true for the simple difference in means to estimate the ATE.
  • This is the central challenge of causal inference.

The Independence Assumption

  • It turns out that causal inference hinges on the ability to make a particular assumption.
  • This assumption is called the independence assumption.

\[ (Y^1,Y^0) \perp D \]

  • This means that the potential outcomes are independent of the treatment assignment.
  • This is a very strong assumption.
  • In the surgery example, this translates to the situation where the surgery was assigned in a way that had nothing to do with the gains to surgery.
  • This is violated by construction of the example.
  • A patient received the surgery if \(Y^1 > Y^0\) and did not receive the surgery if \(Y^1 < Y^0\).
    • This is exactly the opposite of independence: \(D\) depends directly on the potential outcomes.
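The violation of independence in this example can be made explicit with a one-line Python check:

```python
# Potential outcomes from the table
y1 = [7, 5, 5, 7, 4]
y0 = [1, 6, 1, 8, 2]

# The "perfect doctor" rule: treat exactly when Y^1 > Y^0.
# D is a deterministic function of the potential outcomes,
# so (Y^1, Y^0) cannot be independent of D.
d = [1 if a > b else 0 for a, b in zip(y1, y0)]
print(d)  # [1, 0, 1, 0, 1]
```

The reconstructed assignment matches the \(D\) column of the table exactly, confirming the sorting mechanism.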

The Independence Assumption

  • What if the doctor hadn’t done that?
  • It doesn’t necessarily mean they randomized the treatment.
  • It could mean that the doctor alphabetized the patients and then assigned the treatment based on that.
  • Or maybe the doctor assigned it based on something else that had nothing to do with the potential outcomes.
  • What would that mean?

The Independence Assumption

\[ E\big[Y^1\mid D=1\big] - E\big[Y^1\mid D=0\big]=0 \]

\[ E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]=0 \]

  • What does this mean?
  • It means that the average potential outcomes are the same for those who received the treatment and those who did not.
  • This zeroes out the selection bias term.
  • It also zeroes out the heterogeneous treatment effect bias term.
    • How?

\[ \begin{aligned} ATT = E\big[Y^1\mid D=1\big] - E\big[Y^0\mid D=1\big] \\ ATU = E\big[Y^1\mid D=0\big] - E\big[Y^0\mid D=0\big] \end{aligned} \]

This leaves us with the result that the simple difference in means equals the ATE.
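Why randomization delivers this result can be illustrated with a small simulation (the population parameters are made up for illustration):

```python
import random

random.seed(0)

# Simulated population with heterogeneous treatment effects (illustrative)
n = 100_000
y0 = [random.gauss(5, 2) for _ in range(n)]
y1 = [y + random.gauss(2, 1) for y in y0]   # true ATE = 2

# Random assignment makes (Y^1, Y^0) independent of D
d = [random.random() < 0.5 for _ in range(n)]

# Simple difference in observed means
sdo = (sum(y1[i] for i in range(n) if d[i]) / sum(d)
       - sum(y0[i] for i in range(n) if not d[i]) / (n - sum(d)))
print(sdo)  # close to the true ATE of 2
```

Because assignment ignores the potential outcomes, both bias terms vanish in expectation and the simple difference recovers the ATE up to sampling noise.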

The Independence Assumption

  • This, however, is not very realistic in the observational setting.
  • In the observational setting, we need to make assumptions about the data generating process.
  • This means that a simple comparison of means is no longer enough to identify a causal effect.

SUTVA

  • The Stable Unit Treatment Value Assumption (SUTVA) is an additional assumption built into the potential outcomes notation itself.
  • The potential outcomes framework places limits on how we can calculate treatment effects.
  • SUTVA requires that each unit receives the same sized dose, that there are no spillovers onto other units’ potential outcomes when a unit is exposed to the treatment, and that there are no general equilibrium effects.

Homogeneous Treatment

  • Treatment is received in homogeneous doses to all units.
    • It’s easy to imagine violations of this, though—for instance, if some doctors are better surgeons than others.

No Spillovers

  • Second, SUTVA implies that there are no externalities, because by definition an externality spills over to other untreated units.
    • In other words, if unit 1 receives the treatment and there is some externality, then unit 2 will have a different \(Y^0\) value than if unit 1 had not received the treatment.
  • When there are such spillovers, such as when we are working with social network data, we need models that explicitly account for these SUTVA violations.

General Equilibrium

  • Let’s say we are estimating the causal effect of returns to schooling. The increase in college education would in general equilibrium cause a change in relative wages that is different from what happens under partial equilibrium.
  • This kind of scaling-up issue is of common concern when one considers extrapolating from the experimental design to the large-scale implementation of an intervention in some population.