Potential Outcomes

Author

Aleksandr Michuda

Agenda

  • Experimental Design
  • Potential Outcomes in Randomized Designs
  • Heavily influenced by Section 4.1 of Causal Inference: The Mixtape by Scott Cunningham.

Experimental Design

  • Statistical studies can be classified as being experimental or observational.
  • In an experiment, the researcher attempts to control one or more factors so that the effect of a treatment can be studied.
  • In an observational study, the researcher simply observes the relationship between two or more variables.
  • Causal relationships are easier to establish in experimental studies.
  • Analysis of Variance (ANOVA) can be used to analyze the data obtained from an experimental or observational study.

Potential Outcomes

  • Why does randomization help?
  • What is the difference between an experimental design and an observational study, really?
  • The potential outcomes framework is a way to think about these questions.

Actual Outcomes

  • For simplicity, we will assume a binary variable that takes on a value of 1 if a particular unit receives the treatment and a 0 if it does not.
  • Each unit will have two potential outcomes, but only one observed outcome. Potential outcomes are defined as \(Y_i^1\) if the unit received the treatment and \(Y_i^0\) if it did not.
    • Notice that both potential outcomes carry the same subscript \(i\): they refer to the same unit.
    • They are two separate states of the world for the exact same unit at the exact same moment in time.
    • The state of the world where the treatment occurred is the treatment state (\(Y^1\)); the state of the world where the treatment did not occur is the control state (\(Y^0\)).
  • An observed outcome is just \(Y_i\); they are the actual, realized data.

The Switching Equation

  • The switching equation is a way to think about the difference between the potential outcomes and the observed outcomes.
  • The switching equation is defined as:

\[ Y_i = D_iY_i^1 + (1-D_i)Y_i^0 \]

where \(D_i\) is a binary variable that takes on a value of 1 if the unit received the treatment and 0 if it did not.
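The switching equation can be made concrete with a minimal Python sketch (the numbers are purely illustrative, not from the slides):

```python
# Hypothetical potential outcomes for three units (illustrative numbers only)
y1 = [7, 5, 5]   # potential outcomes under treatment, Y_i^1
y0 = [1, 6, 1]   # potential outcomes under control,  Y_i^0
d  = [1, 0, 1]   # treatment assignment, D_i

# The switching equation: Y_i = D_i * Y_i^1 + (1 - D_i) * Y_i^0
# D_i "switches on" whichever potential outcome is actually realized.
y = [di * y1i + (1 - di) * y0i for di, y1i, y0i in zip(d, y1, y0)]
print(y)  # [7, 6, 5]
```

Note that the data we ever get to see is just `y` and `d`; the full lists `y1` and `y0` exist only in this simulation.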

The Fundamental Problem of Causal Inference

  • The fundamental problem of causal inference is that we can only ever observe one of the two potential outcomes for each unit.
  • What we would ideally like is the treatment effect for each unit, defined as:

\[ \delta_i = Y_i^1 - Y_i^0 \]

But we only ever observe one of the two terms on the right-hand side, so \(\delta_i\) can never be computed directly.

Average Treatment Effect

From this definition, we get three interesting quantities:

  • The average treatment effect (ATE) is defined as:

\[ \delta = E[Y^1 - Y^0] = E[Y^1] - E[Y^0] \]

  • The average treatment effect on the treated (ATT) is defined as:

\[ \delta_{ATT} = E[Y^1 - Y^0 | D = 1] \]

\[ = E[Y^1 | D = 1] - E[Y^0 | D = 1] \]

  • The average treatment effect on the untreated (ATU) is defined as:

\[ \delta_{ATU} = E[Y^1 - Y^0 | D = 0] \]

\[ = E[Y^1 | D = 0] - E[Y^0 | D = 0] \]

All three of these quantities are population means and are fundamentally unobservable.

The ATT

  • The ATT is the average treatment effect for those who actually received the treatment.
  • The ATT is almost always different from the ATE because people will sort into a treatment based on whether it will help them.
  • Notice that we need two observations for each unit to calculate the ATT.

The ATU

  • Similar to the ATT, the ATU is the average treatment effect for those who did not receive the treatment.
  • We also need two observations on each unit to calculate the ATU.
  • \(ATT \neq ATU\) in general, because there is heterogeneity in treatment effects
    • The treatment effect is different for different people

Potential Outcomes as Difference in Means

  • Suppose we could observe the potential outcomes of five patients choosing between surgery and chemo, measured as post-treatment lifespan in years.
  • In that case we would just have a matched sample of the potential outcomes.

| Patient | Surgery (\(Y^1\)) | No Surgery (\(Y^0\)) | \(\delta\) |
|---------|-------------------|----------------------|------------|
| 1       | 7                 | 1                    | 6          |
| 2       | 5                 | 6                    | -1         |
| 3       | 5                 | 1                    | 4          |
| 4       | 7                 | 8                    | -1         |
| 5       | 4                 | 2                    | 2          |
  • We can calculate the average treatment effect in that case:

\[ \delta = \frac{1}{5} \sum_{i=1}^{5} \delta_i = 2 \]

So the average treatment effect is 2 additional years of life from surgery relative to chemo.

Notice that not everyone benefits: patients 2 and 4 would each have been better off with chemo (\(\delta_i = -1\)).
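A short Python sketch can verify the unit-level effects and the ATE from the table above:

```python
# Potential outcomes from the five-patient table (years of post-treatment life)
y1 = [7, 5, 5, 7, 4]  # surgery, Y^1
y0 = [1, 6, 1, 8, 2]  # no surgery (chemo), Y^0

# Unit-level treatment effects: delta_i = Y_i^1 - Y_i^0
delta = [a - b for a, b in zip(y1, y0)]
print(delta)                     # [6, -1, 4, -1, 2]

# Average treatment effect: mean of the unit-level effects
ate = sum(delta) / len(delta)
print(ate)                       # 2.0
```

This calculation is only possible because we pretended to observe both potential outcomes for every patient.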

Adding the Switching Equation

  • Let’s now assume that we had a perfect doctor who sorted patients based on whether they would benefit from surgery.
| Patient | Surgery (\(Y^1\)) | No Surgery (\(Y^0\)) | \(D\) | \(Y\) |
|---------|-------------------|----------------------|-------|-------|
| 1       | 7                 | 1                    | 1     | 7     |
| 2       | 5                 | 6                    | 0     | 6     |
| 3       | 5                 | 1                    | 1     | 5     |
| 4       | 7                 | 8                    | 0     | 8     |
| 5       | 4                 | 2                    | 1     | 4     |

\(Y\) and \(D\) represent the observed outcome and treatment assignment, respectively.

From this we can see that the ATE can be calculated as a weighted average of the ATT and ATU.

The ATT is:

\[ ATT = \frac{1}{3} (7 + 5 + 4) - \frac{1}{3} (1 + 1 + 2) = 4 \]

and the ATU is:

\[ ATU = \frac{1}{2} (5 + 7) - \frac{1}{2} (6 + 8) = -1 \]

So the ATE, where \(\pi\) is the share of units treated (here \(\pi = 0.6\)), is:

\[ ATE = \pi ATT + (1-\pi)ATU = 0.6(4) + 0.4(-1) = 2 \]
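The ATT, ATU, and weighted ATE can be checked with a brief Python sketch using the table's numbers:

```python
# Potential outcomes and the doctor's assignment from the table
y1 = [7, 5, 5, 7, 4]
y0 = [1, 6, 1, 8, 2]
d  = [1, 0, 1, 0, 1]  # the "perfect doctor" treats whenever Y^1 > Y^0

treated   = [i for i in range(5) if d[i] == 1]
untreated = [i for i in range(5) if d[i] == 0]

# ATT: average effect among the treated; ATU: among the untreated
att = sum(y1[i] - y0[i] for i in treated) / len(treated)      # 4.0
atu = sum(y1[i] - y0[i] for i in untreated) / len(untreated)  # -1.0

# ATE as the treatment-share-weighted average of ATT and ATU
pi  = len(treated) / 5                                        # 0.6
ate = pi * att + (1 - pi) * atu
print(att, atu, round(ate, 6))
```

Again, `att` and `atu` are computable here only because the simulation exposes both potential outcomes.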

Simple Difference in Means

  • A naive approach would estimate the treatment effect as the simple difference in observed means:

\[ SDO = E[Y^1 | D = 1] - E[Y^0 | D = 0] \]

\[ = \frac{1}{3} (7 + 5 + 4) - \frac{1}{2} (6 + 8) \]

\[ = 5.33 - 7 = -1.67 \]

  • The simple difference in means is -1.67 years of life for those who receive surgery over not receiving it.
  • This means that the treatment group has a lower average lifespan than the control group even though the doctor sorted patients based on their outcomes. Why?
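The simple difference in outcomes can be computed from the observed data alone, sketched in Python:

```python
# Observed data only: outcomes Y and assignment D from the table
y = [7, 6, 5, 8, 4]
d = [1, 0, 1, 0, 1]

n = len(y)
# Mean observed outcome among the treated minus mean among the control
treated_mean = sum(y[i] for i in range(n) if d[i] == 1) / sum(d)        # 16/3
control_mean = sum(y[i] for i in range(n) if d[i] == 0) / (n - sum(d))  # 7.0
sdo = treated_mean - control_mean
print(round(sdo, 2))  # -1.67
```

Unlike the ATE above, this quantity needs no counterfactual information, which is exactly why it is tempting and, here, misleading.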

Selection Bias

  • This statistic is biased because the doctor sorted patients based on their outcomes, creating fundamental differences between the treatment and control groups that are directly related to the potential outcomes themselves.

Decomposing the Difference in Means

  • To show where this is coming from, we can decompose the simple difference in means into three parts:

\[ \begin{aligned} E\big[Y^1\mid D=1\big]-E\big[Y^0\mid D=0\big] & =ATE \nonumber \\ & + E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big] \nonumber \\ & + (1-\pi)(ATT-ATU) \end{aligned} \]

  • How do we get there?

Decomposing the Difference in Means

  • In order to get to that equation, we need to start with the \(ATE\) and work our way to the simple difference in means.

\[ ATE = \pi ATT + (1-\pi)ATU \]

\[ = \pi E[Y^1|D=1] - \pi E[Y^0|D=1] \]

\[ + (1-\pi)E[Y^1|D=0] - (1-\pi)E[Y^0|D=0] \]

\[ = \color{red}{\left(\pi E[Y^1|D=1] + (1-\pi)E[Y^1|D=0]\right)} \]

\[ - \color{blue}{\left( \pi E[Y^0|D=1] + (1-\pi)E[Y^0 | D=0] \right)} \]

This goes on for many lines…

Decomposing the Difference in Means

\[ \begin{aligned} E\big[Y^1\mid D=1\big] & = a \\ E\big[Y^1\mid D=0\big] & = b \\ E\big[Y^0\mid D=1\big] & = c \\ E\big[Y^0\mid D=0\big] & = d \\ ATE & = e \end{aligned} \]

\[ \begin{aligned} e & =\big\{\pi{a}+(1-\pi)b\big\}-\big\{\pi{c} + (1-\pi)d\big\} \\ e & =\pi{a}+b-\pi{b}-\pi{c} - d + \pi{d} \\ e & =\pi{a}+ b-\pi{b}-\pi{c} - d + \pi{d} + (\mathbf{a} - \mathbf{a}) + (\mathbf{c} - \mathbf{c}) + (\mathbf{d} - \mathbf{d}) \\ 0 & =e-\pi{a} - b + \pi{b} + \pi{c} + d - \pi{d} - \mathbf{a} + \mathbf{a} - \mathbf{c} + \mathbf{c} - \mathbf{d} + \mathbf{d} \\ \mathbf{a}-\mathbf{d} & =e-\pi{a} - b + \pi{b} + \pi{c} + d - \pi{d} +\mathbf{a} -\mathbf{c} +\mathbf{c} - \mathbf{d} \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + \mathbf{a}-\pi{a} - b + \pi{b} -\mathbf{c} + \pi{c} + d - \pi{d} \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + (1-\pi)a -(1-\pi)b + (1-\pi)d - (1-\pi)c \\ \mathbf{a}-\mathbf{d} & =e + (\mathbf{c} -\mathbf{d}) + (1-\pi)(a-c) -(1-\pi)(b-d) \end{aligned} \]

Decomposing the Difference in Means

\[ \begin{aligned} \underbrace{\dfrac{1}{N_T} \sum_{i=1}^n \big(y_i\mid d_i=1\big)-\dfrac{1}{N_C} \sum_{i=1}^n \big(y_i\mid d_i=0\big)}_{ \text{Simple Difference in Outcomes}} &= \underbrace{E[Y^1] - E[Y^0]}_{ \text{Average Treatment Effect}} \\ &+ \underbrace{E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]}_{ \text{Selection bias}} \\ & + \underbrace{(1-\pi)(ATT - ATU)}_{ \text{Heterogeneous treatment effect bias}} \end{aligned} \]

  • This decomposition gives us three terms, each of which matters for any policy evaluation.
  • We know the left-hand side is equal to -1.67.
  • We know from above that the ATE is equal to 2.
  • The second term is the selection bias term.
  • It represents the inherent difference between the two groups if neither had received the treatment.
  • The first expectation is counterfactual; the second is observable in the data.
  • In this case, this quantity is: ?

Decomposing the Difference in Means

The quantity is:

\[ \frac{1}{3}(1+1+2) - \frac{1}{2}(6+8) = 4/3 - 7 = -5.667 \]

Decomposing the Difference in Means

  • The last term is the heterogeneous treatment effect bias.
  • It is the difference in returns to surgery between the two groups, multiplied by the share of the population that did not receive surgery.
  • \((4 - (-1)) \times 0.4 = 2\)

If we add these up, we get:

\[ -1.67 = 2 - 5.667 + 2 \]
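The full decomposition can be verified numerically with a short Python sketch, using the quantities computed earlier:

```python
# Terms of the decomposition from the worked surgery example
ate = 2.0

# Selection bias: E[Y^0 | D=1] - E[Y^0 | D=0]
selection_bias = (1 + 1 + 2) / 3 - (6 + 8) / 2     # 4/3 - 7 ≈ -5.667

# Heterogeneous treatment effect bias: (1 - pi) * (ATT - ATU)
att, atu, pi = 4.0, -1.0, 0.6
het_bias = (1 - pi) * (att - atu)                  # 0.4 * 5 = 2.0

# Their sum reproduces the simple difference in outcomes
sdo = ate + selection_bias + het_bias
print(round(sdo, 2))  # -1.67
```

The two bias terms nearly cancel here, but nothing guarantees that in general; either one can dominate.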

Decomposing the Difference in Means

  • In real life, though, we cannot observe the potential outcomes.
  • We cannot calculate each of these terms because we do not have access to what would have happened under the other treatment state.
  • We need strategies that eliminate these extra quantities so that we are left with the ATE.
  • More often than not, we need to figure out what must be true for the simple difference in means to estimate the ATE.
  • This is the central challenge of causal inference.

The Independence Assumption

  • It turns out that causal inference hinges on the ability to make a particular assumption.
  • This assumption is called the independence assumption.

\[ (Y^1,Y^0) \perp D \]

  • This means that the potential outcomes are independent of the treatment assignment.
  • This is a very strong assumption.
  • In the surgery example, this translates to the situation where the surgery was assigned in a way that had nothing to do with the gains to surgery.
  • This is violated by construction of the example.
  • A patient received the surgery if \(Y^1 > Y^0\) and did not receive the surgery if \(Y^1 < Y^0\).
    • This is exactly the opposite of independence: \(D\) depends directly on the potential outcomes.
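The violation of independence in this example can be made explicit with a one-line Python check:

```python
# Potential outcomes from the table
y1 = [7, 5, 5, 7, 4]
y0 = [1, 6, 1, 8, 2]

# The "perfect doctor" rule: treat exactly when Y^1 > Y^0.
# D is a deterministic function of the potential outcomes,
# so (Y^1, Y^0) cannot be independent of D.
d = [1 if a > b else 0 for a, b in zip(y1, y0)]
print(d)  # [1, 0, 1, 0, 1]
```

The reconstructed assignment matches the \(D\) column of the table exactly, confirming the sorting mechanism.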

The Independence Assumption

  • What if the doctor hadn’t done that?
  • It doesn’t necessarily mean they randomized the treatment.
  • It could mean that the doctor alphabetized the patients and then assigned the treatment based on that.
  • Or maybe the doctor assigned it based on something else that had nothing to do with the potential outcomes.
  • What would that mean?

The Independence Assumption

\[ E\big[Y^1\mid D=1\big] - E\big[Y^1\mid D=0\big]=0 \]

\[ E\big[Y^0\mid D=1\big] - E\big[Y^0\mid D=0\big]=0 \]

  • What does this mean?
  • It means that the average potential outcomes are the same for those who received the treatment and those who did not.
  • This zeroes out the selection bias term.
  • It also zeroes out the heterogeneous treatment effect bias term.
    • How?

\[ \begin{aligned} ATT = E\big[Y^1\mid D=1\big] - E\big[Y^0\mid D=1\big] \\ ATU = E\big[Y^1\mid D=0\big] - E\big[Y^0\mid D=0\big] \end{aligned} \]

This leaves us with the result that the simple difference in means equals the ATE.
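Why randomization delivers this result can be illustrated with a small simulation (the population parameters are made up for illustration):

```python
import random

random.seed(0)

# Simulated population with heterogeneous treatment effects (illustrative)
n = 100_000
y0 = [random.gauss(5, 2) for _ in range(n)]
y1 = [y + random.gauss(2, 1) for y in y0]   # true ATE = 2

# Random assignment makes (Y^1, Y^0) independent of D
d = [random.random() < 0.5 for _ in range(n)]

# Simple difference in observed means
sdo = (sum(y1[i] for i in range(n) if d[i]) / sum(d)
       - sum(y0[i] for i in range(n) if not d[i]) / (n - sum(d)))
print(sdo)  # close to the true ATE of 2
```

Because assignment ignores the potential outcomes, both bias terms vanish in expectation and the simple difference recovers the ATE up to sampling noise.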

The Independence Assumption

  • This, however, is not very realistic in the observational setting.
  • In the observational setting, we need to make assumptions about the data generating process.
  • This means that a simple comparison of means is no longer enough to identify a causal effect.

SUTVA

  • The Stable Unit Treatment Value Assumption (SUTVA) is an additional assumption built into the potential outcomes notation itself.
  • The potential outcomes framework places limits on how we can calculate treatment effects.
  • SUTVA requires that each unit receives the same sized dose, that there are no spillovers onto other units’ potential outcomes when a unit is exposed to the treatment, and that there are no general equilibrium effects.

Homogeneous Treatment

  • Treatment is received in homogeneous doses to all units.
    • It’s easy to imagine violations of this, though—for instance, if some doctors are better surgeons than others.

No Spillovers

  • Second, SUTVA implies that there are no externalities, because by definition an externality spills over to other untreated units.
    • In other words, if unit 1 receives the treatment and there is some externality, then unit 2 will have a different \(Y^0\) value than if unit 1 had not received the treatment.
  • When there are such spillovers, such as when we are working with social network data, we need models that explicitly account for these SUTVA violations.

General Equilibrium

  • Let’s say we are estimating the causal effect of returns to schooling. The increase in college education would in general equilibrium cause a change in relative wages that is different from what happens under partial equilibrium.
  • This kind of scaling-up issue is of common concern when one considers extrapolating from the experimental design to the large-scale implementation of an intervention in some population.