Sample Selection Bias

Author

Aleksandr Michuda

Agenda

First Presentation!
Evaluations
What is sample selection bias and why is it similar to selection bias from potential outcomes?
How can we relate sample selection bias to an omitted variable bias problem?

Motivating Example

Data on California Schools
Academic Performance Index (API) scores vs. Free School Lunch Eligibility
Negative slope, but also not a causal relationship
Let’s say this is the full population
I want to randomly sample a subset of schools to do an analysis
But let’s say I couldn’t randomly sample schools
I had to send out a survey to schools and they would decide whether or not to respond

Selecting Schools based on College Graduates

Let’s say I sent out a survey to schools in the name of the California Department of Education, announcing that need-based funding would go to responding schools where few parents had finished college.
What would happen?

Selecting Schools based on API

Or perhaps, I sent out a survey to schools about extra funding to high-performing schools

Selecting Based on FLE

What about if I sent it out based on neediness in the free lunch program?

Sample Selection Bias

In each of these cases, voluntary response meant that I was getting a sample from different parts of the population.
In the same way, in potential outcomes, the treatment and control groups are not comparable.
Each is coming from a different subset of the population.
This is sample selection bias.
Any kind of selection bias can be understood as a kind of sample selection bias coming from a non-representative sample.

Sample Selection Bias and Omitted Variable Bias

Let’s take a simple model:

\[ API_i = \beta_0 + \beta_1 FLE_i + \varepsilon_i \]

Non-representative sampling, as it turns out, causes bias in the slope parameters of your regression.

\[ cov(FLE_i, \varepsilon_i) \neq 0 \]

The population you care about and the population your sample actually represents are not the same.

Two Populations

Your population model is this:

\[ API_i = \beta_0 + \beta_1 FLE_i + \varepsilon_i \]

For $b_1$ to be unbiased, we need $cov(FLE_i, \varepsilon_i) = 0$

But let’s say your survey only got you back responses from those that had low parental education levels.

The population model of that sample is this:

\[ API_i = \beta_0^{low} + \beta_1^{low} FLE_i + \varepsilon_i^{low} \]

For $b_1^{low}$ to be unbiased, we need $cov(FLE_i, \varepsilon_i^{low}) = 0$

Two Populations

Since these are both regressions of $API$, we can equate them.

\[ \beta_0 + \beta_1 FLE_i + \varepsilon_i = \beta_0^{low} + \beta_1^{low} FLE_i + \varepsilon_i^{low} \]

And solve for $\varepsilon_i$

\[ \varepsilon_i = \beta_0^{low} - \beta_0 + (\beta_1^{low} - \beta_1) FLE_i + \varepsilon_i^{low} \]

If $\beta_1^{low} \neq \beta_1$, then $cov(FLE_i, \varepsilon_i) \neq 0$ and we have bias in our estimate of $b_1$.
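
To see why, one can take the covariance of both sides of the last equation with $FLE_i$ inside the selected sample. This is a sketch that assumes $cov(FLE_i, \varepsilon_i^{low}) = 0$, the condition stated for the low-education sample:

\[ cov(FLE_i, \varepsilon_i) = (\beta_1^{low} - \beta_1) \, var(FLE_i) + cov(FLE_i, \varepsilon_i^{low}) = (\beta_1^{low} - \beta_1) \, var(FLE_i) \]

So the covariance is zero only if the two slopes coincide (or $FLE_i$ does not vary in the sample).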

Selecting with low parental education levels -> bias

Selecting with high API -> bias

Selecting with high FLE -> no bias!
Why?

Selecting on FLE is really the same thing as conditioning on a covariate: within the selected range of FLE, you are still sampling from the same population.
You are still getting ALL API scores at those FLE values, not just the ones that are high or low.
Selection bias can also be seen as a missing data problem.
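
The three selection rules can be compared in a small simulation. This is only a sketch on made-up data: the variable names (`college`, `fle`, `api`), the coefficients, and the cutoffs at zero are all assumptions for illustration, not the California data used in the figures.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000  # simulated "population" of schools (hypothetical, standardized units)

# Assumed data-generating process, NOT the real California data:
college = rng.normal(size=n)                           # parental college share
fle = -0.6 * college + 0.8 * rng.normal(size=n)        # free lunch eligibility
api = -1.0 * fle + 1.0 * college + rng.normal(size=n)  # performance index


def slope(x, y):
    """OLS slope from regressing y on x with an intercept."""
    return np.polyfit(x, y, 1)[0]


# Slope in the full population: the benchmark.
full = slope(fle, api)

# Three voluntary-response scenarios, each keeping roughly half the schools.
low_college = college < 0   # only schools with few college-educated parents
high_api = api > 0          # only high-performing schools
high_fle = fle > 0          # only schools with high free lunch eligibility

b_low_college = slope(fle[low_college], api[low_college])
b_high_api = slope(fle[high_api], api[high_api])
b_high_fle = slope(fle[high_fle], api[high_fle])

print(f"full population:       {full: .3f}")
print(f"low parental college:  {b_low_college: .3f}")
print(f"high API:              {b_high_api: .3f}")
print(f"high FLE:              {b_high_fle: .3f}")
```

In this setup, only the sample selected on FLE (the regressor) should give a slope close to the full-population one; the other two rules select on something related to the error term or on the outcome itself.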

Solutions to Sample Selection Bias

Random sampling
Can you use an instrumental variable?
- It would be tough.
- You would need an instrument that influences selection into the sample but not the outcome itself, which may not be easy to find.
Or with some more assumptions, you can use a Heckman correction.

Motivating Example

Mexican migrants to the US send more than 25 billion dollars back to Mexico each year in the form of remittances.
What determines the amount of remittances sent?
Education, family size, gender, work experience?
Let’s say we knew the factors that affected the amount of remittances sent.

Model with Selection Bias

Our population model in this case would be:

\[ R_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \varepsilon_i \]

Potential Selection Problem:
- We only observe remittances for those who migrated
- May not be representative of future migrants
- We don’t know how much remittances would be for those who didn’t migrate
- We don’t know how much remittances would be for those who did migrate but didn’t send remittances
Important Question:
What is our population of interest?
- Mexicans already in the US? Not a problem
- Potential Migrants? Problem

Some Modeling

What do we actually observe?
Let’s say we had data on everyone in Mexico

\[ R_i = \begin{cases} \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 & \text{if Migrated} \\ 0 & \text{if Not Migrated} \end{cases} \]

We don’t observe the remittances from those that don’t migrate, so they just get a 0.
But let’s say we knew the factors that affected the probability of migration.
Education, family size, gender, work experience?
- Maybe also, income needs like desire or previous purchase of a durable good?
Let’s say that $I$ is a binary variable for the decision to migrate. We can model the decisions like so:

\[ I_i = \alpha_0 + \alpha_1Z_1 + \alpha_2Z_2 + \alpha_3Z_3 + \alpha_4 X_1 + \alpha_5 X_2 + \alpha_6 X_3 + \nu_i \]

Remember that since $I$ is a binary variable, the linear regression here would be a linear probability model.
So the predicted values of $I$, $\hat{I}$, would be the probability of migration.

Simple Heckman Correction

A simple Heckman correction would be to calculate $\hat{I}$ and then use it as a regressor in the original model.
So we would have:

\[ R_i = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4\hat{I} + \varepsilon_i \]

$\hat{I}$ acts as a “sponge” that soaks up the bias caused by the selection.
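
The two steps can be sketched on simulated data. Everything here is an assumption made for illustration: one $X$ and one $Z$ instead of three each, the coefficient values, the error correlation of 0.8, and the choice to fit the second step on migrants only. It follows the simplified version described on this slide (a linear probability model for $\hat{I}$); the standard Heckman procedure instead uses a probit first stage and the inverse Mills ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process (all names and numbers are assumptions).
x = rng.normal(size=n)  # appears in both equations (e.g. education)
z = rng.normal(size=n)  # affects migration only (e.g. migrants in the family)

# Correlated errors: unobservables that drive migration also drive remittances.
errs = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
eps, nu = errs[:, 0], errs[:, 1]

migrated = 0.5 * x + 1.0 * z + nu > 0  # selection rule
remit = 1.0 + 2.0 * x + eps            # true beta_1 = 2; observed only if migrated


def ols(y, *cols):
    """OLS coefficients (intercept first) of y on the given columns."""
    design = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Step 1: linear probability model of migration, fitted on everyone.
alpha = ols(migrated.astype(float), z, x)
i_hat = alpha[0] + alpha[1] * z + alpha[2] * x  # predicted migration probability

# Step 2: remittance regression on migrants, without and with i_hat.
naive = ols(remit[migrated], x[migrated])
corrected = ols(remit[migrated], x[migrated], i_hat[migrated])

print(f"true beta_1:      2.000")
print(f"naive beta_1:     {naive[1]: .3f}")
print(f"corrected beta_1: {corrected[1]: .3f}")
```

In this particular design the corrected slope should land much closer to the true value than the naive one. With other designs, a linear $\hat{I}$ may only partly absorb the selection term.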

The Migrant Model

The Remittance Model

People with migrants in their family are more likely to migrate
Males are also more likely to migrate.
Without correcting for selection, males remit $968 more than females on average.
- Likely migrants tend to remit less
- Randomly selected migrants, who would not have this family history, would remit more on average
Education and work experience don’t affect remittances significantly.

How does this relate to Potential Outcomes?

Let’s get back to potential outcomes.
The selection bias problem we saw there is the same as the one we saw here.
We only observe those that received the treatment and those that didn’t, but not their counterfactuals.

Back to the Switching Equation

\[ Y_i = D_i Y_i^1 + (1 - D_i) Y_i^0 \]

If we rearrange, this takes the form of a regression of $Y_i$ on $D_i$:

\[ Y_i = Y_i^0 + D_i (Y_i^1 - Y_i^0) \]

If we run the following regression:

\[ Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i \]

$\beta_1$ is estimating the simple difference in means.
But if we have selection bias, then when we run this regression, the sample selection term goes into the error term.

\[ Y_i = \beta_0 + ATE \cdot D_i + SampleSelection + \varepsilon_i \]
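
One standard way to spell out the selection term is to decompose the simple difference in means using the potential outcomes. This is a textbook identity added here for illustration; note that it isolates the effect on the treated rather than the ATE:

\[ E[Y_i \mid D_i = 1] - E[Y_i \mid D_i = 0] = \underbrace{E[Y_i^1 - Y_i^0 \mid D_i = 1]}_{\text{effect on the treated}} + \underbrace{E[Y_i^0 \mid D_i = 1] - E[Y_i^0 \mid D_i = 0]}_{\text{selection bias}} \]

The second term is zero when the treated and untreated would have had the same average outcome without treatment, as under random assignment.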

Can we fix this?

We can use the same Heckman correction we used before.
If we knew the factors that affected the probability of treatment, we could use them to correct for selection bias.
We could use the predicted values of the treatment as a regressor in the original model.
So we would have:

\[ Y_i = \beta_0 + \beta_1 D_i + \beta_2 \hat{I} + \varepsilon_i \]

You can think of $\hat{I}$ as a kind of proxy for the selection bias.
This is a very simple way to implement this; more complicated versions also correct the standard errors, among other things.
If we had random assignment to treatment, what would we expect from a hypothesis test of the significance of $\hat{I}$?
This is also the first step into the wonderful world of matching.
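
A minimal simulated sketch of this regression follows. The setup is assumed for illustration: a single observed factor `z` drives both treatment take-up and the untreated outcome, the true effect is a constant 2, and $\hat{I}$ comes from a linear probability model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical setup: z drives both treatment take-up and the untreated outcome.
z = rng.normal(size=n)
d = (z + rng.normal(size=n) > 0).astype(float)  # self-selected treatment
y0 = 1.0 + z + rng.normal(size=n)               # potential outcome without treatment
y1 = y0 + 2.0                                   # constant treatment effect of 2
y = d * y1 + (1 - d) * y0                       # switching equation


def ols(y, *cols):
    """OLS coefficients (intercept first) of y on the given columns."""
    design = np.column_stack([np.ones(len(y)), *cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


# Predicted probability of treatment from a linear probability model.
alpha = ols(d, z)
i_hat = alpha[0] + alpha[1] * z

naive = ols(y, d)             # simple difference in means
corrected = ols(y, d, i_hat)  # adds i_hat as a regressor

print(f"true effect:      2.000")
print(f"naive beta_1:     {naive[1]: .3f}")
print(f"corrected beta_1: {corrected[1]: .3f}")
```

Here the naive coefficient mixes the treatment effect with the selection term, while adding `i_hat` should bring the estimate back near 2. This works cleanly only because the simulated selection depends on an observed `z`.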
