EC031-S26
Note: The problems may differ based on the edition of the textbook you have.
In econometrics, we are often very concerned (almost to a fault) with estimators that are unbiased. However, and this is true of IV, unbiasedness can come at a cost: increased variance. Discuss why this might be an issue, especially when it comes to hypothesis testing.
A researcher is investigating the effect of study hours on college grades. They are concerned that study hours might be correlated with unobserved factors, such as motivation, which could bias the estimates. To address this, the researcher proposes using “campus library opening hours” as an instrumental variable for study hours.
Explain why “campus library opening hours” might be a valid instrumental variable for study hours. Discuss the two key assumptions for IV validity: relevance and exogeneity.
If the researcher decided to estimate a linear regression model of college GPA on study hours directly, what might be the consequences of omitted variable bias? Would the coefficient on study hours be biased up or down?
Suppose you have data on all U.S. public schools and are interested in running a regression of a school’s average test scores (\(Y\)) on the number of teachers at the school (\(X\)).
\[ \text{Test scores}_i=\beta_0+\beta_1\text{Number of teachers}_i+\varepsilon_i \]
Explain what heteroskedasticity means (which of our CLRM assumptions is violated?), and why you might be worried about it in this model.
Suppose you run OLS, ignoring the possibility of heteroskedasticity. What is the consequence for your OLS estimate of \(\beta_1\)?
If the variance of the error terms increases with the number of teachers, will your OLS estimate, \(b_1\), be larger or smaller than the true population parameter, \(\beta_1\)?
What visual/graphical diagnostic approach could you use to detect heteroskedasticity?
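For instance, one common graphical check in Stata (a sketch; the variable names test_scores and n_teachers are hypothetical stand-ins for the model above) is to plot residuals against fitted values after running the regression:

```stata
* Hypothetical variable names for the test-score model
reg test_scores n_teachers
* Residual-versus-fitted plot; a fan or funnel shape suggests heteroskedasticity
rvfplot
```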
What is the difference between statistically significant and economically significant? Give an example of a statistically significant result that is not economically significant, and vice versa. How is this related to the idea of a “precise zero”?
Suppose we need to measure the impact of a training program on income six months after the training. We decide to design an experiment where we randomly assign individuals either to receive the training program or not. We have access to several variables that describe the individuals in the sample:
age: age of the individual
years_work: number of years working
education: number of years of education

Suppose that the data-generating process is that income is a function of training, age, years of work, and education:
\[ \text{Income} = \beta_0 + \beta_1 \text{Training} + \beta_2 \text{Age} + \beta_3 \text{Years of work} + \beta_4 \text{Education} + \varepsilon \]
Suppose the researcher estimates the short regression

\[ \text{Income} = \hat{\beta}_0 + \hat{\beta}_1 \text{Training} + \varepsilon \]

rather than the full model. How would this affect the coefficient on Training? Would it be biased?
Generate a variable training that randomly assigns 0 or 1 to each observation with equal probability. First create a variable of uniform random draws between 0 and 1:

gen rand_num = runiform(0,1)
Then create the training variable:

gen training = rand_num < 0.5
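As an optional sanity check (assuming the training variable has been created as above), you can tabulate it to confirm the split is roughly 50/50:

```stata
* Frequencies of 0 and 1 should each be close to 50%
tab training
```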
Why does this code generate a variable that equals 0 or 1 with equal probability?
Run the regression of income on training:

reg income training

What is the coefficient?
Now do it with the full model:

reg income training age years_work education

What is the coefficient on training now?
clear all
* Matrix to store the estimated coefficient on training from each of 1,000 replications
mat beta_mat = J(1000, 1, .)
forval i = 1/1000 {
    set obs 1000
    * Simulate the covariates
    gen age = rnormal(30, 5)
    gen years_work = rnormal(10, 3)
    gen education = rnormal(12, 2)
    * Randomly assign training with probability 0.5
    gen rand_num = runiform(0,1)
    gen training = rand_num < 0.5
    * Data-generating process for income
    gen income = 1000 + 50*education + 100*years_work + 10*age + 500*training + rnormal(0, 1)
    * Short regression of income on training only; store the coefficient
    reg income training
    mat beta_mat[`i', 1] = _b[training]
    drop age years_work education rand_num training income
}
* Convert the stored coefficients into a variable and summarize them
svmat beta_mat
summarize beta_mat1