EC031-S26
Note: The problems may differ based on the edition of the textbook you have.
In econometrics, we are often very concerned (almost to a fault) with estimators that are unbiased. However, and this is true of IV, unbiasedness can come at a cost: increased variance. Discuss why this might be an issue, especially when it comes to hypothesis testing.
A researcher is investigating the effect of study hours on college grades. They are concerned that study hours might be correlated with unobserved factors, such as motivation, which could bias the estimates. To address this, the researcher proposes using “campus library opening hours” as an instrumental variable for study hours.
Explain why “campus library opening hours” might be a valid instrumental variable for study hours. Discuss the two key assumptions for IV validity: relevance and exogeneity.
If the researcher decided to estimate a linear regression model of college GPA on study hours directly, what might be the consequences of omitted variable bias? Would the coefficient on study hours be biased up or down?
Suppose you have data on all U.S. public schools and are interested in running a regression of a school’s average test scores (\(Y\)) on the number of teachers at the school (\(X\)).
\[ \text{Test scores}_i=\beta_0+\beta_1\text{Number of teachers}_i+\varepsilon_i \]
Explain what heteroskedasticity means (which of our CLRM assumptions is violated?), and why you might be worried about it in this model.
Suppose you run OLS, ignoring the possibility of heteroskedasticity. What is the consequence for your OLS estimate of \(\beta_1\)?
If the variance of the error terms increases with the number of teachers, will your OLS estimate, \(b_1\), be larger or smaller than the true population parameter, \(\beta_1\)?
What visual/graphical diagnostic approach could you use to detect heteroskedasticity?
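For instance, one common graphical check in Stata (a sketch; the variable names test_scores and n_teachers are hypothetical stand-ins for the model above) is to plot residuals against fitted values after running the regression:

```stata
* Hypothetical variable names for the test-score model
reg test_scores n_teachers
* Residual-versus-fitted plot; a fan or funnel shape suggests heteroskedasticity
rvfplot
```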
What is the difference between statistically significant and economically significant? Give an example of a statistically significant result that is not economically significant, and vice versa. How is this related to the idea of a “precise zero”?
Suppose we need to measure the impact of a training program on income six months after the training. We decide to design an experiment where we randomly assign individuals either to receive the training program or not. We have access to several variables that describe the individuals in the sample:
age: age of the individual
years_work: number of years working
education: number of years of education

Suppose that the data-generating process is that income is a function of training, age, years of work, and education:
\[ \text{Income} = \beta_0 + \beta_1 \text{Training} + \beta_2 \text{Age} + \beta_3 \text{Years of work} + \beta_4 \text{Education} + \varepsilon \]
Suppose the researcher estimates the short regression

\[ \text{Income} = \hat{\beta}_0 + \hat{\beta}_1 \text{Training} + \varepsilon \]

rather than the full model. How would this affect the coefficient on Training? Would it be biased?
Generate a variable training that randomly assigns 0 or 1 to each observation with equal probability. First create a variable of uniform random draws between 0 and 1:

gen rand_num = runiform(0,1)
Then create the training variable:

gen training = rand_num < 0.5
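As an optional sanity check (assuming the training variable has been created as above), you can tabulate it to confirm the split is roughly 50/50:

```stata
* Frequencies of 0 and 1 should each be close to 50%
tab training
```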
Why does this code generate a variable that equals 0 or 1 with equal probability?
Run the regression of income on training:

reg income training

What is the coefficient?
Now do it with the full model:

reg income training age years_work education

What is the coefficient on training now?
clear all
* Matrix to store the estimated coefficient on training from each of 1,000 replications
mat beta_mat = J(1000, 1, .)
forval i = 1/1000 {
    set obs 1000
    * Simulate the covariates
    gen age = rnormal(30, 5)
    gen years_work = rnormal(10, 3)
    gen education = rnormal(12, 2)
    * Randomly assign training with probability 0.5
    gen rand_num = runiform(0,1)
    gen training = rand_num < 0.5
    * Data-generating process for income
    gen income = 1000 + 50*education + 100*years_work + 10*age + 500*training + rnormal(0, 1)
    * Short regression of income on training only; store the coefficient
    reg income training
    mat beta_mat[`i', 1] = _b[training]
    drop age years_work education rand_num training income
}
* Convert the stored coefficients into a variable and summarize them
svmat beta_mat
summarize beta_mat1