Problems with Regression
Agenda
- Multicollinearity
- Including an irrelevant variable
- Heteroskedasticity (Violating A3)
- Serial correlation (Violating A4)
Multicollinearity
As discussed earlier, multicollinearity can come in two ways:
Perfect multicollinearity: when one variable is a perfect linear function of another
- This is easily fixed: remove one of the variables
- Stata will just drop one of the variables
Imperfect multicollinearity: when two or more variables are highly correlated, but not perfectly so
- This is harder
- Stata will still run the regression, but the coefficients will be less precise
- This is a problem because it makes it hard to tell which variable is actually affecting the dependent variable
Philadelphia residents' spending on monthly rent
Model:
\[ \text{rent} = \beta_0 + \beta_1 \text{hhsize} + \beta_2 \text{monthlyincome} + \beta_3 \text{monthlyearnings} + \varepsilon_i \]
What happens?
If we think that monthly earnings and monthly income are highly correlated, we can first check their correlation

- Correlation is more than 0.7.
- Rule of thumb: if correlation is more than 0.7/0.8, we should be careful.
- We can also check the variance inflation factor (VIF) to see if we have multicollinearity (not considered here)
- The VIF for a regressor is the ratio of the variance of its coefficient estimate in the full model (with all other regressors included) to the variance of the estimate when that regressor is used on its own
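A minimal pure-Python sketch of these checks, using hypothetical income/earnings numbers (with only one other regressor, the VIF reduces to \(1/(1-r^2)\)):

```python
import math

def corr(x, y):
    """Pearson correlation coefficient between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly income and earnings (earnings make up most of income)
income = [3000, 4200, 2500, 5100, 3900, 6100, 2800, 4700]
earnings = [2900, 4000, 2300, 5000, 3600, 6000, 2600, 4500]

r = corr(income, earnings)
vif = 1 / (1 - r ** 2)   # with one other regressor, R_j^2 = r^2
print(round(r, 3), round(vif, 1))   # r well above the 0.7 rule of thumb
```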
Multicollinearity
But what if we didn’t know that?
What might happen? Weird things/signs:
- Very large standard errors on variables that "should matter"
- Unexpected signs on coefficients
- When you re-estimate the model without one of the collinear variables, the \(R^2\) barely falls at all
- ...and the remaining coefficients/standard errors look as expected.


Including an irrelevant variable
Suppose the population model is:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \]
But I estimate:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i \]
where \(x_3\) is irrelevant to the model.
What happens?
- The OLS estimator is still unbiased
- \(R^2\) will be (weakly) higher, though adjusted \(R^2\) need not be
- Adding a variable increases \(k\), so your degrees of freedom \((n - k - 1)\) fall, which tends to increase the mean squared error
However:
- The variances of estimates of \(\beta_1\) and \(\beta_2\) will be larger than they would have been if you had not included \(x_3\).
- Makes it harder to reject the null hypothesis
Using \(R^2\) to test for inclusion of a variable
- If you have a model with \(k\) variables and you add a variable, \(R^2\) will always (weakly) increase, but \(R^2_{adj}\) won't necessarily
- If \(R^2_{adj}\) decreases with the inclusion of a variable, then that variable is likely irrelevant
- The variables you just added cost you more in terms of precision than they give you in terms of explanatory power
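The \(R^2\) vs. \(R^2_{adj}\) comparison can be demonstrated on simulated data. The sketch below (the data-generating process and variable names are made up for illustration) fits a model with and without an irrelevant regressor \(x_3\):

```python
import random

def ols_r2(y, X):
    """R-squared from OLS of y on X (X rows include the intercept column)."""
    n, k = len(y), len(X[0])
    # normal equations: (X'X) b = X'y
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    c = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        c[p], c[piv] = c[piv], c[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            c[r] -= f * c[p]
    b = [0.0] * k
    for p in range(k - 1, -1, -1):
        b[p] = (c[p] - sum(A[p][q] * b[q] for q in range(p + 1, k))) / A[p][p]
    ybar = sum(y) / n
    sse = sum((y[i] - sum(X[i][q] * b[q] for q in range(k))) ** 2 for i in range(n))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

def adj_r2(r2, n, k):
    # k = number of regressors, excluding the intercept
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

random.seed(1)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]          # irrelevant regressor
y = [2 + 1.5 * a + random.gauss(0, 1) for a in x1]   # true model uses only x1

r2_small = ols_r2(y, [[1.0, a] for a in x1])
r2_big = ols_r2(y, [[1.0, a, b] for a, b in zip(x1, x3)])
print(r2_small <= r2_big)   # R^2 never falls when a variable is added
print(round(adj_r2(r2_small, n, 1), 3), round(adj_r2(r2_big, n, 2), 3))
```

\(R^2\) mechanically rises with the extra regressor; \(R^2_{adj}\) penalizes the lost degree of freedom.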
Heteroskedasticity
- When assumption A3 is violated, we have heteroskedasticity.

- This means that the variance of the error term is not constant across all values of \(x\).
- There is some systematic relationship between the variance of the error term and the independent variable.
Heteroskedasticity
If a researcher uses OLS to estimate the model and doesn't realize there is heteroskedasticity:
- The OLS estimator is still unbiased
- But OLS is no longer BLUE
- So the OLS estimates are no longer minimum-variance
- Our estimated standard errors will overstate the true standard errors in some cases and understate them in others
Which is worse?
Heteroskedasticity
- When understated…
- Why?
- We are more likely to reject the null hypothesis when we shouldn’t
\[ \left| \frac{b_i}{s_{b_i}} \right| > t_{\text{critical value}} \]
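To see why understated standard errors cause over-rejection, a tiny numeric sketch (the coefficient, standard errors, and critical value are illustrative):

```python
b = 2.4                  # hypothetical coefficient estimate
t_crit = 1.96            # two-sided 5% critical value for large df

se_understated = 1.1     # the (too small) standard error OLS reports
se_true = 1.5            # the true standard error

print(abs(b / se_understated) > t_crit)  # True: we reject the null...
print(abs(b / se_true) > t_crit)         # False: ...but we shouldn't have
```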
The Goldfeld-Quandt test
- The Goldfeld-Quandt test is a test for heteroskedasticity
- The null hypothesis is that the variance of the error term is constant across all values of \(x\).
- The alternative hypothesis is that the variance of the error term is not constant across all values of \(x\).
- The test statistic is calculated as:
\[ F = \frac{MSE_2}{MSE_1} \]
where \(MSE_1\) is the mean square residual for the first half of the data and \(MSE_2\) is the mean square residual for the second half (conventionally, the group with the larger residual variance goes in the numerator).
- If the test statistic is greater than the critical value from the F-distribution, we reject the null hypothesis and conclude that there is heteroskedasticity in the data.
Procedure
- Sort the data by the independent variable
- Split the data into two groups (some versions drop a band of middle observations to sharpen the contrast)
- Estimate the model for each group
- Calculate the mean square residual for each group
- Calculate the test statistic
- Compare the test statistic to the critical value from the F-distribution
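The steps above can be sketched in Python on simulated data (the data-generating process is made up for illustration; the group with the larger residual variance goes in the numerator):

```python
import random

def mse_simple_ols(x, y):
    """Residual mean square (SSE / (n - 2)) from simple OLS of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y)) / (n - 2)

random.seed(0)
# Simulated heteroskedastic data: the error's standard deviation grows with x
x = sorted(random.uniform(1, 10) for _ in range(200))   # sorted by the regressor
y = [1 + 2 * a + random.gauss(0, a) for a in x]

half = len(x) // 2
mse_low = mse_simple_ols(x[:half], y[:half])     # first (low-x) group
mse_high = mse_simple_ols(x[half:], y[half:])    # second (high-x) group
F = mse_high / mse_low   # larger residual variance in the numerator
print(round(F, 2))       # compare with the F critical value, (half-2, half-2) df
```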
Solutions
- Typing ", robust" after the regression command in Stata
- This calculates "heteroskedasticity-robust standard errors"
- "Weighted Least Squares"
- Assigns lower weight to observations with larger error variances
- Essentially the observations that don't contain as much information
- Redefine the variables
- Using rates, per-capita values, or logs as the dependent variable
Redefining the Variables
- If the variance of the error term is not constant across all values of \(x\), we can redefine the variables to make the variance constant.
- For example, if the variance of the error term is proportional to the square of the independent variable, we can divide the entire equation (including the error term) by that variable.
- The transformed error term then has constant variance across all values of the independent variable.
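A sketch of this variance-stabilizing transformation on simulated data (the data-generating process is made up for illustration): if \(Var(\varepsilon_i) \propto x_i^2\), dividing the whole equation by \(x_i\) leaves an error \(\varepsilon_i / x_i\) with constant variance.

```python
import random

def group_mse(x, y):
    """Residual mean square from simple OLS of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y)) / (n - 2)

random.seed(2)
x = sorted(random.uniform(1, 10) for _ in range(400))
y = [1 + 2 * a + random.gauss(0, 0.5 * a) for a in x]   # sd of error grows with x

# Divide y = b0 + b1*x + eps through by x:
#   y/x = b0*(1/x) + b1 + eps/x, so regress y/x on 1/x
xstar = [1 / a for a in x]
ystar = [c / a for a, c in zip(x, y)]

# Compare residual variance in the low-x and high-x halves, before and after
h = len(x) // 2
before = group_mse(x[h:], y[h:]) / group_mse(x[:h], y[:h])
after = group_mse(xstar[h:], ystar[h:]) / group_mse(xstar[:h], ystar[:h])
print(round(before, 2), round(after, 2))   # 'after' should sit much closer to 1
```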
Serial correlation
- Also known as autocorrelation
- When the error term is correlated with itself over time or space
- Recall Assumption A4:
\[ Cov(\varepsilon_i, \varepsilon_j| x_1, x_2, \ldots, x_k) = 0 \text{ for all } i \neq j \]
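To see what a violation of A4 looks like, a small simulation of AR(1) errors (the value of rho, the sample size, and the setup are illustrative):

```python
import random

def lag1_corr(e):
    """Sample lag-1 autocorrelation of a series."""
    m = sum(e) / len(e)
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, len(e)))
    den = sum((v - m) ** 2 for v in e)
    return num / den

random.seed(3)
# AR(1) errors violate A4: eps_t = rho * eps_{t-1} + u_t
rho, n = 0.8, 500
eps = [random.gauss(0, 1)]
for _ in range(n - 1):
    eps.append(rho * eps[-1] + random.gauss(0, 1))

print(round(lag1_corr(eps), 2))   # substantially positive when rho > 0
```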
