Problems with Regression
Agenda
- Multicollinearity
- Including an irrelevant variable
- Heteroskedasticity (Violating A3)
- Serial correlation (Violating A4)
Multicollinearity
As discussed earlier, multicollinearity can come in two ways:
Perfect multicollinearity: when one variable is a perfect linear function of another
- This is easily fixed: remove one of the variables
- Stata will just drop one of the variables
Imperfect multicollinearity: when two or more variables are highly correlated, but not perfectly so
- This is harder
- Stata will still run the regression, but the coefficients will be less precise
- This is a problem because it makes it hard to tell which variable is actually affecting the dependent variable
Philadelphia residents' spending on monthly rent
Model:
\[ \text{rent} = \beta_0 + \beta_1 \text{hhsize} + \beta_2 \text{monthlyincome} + \beta_3 \text{monthlyearnings} + \varepsilon_i \]
What happens?
If we think that monthly earnings and monthly income are highly correlated, we can first check their correlation

- Correlation is more than 0.7.
- Rule of thumb: if correlation is more than 0.7/0.8, we should be careful.
- We can also check the variance inflation factor (VIF) to see if we have multicollinearity (not considered here)
- The VIF for a regressor is the ratio of the variance of its coefficient estimate in the full model (with all other regressors included) to the variance of the estimate when that regressor is used on its own
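A minimal pure-Python sketch of these checks, using hypothetical income/earnings numbers (with only one other regressor, the VIF reduces to \(1/(1-r^2)\)):

```python
import math

def corr(x, y):
    """Pearson correlation coefficient between two lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly income and earnings (earnings make up most of income)
income = [3000, 4200, 2500, 5100, 3900, 6100, 2800, 4700]
earnings = [2900, 4000, 2300, 5000, 3600, 6000, 2600, 4500]

r = corr(income, earnings)
vif = 1 / (1 - r ** 2)   # with one other regressor, R_j^2 = r^2
print(round(r, 3), round(vif, 1))   # r well above the 0.7 rule of thumb
```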
Multicollinearity
But what if we didn’t know that?
What might happen? Weird things/signs:
- Very large standard errors on variables that "should matter"
- Unexpected signs on coefficients
- When you re-estimate the model without one of the collinear variables, the \(R^2\) barely falls at all
- ...and the remaining coefficients/standard errors look as expected.


Including an irrelevant variable
Suppose the population model is:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i \]
But I estimate:
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i \]
where \(x_3\) is irrelevant to the model.
What happens?
- The OLS estimator is still unbiased
- \(R^2\) will be (weakly) higher, though adjusted \(R^2\) need not be
- Adding a variable increases \(k\), so your degrees of freedom \((n - k - 1)\) fall, which tends to increase the mean squared error
However:
- The variances of estimates of \(\beta_1\) and \(\beta_2\) will be larger than they would have been if you had not included \(x_3\).
- Makes it harder to reject the null hypothesis
Using \(R^2\) to test for inclusion of a variable
- If you have a model with \(k\) variables and you add a variable, \(R^2\) will always (weakly) increase, but \(R^2_{adj}\) won't necessarily
- If \(R^2_{adj}\) decreases with the inclusion of a variable, then that variable is likely irrelevant
- The variables you just added cost you more in terms of precision than they give you in terms of explanatory power
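The \(R^2\) vs. \(R^2_{adj}\) comparison can be demonstrated on simulated data. The sketch below (the data-generating process and variable names are made up for illustration) fits a model with and without an irrelevant regressor \(x_3\):

```python
import random

def ols_r2(y, X):
    """R-squared from OLS of y on X (X rows include the intercept column)."""
    n, k = len(y), len(X[0])
    # normal equations: (X'X) b = X'y
    A = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)] for p in range(k)]
    c = [sum(X[i][p] * y[i] for i in range(n)) for p in range(k)]
    # Gaussian elimination with partial pivoting
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        c[p], c[piv] = c[piv], c[p]
        for r in range(p + 1, k):
            f = A[r][p] / A[p][p]
            for q in range(p, k):
                A[r][q] -= f * A[p][q]
            c[r] -= f * c[p]
    b = [0.0] * k
    for p in range(k - 1, -1, -1):
        b[p] = (c[p] - sum(A[p][q] * b[q] for q in range(p + 1, k))) / A[p][p]
    ybar = sum(y) / n
    sse = sum((y[i] - sum(X[i][q] * b[q] for q in range(k))) ** 2 for i in range(n))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - sse / sst

def adj_r2(r2, n, k):
    # k = number of regressors, excluding the intercept
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

random.seed(1)
n = 50
x1 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]          # irrelevant regressor
y = [2 + 1.5 * a + random.gauss(0, 1) for a in x1]   # true model uses only x1

r2_small = ols_r2(y, [[1.0, a] for a in x1])
r2_big = ols_r2(y, [[1.0, a, b] for a, b in zip(x1, x3)])
print(r2_small <= r2_big)   # R^2 never falls when a variable is added
print(round(adj_r2(r2_small, n, 1), 3), round(adj_r2(r2_big, n, 2), 3))
```

\(R^2\) mechanically rises with the extra regressor; \(R^2_{adj}\) penalizes the lost degree of freedom.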
Heteroskedasticity
- When assumption A3 is violated, we have heteroskedasticity.

- This means that the variance of the error term is not constant across all values of \(x\).
- There is some systematic relationship between the variance of the error term and the independent variable.
Heteroskedasticity
If a researcher uses OLS to estimate the model and doesn't realize there is heteroskedasticity:
- The OLS estimator is still unbiased
- But OLS is no longer BLUE
- So the OLS estimates are no longer minimum-variance
- Our estimated standard errors will overstate the true standard errors in some cases and understate them in others
Which is worse?
Heteroskedasticity
- When understated…
- Why?
- We are more likely to reject the null hypothesis when we shouldn’t
\[ \left| \frac{b_i}{s_{b_i}} \right| > t_{\text{critical value}} \]
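To see why understated standard errors cause over-rejection, a tiny numeric sketch (the coefficient, standard errors, and critical value are illustrative):

```python
b = 2.4                  # hypothetical coefficient estimate
t_crit = 1.96            # two-sided 5% critical value for large df

se_understated = 1.1     # the (too small) standard error OLS reports
se_true = 1.5            # the true standard error

print(abs(b / se_understated) > t_crit)  # True: we reject the null...
print(abs(b / se_true) > t_crit)         # False: ...but we shouldn't have
```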
The Goldfeld-Quandt test
- The Goldfeld-Quandt test is a test for heteroskedasticity
- The null hypothesis is that the variance of the error term is constant across all values of \(x\).
- The alternative hypothesis is that the variance of the error term is not constant across all values of \(x\).
- The test statistic is calculated as:
\[ F = \frac{MSE_2}{MSE_1} \]
where \(MSE_1\) is the mean square residual for the first half of the data and \(MSE_2\) is the mean square residual for the second half (conventionally, the group with the larger residual variance goes in the numerator).
- If the test statistic is greater than the critical value from the F-distribution, we reject the null hypothesis and conclude that there is heteroskedasticity in the data.
Procedure
- Sort the data by the independent variable
- Split the data into two groups (some versions drop a band of middle observations to sharpen the contrast)
- Estimate the model for each group
- Calculate the mean square residual for each group
- Calculate the test statistic
- Compare the test statistic to the critical value from the F-distribution
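The steps above can be sketched in Python on simulated data (the data-generating process is made up for illustration; the group with the larger residual variance goes in the numerator):

```python
import random

def mse_simple_ols(x, y):
    """Residual mean square (SSE / (n - 2)) from simple OLS of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y)) / (n - 2)

random.seed(0)
# Simulated heteroskedastic data: the error's standard deviation grows with x
x = sorted(random.uniform(1, 10) for _ in range(200))   # sorted by the regressor
y = [1 + 2 * a + random.gauss(0, a) for a in x]

half = len(x) // 2
mse_low = mse_simple_ols(x[:half], y[:half])     # first (low-x) group
mse_high = mse_simple_ols(x[half:], y[half:])    # second (high-x) group
F = mse_high / mse_low   # larger residual variance in the numerator
print(round(F, 2))       # compare with the F critical value, (half-2, half-2) df
```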
Solutions
- Typing ", robust" after the regression command in Stata
- This calculates "heteroskedasticity-robust standard errors"
- "Weighted Least Squares"
- Assigns lower weight to observations with larger error variances
- Essentially the observations that don't contain as much information
- Redefine the variables
- Using rates, per-capita values, or logs as the dependent variable
Redefining the Variables
- If the variance of the error term is not constant across all values of \(x\), we can redefine the variables to make the variance constant.
- For example, if the variance of the error term is proportional to the square of the independent variable, we can divide the entire equation (including the error term) by that variable.
- The transformed error term then has constant variance across all values of the independent variable.
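A sketch of this variance-stabilizing transformation on simulated data (the data-generating process is made up for illustration): if \(Var(\varepsilon_i) \propto x_i^2\), dividing the whole equation by \(x_i\) leaves an error \(\varepsilon_i / x_i\) with constant variance.

```python
import random

def group_mse(x, y):
    """Residual mean square from simple OLS of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((c - (b0 + b1 * a)) ** 2 for a, c in zip(x, y)) / (n - 2)

random.seed(2)
x = sorted(random.uniform(1, 10) for _ in range(400))
y = [1 + 2 * a + random.gauss(0, 0.5 * a) for a in x]   # sd of error grows with x

# Divide y = b0 + b1*x + eps through by x:
#   y/x = b0*(1/x) + b1 + eps/x, so regress y/x on 1/x
xstar = [1 / a for a in x]
ystar = [c / a for a, c in zip(x, y)]

# Compare residual variance in the low-x and high-x halves, before and after
h = len(x) // 2
before = group_mse(x[h:], y[h:]) / group_mse(x[:h], y[:h])
after = group_mse(xstar[h:], ystar[h:]) / group_mse(xstar[:h], ystar[:h])
print(round(before, 2), round(after, 2))   # 'after' should sit much closer to 1
```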
Serial correlation
- Also known as autocorrelation
- When the error term is correlated with itself over time or space
- Recall Assumption A4:
\[ Cov(\varepsilon_i, \varepsilon_j| x_1, x_2, \ldots, x_k) = 0 \text{ for all } i \neq j \]
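To see what a violation of A4 looks like, a small simulation of AR(1) errors (the value of rho, the sample size, and the setup are illustrative):

```python
import random

def lag1_corr(e):
    """Sample lag-1 autocorrelation of a series."""
    m = sum(e) / len(e)
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, len(e)))
    den = sum((v - m) ** 2 for v in e)
    return num / den

random.seed(3)
# AR(1) errors violate A4: eps_t = rho * eps_{t-1} + u_t
rho, n = 0.8, 500
eps = [random.gauss(0, 1)]
for _ in range(n - 1):
    eps.append(rho * eps[-1] + random.gauss(0, 1))

print(round(lag1_corr(eps), 2))   # substantially positive when rho > 0
```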
