Multiple Regression

Author

Aleksandr Michuda

Agenda

  • Multiple Regression
  • Least Squares Estimation
  • Multiple Coefficients of Determination
  • Model Assumptions
  • Hypothesis Testing
  • Categorical Independent Variables

Why Multiple Regression?

  • We used simple regression to examine the relationship between two variables.
  • Recall last week’s example:

\[ GPA = \beta_0 + \beta_1 \text{SAT} + \varepsilon \]

  • But certainly, SAT isn’t the only factor that affects GPA. What about ACT score? Or whether you are an international student?
  • We can run separate regressions for each variable:

\[ GPA = \beta_0 + \beta_1 \text{SAT} + \varepsilon \]

\[ GPA = \beta_0 + \beta_1 \text{ACT} + \varepsilon \]

\[ GPA = \beta_0 + \beta_1 \text{International} + \varepsilon \]

But how do we know whether the effect of ACT isn’t coming from some common factor with SAT?

Multiple Regression

  • Multiple regression is an extension of simple linear regression to the case of two or more independent variables.
  • This allows us to look at the relationship between the dependent variable and several independent variables.
  • The model is given by:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k + \varepsilon \]

Multiple Regression

  • The value of multiple regression is that in calculating the effect of one variable, we take into account its variability with respect to other variables.
  • We can “control” for the effect of other variables.
  • This is especially useful when we really have one variable of interest and other variables are “observables” that are important to control for.
  • For instance, if we wanted to understand the effect of SAT on GPA, we would want to control for ACT scores.

Multiple Regression

Least Squares Estimation

  • How do we actually find these new coefficients?
  • We use the same method as in simple regression: least squares estimation.
  • But because we are using multiple variables, this involves doing some matrix algebra.

\[ \min_{b} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \]

  • In matrix form, the solution is given by:

\[ b = (X'X)^{-1}X'Y \]

\[ \left(X^{\prime} X\right)_{j k}=\sum_{i=1}^n x_{i j} x_{i k} \]

  • No need to worry about the details of this formula, but it’s good to know that it exists.
  • Note two things:
    • \(X'X\) collects the cross-products of the independent variables, which capture their variances and covariances.
    • \(X'Y\) collects the cross-products of the independent variables with the dependent variable, which capture their covariances.
  • Similar to the simple regression coefficient, but the solution here uses not only the covariance of each independent variable with the dependent variable, but also the covariances of the independent variables with each other.
  • This allows us to “control” for the effect of other variables.
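The formula \(b = (X'X)^{-1}X'Y\) can be sketched in a few lines of Python with NumPy. (The course uses Stata; this is only an illustration, and the SAT/ACT data, coefficients, and variable names below are simulated, not the course dataset.)

```python
import numpy as np

# Simulated data: GPA explained by SAT and ACT (all numbers made up).
rng = np.random.default_rng(0)
n = 200
sat = rng.normal(1200, 150, n)
act = 0.02 * sat + rng.normal(0, 2, n)                 # correlated with SAT
gpa = 1.0 + 0.001 * sat + 0.02 * act + rng.normal(0, 0.2, n)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones(n), sat, act])
Y = gpa

# b = (X'X)^{-1} X'Y: solve the normal equations rather than inverting.
b = np.linalg.solve(X.T @ X, X.T @ Y)
print(b)  # [b0, b_sat, b_act]
```

Solving the normal equations directly is numerically safer than computing the inverse of \(X'X\) explicitly.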

Least Squares Estimation

See visual here

Least Squares Estimation

Interpretation of Coefficients

  • The interpretation of the coefficients is similar to that of simple regression.
  • But now we have to be careful about what the coefficients mean in relation to the other variables.

Interpretation of Coefficients

  • What is the effect of sat_total on GPA?
  • The effect is that a one unit increase in SAT leads to a 0.001 increase in GPA, holding all other variables constant.
  • Like a ceteris paribus condition.

Coefficient of Determination

\[ TSS = SSR + ESS \]

\[ \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \]

Coefficient of Determination

  • Just as before, we can calculate the coefficient of determination, \(R^2\)

\[ R^2 = \frac{SSR}{TSS} = 1 - \frac{ESS}{TSS} \]

Coefficient of Determination

  • But what happens to this simple formula when we add more variables?

  • The \(R^2\) will always increase when we add more variables, even if they are not significant.

Adjusted R-squared

\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{(n - k - 1)} \]
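As a sketch, the sum-of-squares decomposition and both \(R^2\) measures can be computed directly from a fitted regression (simulated data; the number of regressors is illustrative):

```python
import numpy as np

# Simulated data with k = 3 regressors (one of which truly has no effect).
rng = np.random.default_rng(1)
n, k = 100, 3
X = np.column_stack([np.ones(n)] + [rng.normal(size=n) for _ in range(k)])
beta = np.array([1.0, 0.5, -0.3, 0.0])
Y = X @ beta + rng.normal(0, 1, n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
Y_hat = X @ b
TSS = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
ESS = np.sum((Y - Y_hat) ** 2)          # error sum of squares
SSR = np.sum((Y_hat - Y.mean()) ** 2)   # regression sum of squares

R2 = 1 - ESS / TSS
R2_adj = 1 - (1 - R2) * (n - 1) / (n - k - 1)
print(R2, R2_adj)  # the adjusted value is always below the raw R^2
```

The decomposition \(TSS = SSR + ESS\) holds exactly here because the model includes an intercept.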

Assumptions of Multiple Regression

  • A0: The relationship between all X’s and Y is linear
  • A1: \(E(\varepsilon) = 0\)
  • A2: \(E(\varepsilon_i |X_1, \ldots, X_k) = 0\)
  • A3: Homoscedasticity, \(Var(\varepsilon_i |X_1, \ldots, X_k) = \sigma^2\)
  • A4: No autocorrelation, \(Cov(\varepsilon_i, \varepsilon_j |X_1, \ldots, X_k) = 0\)
  • A5: \(\varepsilon_i \sim N(0, \sigma^2)\)
  • A6: No multicollinearity, an \(X\) cannot be expressed as a linear combination of other \(X\)’s

Hypothesis Testing: Individual Significance

  • Like in simple regression, we can test \(\beta_j = 0\) for each coefficient.

We reject the null hypothesis if:

\[ |t_j| = \left| \frac{b_j}{SE(b_j)} \right| > t_{n-k-1, \alpha/2} \]

where \(k\) is the number of independent variables and \(n\) is the number of observations.
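A minimal sketch of these t-statistics, assuming simulated data and homoscedastic errors, where the standard errors come from \(\hat{\sigma}^2 (X'X)^{-1}\):

```python
import numpy as np

# Simulated data: the second regressor has a true effect, the third does not.
rng = np.random.default_rng(2)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 1, n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ b
sigma2 = resid @ resid / (n - k - 1)       # unbiased error variance estimate
var_b = sigma2 * np.linalg.inv(X.T @ X)    # variance-covariance matrix of b
se = np.sqrt(np.diag(var_b))
t_stats = b / se
print(t_stats)  # compare |t| with t_{n-k-1, alpha/2} (about 1.97 here at 5%)
```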

Other Non-linearities

  • Not to be confused with non-linearity in the parameters: the model must remain linear in the \(\beta\)’s.
  • We can include non-linearity in the X’s, but not in the coefficients!
  • For instance, we might think that wages rise with age, but they do so more steeply at younger ages and then reach some peak.
  • In that case, we can allow for that and test it, by including a quadratic term:

\[ wage = \beta_0 + \beta_1 age + \beta_2 age^2 + \varepsilon \]

  • what would be the partial effect of age on wage?

\[ \frac{\partial wage}{\partial age} = \beta_1 + 2 \beta_2 age \]
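With made-up coefficient values, the partial effect can be evaluated at different ages; it shrinks with age and hits zero at the peak age \(-\beta_1/(2\beta_2)\):

```python
# Illustrative only: b1 and b2 are made-up coefficients, not estimates.
# wage = b0 + b1*age + b2*age^2, so d(wage)/d(age) = b1 + 2*b2*age.
b1, b2 = 2.0, -0.02

def partial_effect(age):
    return b1 + 2 * b2 * age

peak_age = -b1 / (2 * b2)  # where the partial effect crosses zero
print(partial_effect(25), partial_effect(60), peak_age)
```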

Non-linearity

Non-linearity: Dummy Variables

  • A dummy variable is a variable that takes the value of 1 or 0.

Ex: female = 1 if female, 0 if not

\[ lnwage = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot female + \varepsilon_i \]

How do we interpret this coefficient?

\[ \beta_2 = \frac{\partial E(lnwage| educ,female)}{\partial female} \]

\[ \beta_2 = E(lnwage | educ, female=1) - E(lnwage | educ, female=0) \]

  • Recall that \(\beta_0\) is the expected value when all independent variables are 0.
  • So \(\beta_0 + \beta_2 = E(lnwage | educ=0, female=1)\)

Categorical Variables as Dummy Variables

  • Sometimes we have categorical variables that take on more than two values.
  • This might be industry codes, educational attainment, or race.
  • In this case, it makes the most sense to create a dummy variable for each category and see the specific effect of each category.

Interpretation

  • The interpretation of the coefficients is similar to that of a regular dummy variable.

  • But can we include all categories?

  • We can’t include all categories because of multicollinearity.

  • If we include all categories and the categories are mutually exclusive and exhaustive, then we can express one category as a linear combination of the others.

Example:

Industries:

  1. Agriculture_dummy
  2. Manufacturing_dummy
  3. Services_dummy

We KNOW that:

\[ \text{Agriculture\_dummy} + \text{Manufacturing\_dummy} + \text{Services\_dummy} = 1 \]
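This linear dependence can be verified numerically: with an intercept plus a dummy for every category, the intercept column equals the sum of the dummy columns, so \(X'X\) loses rank. A small NumPy sketch with made-up industry assignments:

```python
import numpy as np

# 50 observations spread across the three industries (assignments made up).
industry = np.repeat([0, 1, 2], [17, 17, 16])  # 0=Agriculture, 1=Manufacturing, 2=Services
dummies = np.eye(3)[industry]                  # one dummy column per category

X_bad = np.column_stack([np.ones(50), dummies])        # intercept + all 3 dummies
X_ok = np.column_stack([np.ones(50), dummies[:, 1:]])  # drop the base category

print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 3, not 4: singular, cannot invert
print(np.linalg.matrix_rank(X_ok.T @ X_ok))    # 3: full rank
```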

Interpretation

  • This is assumption A6. Stata cannot handle perfect multicollinearity, so it will drop one of the categories.
  • It breaks down the matrix algebra required for finding \(b\), since \(X'X\) cannot be inverted.
  • What happens when we drop a category is that the dropped category becomes the “base” category.
  • The coefficients of the other categories are interpreted as the difference between that category and the base category.
  • Recall that \(\beta_0\) is the expected value when all X’s are 0.
  • In this case, \(\beta_0\) is the expected value when all other X’s are 0 and the categorical variable is at the base category.

Interpretation

  • So the interpretation of hsgrad is an effect of 0.24 relative to finishing less than high school.

Interaction Variables

  • For our effect of female, the interpretation is that, relative to not being female, females earn around 27% less.
  • That is a level effect. But what if being female also changed the effect of another variable, like education?
  • Not only would women make less on average, but they would also earn less from an extra year of education than men.
  • For this we can include an interaction term:

\[ lnwage = \beta_0 + \beta_1 \cdot educ + \beta_2 \cdot female + \beta_3 \cdot female \cdot educ + \varepsilon_i \]

So the partial effect of education on wage is:

\[ \frac{\partial lnwage}{\partial education} = \beta_1 + \beta_3 \cdot female \]

So

\[ \beta_1 \text{ if female =0 } \]

\[ \beta_1 + \beta_3 \text{ if female=1} \]

To test whether the interaction is significant, we can test if \(\beta_3 = 0\).
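With made-up coefficient values, the two partial effects can be written out directly:

```python
# Illustrative only: b1 and b3 are made-up coefficients, not estimates.
# lnwage = b0 + b1*educ + b2*female + b3*female*educ
# => d(lnwage)/d(educ) = b1 + b3*female
b1, b3 = 0.08, -0.02

def educ_effect(female):
    return b1 + b3 * female

print(educ_effect(0))  # return to education when female = 0: b1
print(educ_effect(1))  # return to education when female = 1: b1 + b3
```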

Hypothesis Test on Several Coefficients

  • What is joint significance?
  • Essentially, a generalization of the t-test.
  • We can test whether a group of coefficients are equal to 0.

Example:

\[ H_0: \beta_2 =0 \text{ and } \beta_3 = 0 \]

\[ H_1: \beta_2 \neq 0 \text{ or } \beta_3 \neq 0 \]

  • NOT the same as testing each separately.
  • Why?

Hypothesis Test on Several Coefficients

  • Involves running two regressions:
  • One ASSUMING that the null hypothesis is true (the restricted regression)
  • The other is our regression (unrestricted regression)
  • Then “comparing” them by explanatory power.

Hypothesis Test on Several Coefficients

Hypothesis Test on Several Coefficients

  • Procedure
  1. Run the unrestricted regression and calculate \(R^2_{unrestricted}\)
  2. Run the restricted regression and calculate \(R^2_{restricted}\)
  3. Calculate the F-statistic:

\[ F = \frac{(SSE_r - SSE_u)/m}{SSE_u/(n - k - 1)} \]

where \(m\) is the number of restrictions.

The test asks if the unrestricted model “does a better job” of explaining the data than the restricted model.

If so, reject the null hypothesis.
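The whole procedure can be sketched with simulated data (in Stata, the `test` command reports this F-statistic after a regression; all data and coefficients below are made up):

```python
import numpy as np

# Simulated data: test H0 that the coefficients on X2 and X3 are both 0.
rng = np.random.default_rng(4)
n = 120
X1, X2, X3 = rng.normal(size=(3, n))
Y = 1 + 0.5 * X1 + 0.4 * X2 + 0.3 * X3 + rng.normal(0, 1, n)

def sse(X, Y):
    """Error sum of squares from an OLS fit of Y on X."""
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    r = Y - X @ b
    return r @ r

ones = np.ones(n)
SSE_u = sse(np.column_stack([ones, X1, X2, X3]), Y)  # unrestricted
SSE_r = sse(np.column_stack([ones, X1]), Y)          # restricted: beta2 = beta3 = 0

m, k = 2, 3  # number of restrictions, number of regressors
F = ((SSE_r - SSE_u) / m) / (SSE_u / (n - k - 1))
print(F)  # compare with the F(m, n-k-1) critical value
```

The restricted model can never fit better than the unrestricted one, so the numerator is always non-negative.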

F-Distribution

  • The F-distribution is a ratio of two chi-squared distributions.
  • The F-distribution is a right-skewed distribution.

Example

Back to the example

Relationship between F-test and \(R^2\)

We can rewrite the F-statistic as:

\[ F = \frac{(R^2_{unrestricted} - R^2_{restricted})/m}{(1 - R^2_{unrestricted})/(n - k - 1)} \]

Can we estimate multiple regression as simple regressions?

  • I wasn’t completely truthful before about simple vs. multiple regression.
  • We can estimate multiple regression as a series of simple regressions.
  • But we have to be clever about how we do it.
  • We need to take into account the covariances between the independent variables.
  • This is called the Frisch-Waugh-Lovell theorem.

Frisch-Waugh-Lovell theorem

Let’s take a look at the following regression:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \]

Let’s say that we wanted to estimate the effect of \(X_1\) on \(Y\).

  • We can run a regression of \(Y\) on \(X_2\) and get the residuals:

\[ Y = \alpha_0 + \alpha_1 X_2 + \varepsilon_{Y} \]

The residuals in this case are the part of \(Y\) that is not explained by \(X_2\). We have “purged” \(Y\) of the effect of \(X_2\).

  • We can also run a regression of \(X_1\) on \(X_2\) and get the residuals: \[ X_1 = \gamma_0 + \gamma_1 X_2 + \varepsilon_{X1} \]

The residuals in this case are the part of \(X_1\) that is not explained by \(X_2\). We have “purged” \(X_1\) of the effect of \(X_2\).

  • We can then run a regression of the residuals of \(Y\) on the residuals of \(X_1\):

\[ Y_{res} = \beta_0 + \beta_1 X_{1res} + \varepsilon \]

  • This is equivalent to running the original regression.
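The equivalence can be checked numerically with simulated data (names and coefficients below are made up): the slope on \(X_1\) from the full regression matches the slope from the residual-on-residual regression.

```python
import numpy as np

# Simulated data where X1 and X2 are correlated, so controlling matters.
rng = np.random.default_rng(5)
n = 150
X2 = rng.normal(size=n)
X1 = 0.6 * X2 + rng.normal(size=n)
Y = 1 + 2 * X1 - 1 * X2 + rng.normal(size=n)

ones = np.ones(n)

def resid(Y, X):
    """Residuals from an OLS fit of Y on X."""
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    return Y - X @ b

# Full regression: Y on [1, X1, X2].
b_full = np.linalg.lstsq(np.column_stack([ones, X1, X2]), Y, rcond=None)[0]

# FWL: purge X2 from both Y and X1, then regress residual on residual.
Y_res = resid(Y, np.column_stack([ones, X2]))
X1_res = resid(X1, np.column_stack([ones, X2]))
b_fwl = (X1_res @ Y_res) / (X1_res @ X1_res)

print(b_full[1], b_fwl)  # identical up to floating-point error
```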

But!

  • The coefficients will be the same, but the standard errors will differ.
  • This is because the residual-on-residual regression does not account for the degrees of freedom used up in the first-stage regressions.

Nice Trick but… why?

  • This is a nice trick, but why would we want to do this?
  • With multiple regression, it becomes difficult to visualize the relationship between the dependent and independent variables.
  • Simple regression is a 2D plot.
  • With FWL, we can visualize the relationship between the dependent variable and one independent variable, while controlling for the other independent variables.

Example

Do-file Example
