Regression II

Aleksandr Michuda

Group Outlines Due

  • Project Outlines due April 3rd
  • Structure:
    • Introduction and Motivation
    • Data
      • How easily available is it?
      • Did you decide to use an alternative dataset?
    • Methods
      • What does the paper use?
      • How available is replication code?
      • Describe what the method is doing
    • Additions/Extensions

Agenda

  • Sampling Distribution of \(b_1\)
  • Confidence Intervals in Regression
  • Hypothesis Testing in Regression
  • Residual Analysis
  • Validating Assumptions
  • Outliers and Influential Observations

Review

\[ b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

  • If we can show that \(b_1\) is normally distributed
  • Then we can standardize \(b_1\) for use in confidence intervals and hypothesis tests as long as we know its expectation and variance.

The Mean and Variance of \(b_1\)

  • We know from previous lectures that

\[ E(b_1) = \beta_1 \]

  • Now what is the variance of the estimator?

\[ Var(b_1) = \frac{\sigma^2_{\varepsilon}}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

Unknown \(\sigma^2_{\varepsilon}\)

  • Once again, we don’t know \(\sigma^2_{\varepsilon}\)
  • So we use its sample analog, the estimated error variance
  • For this we can use

\[ s^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2} \]

So the standard error of \(b_1\) is:

\[ s_{b_1} = \sqrt{\frac{\frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2}}{\sum_{i=1}^n (x_i - \bar{x})^2}} \]
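The formulas above can be sketched directly. A minimal example on simulated data (the coefficients, seed, and sample size are made up for illustration):

```python
# Compute the OLS slope and its standard error from the formulas above,
# using simulated data (assumed model: y = 2 + 0.5x + eps).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10, 2, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)

# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

# s^2 = sum((y - yhat)^2) / (n - 2), then s_{b1} = sqrt(s^2 / sxx)
resid = y - (b0 + b1 * x)
s2 = np.sum(resid ** 2) / (n - 2)
se_b1 = np.sqrt(s2 / sxx)
print(b1, se_b1)
```

With this setup the estimated slope lands near the true 0.5, and the standard error shrinks as \(n\) or the spread of \(x\) grows, exactly as the denominator \(\sum_i (x_i - \bar{x})^2\) suggests.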

How is \(b_1\) Distributed?

  • Let’s take a look:

\[ b_0 = \bar{y} - b_1\bar{x} \]

\[ b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})y_i}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

  • … and expand \(y_i\) using the population model (recall the centering trick: \(\sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i (x_i - \bar{x})y_i\)):

\[ = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1x_i + \varepsilon_i)}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

\[ = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

  • (The \(\beta_0\) term drops out because \(\sum_i (x_i - \bar{x}) = 0\), and the \(\beta_1\) term simplifies because \(\sum_i (x_i - \bar{x})x_i = \sum_i (x_i - \bar{x})^2\).)

Distribution of \(b_1\)

  • By A5, we know that:

\[ \varepsilon_i | x_i \sim N(0, \sigma^2_{\varepsilon}) \]

  • If we assume that \(x_i\) are fixed, then A5 is really just:

\[ \varepsilon_i \sim N(0, \sigma^2_{\varepsilon}) \]

  • Then \(b_1\) is a linear combination of normally distributed random variables, and thus is also normally distributed.

\[ b_1 \sim N\left(\beta_1, \frac{\sigma^2_{\varepsilon}}{\sum_{i=1}^n (x_i - \bar{x})^2}\right) \]
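A quick Monte Carlo sketch of this result, on simulated data (the parameter values and seed are arbitrary): holding \(x\) fixed, we draw many samples of \(\varepsilon\) and check that the slopes are centered at \(\beta_1\) with the variance derived above.

```python
# Sampling distribution of b1: repeatedly redraw the errors with x fixed,
# and compare the empirical mean/variance of b1 to the theory above.
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
beta0, beta1, sigma = 1.0, 0.5, 1.0
x = rng.normal(0, 1, n)              # fixed regressors
sxx = np.sum((x - x.mean()) ** 2)

slopes = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    slopes[r] = np.sum((x - x.mean()) * (y - y.mean())) / sxx

print(slopes.mean())                 # ~ beta1
print(slopes.var())                  # ~ sigma^2 / sxx
```

A histogram of `slopes` would also look bell-shaped, consistent with the normality claim.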

Confidence Intervals

Then the \(1-\alpha\) confidence interval for \(\beta_1\) is:

\[ b_1 \pm t_{\alpha/2, n-2}s_{b_1} \]

And the two-sided hypothesis test is then:

\[ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0 \]

We reject \(H_0\) if:

\[ \left|\frac{b_1 - 0}{s_{b_1}}\right| > t_{\alpha/2, n-2} \]
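The interval and test can be carried out by hand on simulated data. A sketch (assumed true slope 0.8; scipy supplies the \(t_{\alpha/2, n-2}\) critical value):

```python
# 95% confidence interval for beta1 and the t-test of H0: beta1 = 0,
# on simulated data with true slope 0.8.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100
x = rng.normal(5, 2, n)
y = 3.0 + 0.8 * x + rng.normal(0, 1, n)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
se_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)    # t_{alpha/2, n-2}
ci = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)   # 95% CI for beta1
t_stat = b1 / se_b1                             # H0: beta1 = 0
reject = abs(t_stat) > tcrit
print(ci, t_stat, reject)
```

Since the true slope is far from zero relative to its standard error, the null is rejected here.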

Interpreting Coefficients

  • The coefficient \(b_1\) is the expected change in \(y\) for a one unit change in \(x\).

\[ b_1 = \frac{\Delta y}{\Delta x} \]

  • Depending on \(y\) and \(x\) this will have different interpretations.

Example


. regress y X

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(1, 998)       =      9.38
       Model |  9.27810127         1  9.27810127   Prob > F        =    0.0022
    Residual |   986.85677       998  .988834439   R-squared       =    0.0093
-------------+----------------------------------   Adj R-squared   =    0.0083
       Total |  996.134871       999  .997132003   Root MSE        =     .9944

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           X |   .0009776   .0003192     3.06   0.002     .0003513    .0016039
       _cons |   6.397407   .3834314    16.68   0.000     5.644982    7.149831
------------------------------------------------------------------------------


Logarithmic Transformations

  • Logarithmic transformations can be useful in regression.
    • If the relationship between \(X\) and \(y\) is non-linear, logging can help linearize it
    • Logging also helps with skewed data
    • In some cases you may want to interpret changes in percentage terms
  • For instance, if we have a model:

\[ y = \beta_0 + \beta_1 X + \varepsilon \]

  • We can take the natural logs of either \(y\) or \(X\)

  • The interpretation of the coefficients changes: they can be read in percentage terms.

  • This is because, for small changes, a change in a logarithm approximates a percentage change
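A quick numeric check of this approximation, using made-up numbers:

```python
# For small changes, the change in log(y) is close to the percentage change in y.
import math

y0, y1 = 100.0, 105.0
pct_change = (y1 - y0) / y0                  # 0.05, i.e. 5%
log_change = math.log(y1) - math.log(y0)     # close to 0.05
print(pct_change, log_change)
```

The gap between the two grows as the change gets larger, which is why the percentage interpretation is only an approximation.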

Log Transformation of \(y\)

  • If we have a model:

\[ \log(y) = \beta_0 + \beta_1X + \varepsilon \]

  • Then the interpretation of \(\beta_1\) is:

\[ \frac{\Delta \log(y)}{\Delta X} \approx \frac{\% \Delta y}{\Delta X} \]

Example


. gen log_y = log(y)

. regress log_y X

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(1, 998)       =     10.22
       Model |  .186858023         1  .186858023   Prob > F        =    0.0014
    Residual |  18.2403713       998  .018276925   R-squared       =    0.0101
-------------+----------------------------------   Adj R-squared   =    0.0091
       Total |  18.4272293       999  .018445675   Root MSE        =    .13519

------------------------------------------------------------------------------
       log_y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
           X |   .0001387   .0000434     3.20   0.001     .0000536    .0002239
       _cons |   1.848797   .0521288    35.47   0.000     1.746503    1.951092
------------------------------------------------------------------------------


Log Transformation of \(X\)

  • If we have a model:

\[ y = \beta_0 + \beta_1\log(X) + \varepsilon \]

  • Then the interpretation of \(\beta_1\) is:

\[ \frac{\Delta y}{\Delta \log(X)} \approx \frac{\Delta y}{\% \Delta X} \]

Example


. gen log_X = log(X)

. regress y log_X

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(1, 998)       =      9.89
       Model |  9.77239463         1  9.77239463   Prob > F        =    0.0017
    Residual |  986.362477       998  .988339155   R-squared       =    0.0098
-------------+----------------------------------   Adj R-squared   =    0.0088
       Total |  996.134871       999  .997132003   Root MSE        =    .99415

------------------------------------------------------------------------------
           y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       log_X |    1.19116   .3788109     3.14   0.002     .4478024    1.934517
       _cons |  -.8707322   2.683844    -0.32   0.746    -6.137357    4.395893
------------------------------------------------------------------------------


Log Transformation of \(X\) and \(y\)

  • If we have a model:

\[ \log(y) = \beta_0 + \beta_1\log(X) + \varepsilon \]

  • Then the interpretation of \(\beta_1\) is:

\[ \frac{\Delta \log(y)}{\Delta \log(X)} \approx \frac{\% \Delta y}{\% \Delta X} \]

  • This is known as the elasticity of \(y\) with respect to \(X\), the same elasticity concept you learned about in EC1.
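A sketch of the elasticity interpretation on simulated data (the true elasticity of 0.3 and the seed are assumptions for illustration):

```python
# Generate data where log(y) is linear in log(X) with slope 0.3;
# the OLS slope of log(y) on log(X) recovers this elasticity.
import numpy as np

rng = np.random.default_rng(7)
n = 2000
X = rng.uniform(1, 100, n)
log_y = 0.5 + 0.3 * np.log(X) + rng.normal(0, 0.1, n)

lx = np.log(X)
b1 = np.sum((lx - lx.mean()) * (log_y - log_y.mean())) / np.sum((lx - lx.mean()) ** 2)
print(b1)   # ~ 0.3: a 1% increase in X predicts roughly a 0.3% increase in y
```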

Example


. regress log_y log_X

      Source |       SS           df       MS      Number of obs   =     1,000
-------------+----------------------------------   F(1, 998)       =     10.74
       Model |   .19621207         1   .19621207   Prob > F        =    0.0011
    Residual |  18.2310173       998  .018267552   R-squared       =    0.0106
-------------+----------------------------------   Adj R-squared   =    0.0097
       Total |  18.4272293       999  .018445675   Root MSE        =    .13516

------------------------------------------------------------------------------
       log_y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       log_X |   .1687844   .0515003     3.28   0.001     .0677231    .2698457
       _cons |   .8191735   .3648753     2.25   0.025     .1031627    1.535184
------------------------------------------------------------------------------


Binary Variables as Independent Variables

  • In some cases, \(X\) is a categorical variable.
    • 1 if male, 0 if female
    • 1 if treated, 0 if control
    • 1 if college educated, 0 if not
    • etc…
  • What is the interpretation of \(b_1\) in this case?

\[ y = \beta_0 + \beta_1 Female + \varepsilon \]

Interpretation

  • Let’s take expectations to find out:

\[ E(y|Female=1) = \beta_0 + \beta_1 \]

\[ E(y|Female=0) = \beta_0 \]

So \(\beta_1\) is:

\[ \beta_1 = E(y|Female=1) - E(y|Female=0) \]

  • What is the interpretation of \(\beta_0\)?
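We can verify this numerically. A sketch on simulated data (the group means and seed are made up): with a binary regressor, the OLS slope is exactly the difference in group means, and the intercept is the mean of the \(Female = 0\) group.

```python
# With a binary regressor, b1 equals the difference in group means
# and b0 equals the mean of the zero group.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
female = rng.integers(0, 2, n)                # 1 if female, 0 otherwise
y = 10.0 + 2.0 * female + rng.normal(0, 1, n)

sxx = np.sum((female - female.mean()) ** 2)
b1 = np.sum((female - female.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * female.mean()

diff_means = y[female == 1].mean() - y[female == 0].mean()
print(b1, diff_means)   # identical (up to floating point)
print(b0)               # mean of y when Female = 0
```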

Encoding Categorical/Binary Variables

  • Sometimes, we can also encode a continuous variable as a categorical variable.
  • Say you want to look at the effect of drought on health outcomes.
    • Usually, drought is measured as a continuous variable
    • There are many ways to measure drought severity, combining variables on rainfall, temperature, vegetation indices, etc…
    • Let’s take a drought index that only incorporates rainfall
    • The SPI

The SPI

  • The Standardized Precipitation Index (SPI) is a commonly used index to measure drought severity.
  • The SPI is calculated by fitting a probability distribution to historical precipitation data for a location, and then transforming the data to a standard normal distribution.
  • The SPI can be calculated for different time scales, such as 1 month, 3 months, 6 months, etc…
  • The SPI values can then be categorized into different drought severity levels.

Source: https://climatedataguide.ucar.edu/

Using Categorical/Binary Variables in Regression

  • The SPI ranges from approximately -3 to +3.

  • Let’s say you want to understand the effect of drought severity on malnutrition.
  • What would using the SPI directly look like?

\[ MN_i = \beta_0 + \beta_1 SPI_i + \varepsilon_i \]

  • What would the interpretation of \(\beta_1\) be?
  • Does this tell you anything about drought?

Using Categorical/Binary Variables in Regression

  • Instead, you can create categorical variables for drought severity.
  • You can say that if SPI < -2, then extreme drought
  • So generate a variable that is 1 if extreme drought, 0 otherwise.
    SPI   Drought
   -2.5         1
      1         0
     -1         0
    1.2         0
      3         0

\[ MN_i = \beta_0 + \beta_1 Drought_i + \varepsilon_i \]

Now what is the interpretation of \(\beta_1\)?

\[ \beta_1 = E(MN|Drought=1) - E(MN|Drought=0) \]
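The encoding step itself is a one-liner. A sketch using hypothetical SPI values:

```python
# Flag extreme drought when SPI < -2, producing a 0/1 variable.
import numpy as np

spi = np.array([-2.5, 1.0, -1.0, 1.2, 3.0])
drought = (spi < -2).astype(int)
print(drought)   # [1 0 0 0 0]
```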

Binary Variables as the Dependent Variable

  • What if the dependent variable is binary?
    • 1 if employed, 0 if not
    • 1 if college educated, 0 if not
    • 1 if voted, 0 if not
  • Can we still use linear regression?
  • Yes, but it would now be called a Linear Probability Model (LPM).

Binary Dependent Variable

  • Let’s say you wanted to model employment status:

\[ Employed_i = \beta_0 + \beta_1 CollegeEducated_i + \varepsilon_i \]

What’s the conditional expectation function of this model?

\[ E(Employed|CollegeEducated) = \beta_0 + \beta_1 CollegeEducated \]

  • Since Employed is binary, what does \(E(Employed|CollegeEducated)\) represent?
  • An easy way to see this is by using the definition of the expectation of a discrete random variable:

\[ E(Employed|CollegeEducated) = \sum_{j=0}^1 P(Employed=j|CollegeEducated) \cdot j \]

\[ = P(Employed=1|CollegeEducated) \cdot 1 + P(Employed=0|CollegeEducated) \cdot 0 \]

\[ = P(Employed=1|CollegeEducated) \]

  • So the conditional expectation function is the probability of being employed given college education.
  • So the interpretation of \(\beta_1\) is:
    • A one unit increase in CollegeEducated (i.e. going from not college educated to college educated) is associated with a \(\beta_1\) increase in the probability of being employed.
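A sketch of the LPM on simulated data (the employment probabilities 0.6 and 0.8 are assumptions): the fitted slope is the difference in employment rates between the two education groups.

```python
# Linear Probability Model: binary outcome regressed on a binary regressor.
# b0 estimates P(Employed=1 | no college); b1 estimates the difference
# in employment probabilities between the groups.
import numpy as np

rng = np.random.default_rng(4)
n = 5000
college = rng.integers(0, 2, n)
p = 0.6 + 0.2 * college                       # true employment probabilities
employed = (rng.random(n) < p).astype(int)

sxx = np.sum((college - college.mean()) ** 2)
b1 = np.sum((college - college.mean()) * (employed - employed.mean())) / sxx
b0 = employed.mean() - b1 * college.mean()

print(b0)   # ~ 0.6
print(b1)   # ~ 0.2, the change in the probability of employment
```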

Residual Analysis

  • Once a regression is run, we have a new object we can look at, \(\hat{y}\), the prediction.
  • The residuals are the difference between the actual value, \(y\) and the prediction, \(\hat{y}\).
  • The residuals contain all the information not captured by the model and serve as our estimate of \(\varepsilon\).

Residual Analysis

  • Often a good way to work with the residuals is to graph them.
  • This gives us a way to visually inspect the residuals and whether our assumptions hold.
  • What do we know about what the residuals sum to?
[Figure: Residuals vs. X]
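The question above has a clean answer we can check numerically. A sketch on simulated data: when the model includes an intercept, the OLS residuals always sum to (numerically) zero.

```python
# OLS residuals sum to zero whenever the model has an intercept.
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

print(resid.sum())   # ~ 0 up to floating-point error
# To inspect visually (with matplotlib): plt.scatter(x, resid); plt.axhline(0)
```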

Residual Analysis

  • Recall A3: Homoskedasticity

\[ \text{Var}(\varepsilon_i|X) = \sigma^2_{\varepsilon} \]

  • This means that once we condition on \(X\), the variance of the residuals is constant.
  • If the residuals are scattered around the zero line with no systematic pattern, the linear model fits well.
  • If the residuals' spread around the zero line is roughly constant across \(X\), we have evidence of homoskedasticity.

Heteroskedasticity

  • What if A3 is violated?
  • Then we have heteroskedasticity.
[Figure: Residuals vs. X]

Inadequate Model

  • If the residuals show a systematic pattern around the zero line, then we can say that the model is inadequate.
  • For instance, if a variable has a non-linear relationship with the dependent variable, that pattern will appear in the residuals.

Suppose we have a population model:

\[ y = \beta_0 + \beta_1 X^2 + \varepsilon \]

But we don’t know that X has a squared term… If we run a linear regression like so:

\[ y = \beta_0 + \beta_1X + \varepsilon \]

The squared relationship will show up in the residuals.
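A sketch of exactly this misspecification on simulated data (true model \(y = 1 + 2X^2 + \varepsilon\), fitted as linear):

```python
# Fit a straight line to data generated from a quadratic model;
# the unmodeled curvature shows up as a pattern in the residuals.
import numpy as np

rng = np.random.default_rng(6)
n = 400
x = rng.uniform(-3, 3, n)
y = 1.0 + 2.0 * x**2 + rng.normal(0, 1, n)   # true model has X^2

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Residuals are systematically negative near x = 0 and positive at the ends:
print(resid[np.abs(x) < 1].mean())   # clearly negative
print(resid[np.abs(x) > 2].mean())   # clearly positive
```

A residual plot here would show a U-shape rather than a random scatter around zero.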

[Figure: Residuals vs. X]