Regressions get their name from the idea of “regression to the mean”
The idea that extreme values tend to be followed by more moderate values
“Regression towards mediocrity in hereditary stature”
Francis Galton observed that tall parents tend to have children who are shorter than them, and short parents tend to have children who are taller than them.
So it seems that “extreme” characteristics tend to “regress” towards the average in the next generation.
What is a regression?
Some examples of this are:
Students who score extremely high or low on a test tend to score closer to the average on a subsequent test.
Athletes who perform exceptionally well in their first year often see a decline in performance in their second year.
Stock prices that experience extreme fluctuations often revert to their average levels over time.
The sophomore slump: students who do very well in their first year of college often see a decline in performance in their second year.
What is a regression?
Galton coined the term “regression to the mean” and developed the simple linear regression model.
Nowadays, though, the term “regression” refers to a statistical method used to model the relationship between a dependent variable and one or more independent variables.
Why do we use regressions?
Linear regression is used extensively in econometrics
This can be for several reasons:
It is relatively simple to understand and implement.
The results are easy to interpret.
It often provides a good approximation of the relationship between variables, even when the true relationship is not linear.
It is relatively simple to understand and implement.
The linear regression model assumes a linear relationship between the dependent variable and the independent variables.
A simple regression model can be expressed mathematically as:
\[
Y = b_0 + b_1 X + e
\]
Where:

- \(Y\) is the dependent variable.
- \(X\) is the independent variable.
- \(b_0\) is the intercept.
- \(b_1\) is the slope coefficient.
- \(e\) is the error term.
You can think of the coefficients as the parameters that define the line that best fits the data.
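As a quick illustration of fitting that line (a sketch using simulated data and numpy, not data from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from Y = b0 + b1*X + e with known coefficients.
b0_true, b1_true = 2.0, 0.5
X = rng.uniform(0, 10, size=10_000)
e = rng.normal(0, 1, size=10_000)   # error term, mean zero
Y = b0_true + b1_true * X + e

# Fit the best-fit line; np.polyfit returns [slope, intercept] for deg=1.
b1_hat, b0_hat = np.polyfit(X, Y, deg=1)
```

With this much data, the estimated intercept and slope land very close to the true 2.0 and 0.5 that defined the line.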
The results are easy to interpret.
The coefficients in a linear regression model have a straightforward interpretation.
For example, in the simple regression model above, \(b_1\) represents the change in the dependent variable \(Y\) for a one-unit increase in the independent variable \(X\), holding all other factors constant.
This makes it easy to understand the impact of each independent variable on the dependent variable.
It often provides a good approximation of the relationship between variables, even when the true relationship is not linear.
This is a little harder to explain, so let’s start with: conditional expectations
Recall the conditional expectation:
\[
E[Y|X] = \int y f(y|X) dy
\]
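When \(X\) is discrete, the integral reduces to a sum, and in a sample \(E[Y|X=x]\) is simply the average of \(Y\) among observations with that value of \(x\). A tiny made-up example (hypothetical schooling/wage numbers, not real data):

```python
import numpy as np

# Hypothetical sample: X is years of schooling, Y is the wage.
X = np.array([8, 8, 12, 12, 12, 16, 16])
Y = np.array([10.0, 12.0, 15.0, 17.0, 16.0, 24.0, 26.0])

# Empirical CEF: E[Y | X = x] is the mean of Y within each X group.
cef = {x: Y[X == x].mean() for x in np.unique(X)}
# group means: 11.0 at X=8, 16.0 at X=12, 25.0 at X=16
```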
Mean Independence
An incredibly important concept and assumption in econometrics is mean independence.
\[
E(\epsilon|X) = E(\epsilon)
\]
If this equation holds, then we say that \(\epsilon\) is mean independent of \(X\).
This means that once we know \(X\), we don’t learn anything new about the expected value of \(\epsilon\).
This is a weaker condition than full independence, which would require that the entire distribution of \(\epsilon\) is independent of \(X\).
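To see that the condition really is weaker, here is a simulated sketch of an error that is mean independent of \(X\) but not fully independent: its conditional mean is zero everywhere, yet its conditional variance changes with \(X\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# X takes values 1 or 3; Z is standard normal and independent of X.
X = rng.choice([1.0, 3.0], size=n)
Z = rng.normal(0, 1, size=n)

# eps = X * Z: E(eps | X) = X * E(Z) = 0 for every X (mean independent),
# but Var(eps | X) = X**2 depends on X (not fully independent).
eps = X * Z

mean_given_1 = eps[X == 1].mean()   # ~ 0
mean_given_3 = eps[X == 3].mean()   # ~ 0
var_given_1 = eps[X == 1].var()     # ~ 1
var_given_3 = eps[X == 3].var()     # ~ 9
```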
Example
Let’s say we are studying the relationship between wages and education level.
Let \(Y\) be the wage and \(X\) be the years of education.
If \(Y\) itself were mean independent of \(X\), that would imply \(E(Y|X=8)=E(Y|X=12)=E(Y|X=16)\).
So this is saying that your expected wage is the same regardless of whether you have 8, 12, or 16 years of education.
Is that a good assumption?
Example
Let’s say that instead, the error term \(\epsilon\) is unobserved ability.
Mean independence would imply that \(E(\epsilon|X=8)=E(\epsilon|X=12)=E(\epsilon|X=16)\).
So this is saying that your expected ability is the same regardless of whether you have 8, 12, or 16 years of education.
Is that a good assumption?
Example
Can you tell me a story where this assumption might hold?
What theory of ability might lead us to believe this?
Can you tell me a story where this assumption might fail?
What theory of ability might lead us to believe this assumption fails?
The Conditional Expectation Function
But let’s say that we are willing to make that mean independence assumption.
And let’s add an assumption: that on average, ability is 0. This is called a normalization.
\[
E(\epsilon) = 0
\]
Then what you get is a key identifying assumption in econometrics:
\[
E(\epsilon|X)=0
\]
This is called the zero conditional mean assumption. Under it, we can decompose \(Y\) as:
\[
Y = E(Y|X) + \epsilon
\]
Here \(E(Y|X)\) is called the conditional expectation function (CEF).
What is the point?
What was the point of all that setup?
We now have two objects to think about:
\[
Y = b_0 + b_1 X + e
\]
and
\[
Y = E(Y|X) + \epsilon \overset{?}{=} \beta_0 + \beta_1 X + \epsilon
\]
The first is the linear regression line, or best-fit line, which we will call OLS.
The second is the true population regression function.
The CEF need not be linear, but if it is, then we can write it as \(\beta_0 + \beta_1 X\). But we’ll see that even if it isn’t, we are still in good shape.
If we can estimate \(b_0\) and \(b_1\) to be equal to \(\beta_0\) and \(\beta_1\), then we have causally identified the effect of \(X\) on \(Y\).
The Conditional Expectation Function Decomposition
Note that \(Y = E(Y|X) + \epsilon\) actually needs to be proven.
This is actually a pretty powerful idea.
It says that any random variable \(Y\) can be decomposed into two parts:
A part that is explained by \(X\) (the conditional expectation)
A part that is not explained by \(X\) (the error term)
It’s only true if:
\(E(\epsilon|X) = 0\) (zero conditional mean assumption)
\(\epsilon\) is uncorrelated with any function of \(X\) (which follows from the first condition)
Your turn: assume you have some function of \(X\), call it \(h(X)\). Show that \(E(h(X)\,\epsilon) = 0\).
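As a numerical sanity check of that orthogonality property (simulated data; it previews where the exercise is headed), an error with \(E(\epsilon|X)=0\) has sample moments \(\frac{1}{n}\sum h(x_i)\epsilon_i\) near zero for any choice of \(h\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

X = rng.uniform(-2, 2, size=n)
eps = rng.normal(0, 1, size=n)   # drawn independently of X, so E(eps|X) = 0

# eps is (approximately, in the sample) orthogonal to any function of X:
moments = [np.mean(h(X) * eps)
           for h in (lambda x: x, np.sin, np.exp, lambda x: x**3)]
# every entry of `moments` is close to 0
```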
Other properties
The CEF is the best predictor of \(Y\) given \(X\) in the mean squared error sense: no other function of \(X\) predicts \(Y\) with lower mean squared error.
ANOVA: \(Var(Y) = Var(E(Y|X)) + E(Var(Y|X))\)
The total variance of \(Y\) can be decomposed into the variance explained by \(X\) and the variance not explained by \(X\).
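The ANOVA decomposition can be checked directly on simulated data with a discrete \(X\), where the conditional mean and variance are just group-level moments (a sketch, with made-up parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# Discrete X so conditional moments can be computed exactly by group.
X = rng.choice([0.0, 1.0, 2.0], size=n)
Y = 1.0 + 2.0 * X + rng.normal(0, 1.5, size=n)

# Per-observation conditional mean and conditional variance of Y given X.
cond_mean = np.empty(n)
cond_var = np.empty(n)
for x in (0.0, 1.0, 2.0):
    mask = X == x
    cond_mean[mask] = Y[mask].mean()
    cond_var[mask] = Y[mask].var()

explained = cond_mean.var()     # Var(E[Y|X])
unexplained = cond_var.mean()   # E[Var(Y|X)]
total = Y.var()                 # equals explained + unexplained
```

With population-style (ddof=0) variances, the decomposition holds as an exact identity in the sample, not just in expectation.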
Again… what’s the point?
All of this lead-up was to motivate one thing…
Applied econometricians often use linear models.
But we can all agree that the world is not linear.
The true data-generating process and population model is incredibly complicated.
So why do we use linear models so much?
Linear Regression, the CEF and Linearity
If the population CEF actually is linear in \(X\), then the linear regression model is correctly specified.
If the population CEF is not linear in \(X\), then the linear regression model is misspecified. But! The linear regression model still provides the best linear approximation to the true CEF.
So even if the true relationship between \(X\) and \(Y\) is not linear, the linear regression model can still provide a useful first-order approximation of that relationship.
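A simulated sketch of this point: here the true CEF is quadratic, yet OLS still recovers the best linear approximation to it, with slope \(Cov(X,Y)/Var(X)\) (which works out to 2 for this particular setup).

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

X = rng.uniform(0, 2, size=n)
Y = X**2 + rng.normal(0, 1, size=n)   # true CEF is nonlinear: E[Y|X] = X^2

# OLS still delivers the best *linear* approximation to that CEF;
# its slope equals Cov(X, Y) / Var(X) (= 2 in this population).
b1_hat, b0_hat = np.polyfit(X, Y, deg=1)
slope_formula = np.cov(X, Y, ddof=0)[0, 1] / X.var()
```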
What is a best fit line?
How do we find the best fit line?
How well does it fit the data?
Different samples would yield different lines
Estimating OLS
How do we estimate \(b_0\) and \(b_1\)?
We use the method of Ordinary Least Squares (OLS).
This means that we minimize the sum of squared residuals (equivalently, the mean squared error) between the observed values of \(Y\) and the values of \(Y\) predicted by the regression line.
What is a residual?
A residual is the difference between the observed value of the dependent variable and the value predicted by the regression model.
\[
e_i = Y_i - b_0 - b_1 X_i
\]
for each observation \(i\) in our sample.
Estimating OLS
Note the two assumptions we made earlier:
Mean independence: \(E(\epsilon|X) = E(\epsilon)\)
Normalization: \(E(\epsilon) = 0\)
We can derive the OLS estimators for \(b_0\) and \(b_1\) using these assumptions.
From here, there are two things we can say further:
Mean independence, combined with the normalization, gives us that \(E(X \epsilon) = 0\)
From there we can also say that \(Cov(X, \epsilon) = 0\)
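The sample analogues of those two conditions give the standard closed-form estimators, \(b_1 = Cov(X,Y)/Var(X)\) and \(b_0 = \bar{Y} - b_1 \bar{X}\). A simulated sketch, which also shows that the residuals then satisfy both conditions in-sample by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000

X = rng.normal(5, 2, size=n)
Y = 1.0 + 3.0 * X + rng.normal(0, 2, size=n)

# Closed-form OLS estimators: sample analogues of E(eps) = 0, E(X*eps) = 0.
b1 = np.cov(X, Y, ddof=0)[0, 1] / X.var()
b0 = Y.mean() - b1 * X.mean()

# The residuals satisfy both conditions in-sample, by construction.
resid = Y - b0 - b1 * X
mean_resid = resid.mean()                     # = 0 (up to rounding)
cov_x_resid = np.cov(X, resid, ddof=0)[0, 1]  # = 0 (up to rounding)
```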
If we want to understand how well our model fits the data, an intuitive way to do so is to ask:
How much of the total variation in \(Y\) is explained by the model (\(X\))?
This is exactly what R-squared (\(R^2\)) measures:
\[
R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}
\]
One expression is in terms of explained variation (ESS, the explained sum of squares); the other is in terms of unexplained variation (RSS, the residual sum of squares); both are relative to the total variation (TSS, the total sum of squares).
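A simulated sketch showing that the two expressions agree, because \(TSS = ESS + RSS\) holds exactly for OLS with an intercept:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 10_000

X = rng.uniform(0, 1, size=n)
Y = 2.0 + 1.0 * X + rng.normal(0, 0.5, size=n)

b1, b0 = np.polyfit(X, Y, deg=1)
Y_hat = b0 + b1 * X

TSS = np.sum((Y - Y.mean()) ** 2)       # total sum of squares
ESS = np.sum((Y_hat - Y.mean()) ** 2)   # explained sum of squares
RSS = np.sum((Y - Y_hat) ** 2)          # residual sum of squares

r2_explained = ESS / TSS
r2_unexplained = 1 - RSS / TSS          # same number
```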
Goodness of Fit
Why is it named \(R^2\)?
The correlation coefficient is sometimes known as \(R\).
Remember that correlation is between -1 and 1.
\[
R = \frac{Cov(X,Y)}{\sqrt{Var(X) Var(Y)}}
\]
It turns out that in simple linear regression, the square of the correlation coefficient between \(X\) and \(Y\) is equal to the R-squared value from the regression.
So \(R^2\) is the square of the correlation coefficient between the independent variable and the dependent variable in a simple linear regression.
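A quick numerical confirmation of that identity on simulated data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

X = rng.normal(0, 1, size=n)
Y = 0.5 + 2.0 * X + rng.normal(0, 3, size=n)

# R^2 from the simple regression...
b1, b0 = np.polyfit(X, Y, deg=1)
resid = Y - (b0 + b1 * X)
r_squared = 1 - resid.var() / Y.var()

# ...equals the squared correlation coefficient between X and Y.
corr = np.corrcoef(X, Y)[0, 1]
# r_squared == corr**2 (up to rounding)
```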
Goodness of Fit
So what does \(R^2\) tell us?
\(R^2\) is between 0 and 1.
An \(R^2\) of 0 means that the model does not explain any of the variation in \(Y\).
An \(R^2\) of 1 means that the model explains all of the variation in \(Y\).
In general, a higher \(R^2\) indicates a better fit of the model to the data.
It does not imply causation.
High fit of the data doesn’t mean that X causes Y.
\(R^2\) is not a good measure for comparing models of different \(Y\) variables.
But it is useful for comparing different sets of \(X\) variables for the same \(Y\) variable.