EC031-S26
Note: The problems may differ based on the edition of the textbook you have.
Briefly explain the difference between \(b_1(\text{OLS})\) and \(\beta_1\); between the residual, \(e_i\), and the regression error, \(\epsilon_i\); and between the OLS predicted value, \(\hat{y}_i\), and \(E(Y_i|X)\).
Solution
\(\beta_1\) is the true, population-level slope parameter representing the change in the expected value of \(y\) associated with a 1-unit increase in \(x\), while \(b_1\) is the estimate of \(\beta_1\) from an OLS regression of sample values of \(y_i\) on \(x_i\).
The residual, \(e_i\), is the difference between the actual observed value of \(y_i\) and the \(y\)-value the regression estimates would predict for that individual or observation, based on their/its value of \(x\). (That predicted value is \(\hat{y}_i=b_0+b_1 x_i\).) The regression error term, \(\epsilon_i\), reflects the fact that the population-level relationship between \(x\) and \(y\) is not fully deterministic. Said differently, it represents the unobserved random component of \(y\) that is not captured or explained by variation in \(x\).
The OLS predicted value, \(\hat{y}_i\), is the value of \(y\) that the regression estimates would predict for a given value, \(x_i\). \(E(y_i \mid x)\) is the population-level average, or expected value, of \(y\) conditional on \(x\) (i.e., for a given value of \(x\)).
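A short simulation can make these distinctions concrete. The sketch below (not part of the problem set; the population parameters \(\beta_0=1\), \(\beta_1=2\), the error distribution, and the sample size are illustrative assumptions) generates data from a known population model, so the true errors \(\epsilon_i\) are visible, and then contrasts them with the OLS estimates and residuals.

```python
# Illustrative sketch: simulate y_i = beta0 + beta1*x_i + eps_i with known
# population parameters, then compare them with the OLS estimates b0, b1.
# All specific numbers (beta0=1, beta1=2, n=500) are assumptions for the demo.
import random

random.seed(42)
beta0, beta1, n = 1.0, 2.0, 500
x = [random.uniform(0, 10) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]            # regression errors (unobserved in practice)
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

xbar = sum(x) / n
ybar = sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sxy / sxx                                          # OLS estimate of beta1
b0 = ybar - b1 * xbar                                   # OLS estimate of beta0
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]   # residuals e_i (observable)

print(b0, b1)        # close to, but not equal to, beta0 and beta1
print(sum(resid))    # OLS residuals sum to (numerically) zero by construction
```

The residuals \(e_i\) differ from the errors \(\epsilon_i\) because they are measured from the estimated line \(b_0+b_1x_i\), not the true line \(\beta_0+\beta_1x_i\).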
ASW 10.38
Solution
\[ H_0: \mu_1-\mu_2=0 \]
\[ H_{\mathrm{a}}: \mu_1-\mu_2 \neq 0 \]
\[ z=\frac{\left(\bar{x}_1-\bar{x}_2\right)-D_0}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}=\frac{(4.1-3.4)-0}{\sqrt{\frac{(2.2)^2}{120}+\frac{(1.5)^2}{100}}}=2.79 \]
\[ p \text {-value }=2(1.0000-.9974)=.0052 \]
\(p\)-value \(\leq .05\), reject \(H_0\). A difference exists with system B having the lower mean checkout time.
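The z statistic and p-value above can be checked numerically. This is a sketch using only the sample summaries stated in the solution (means 4.1 and 3.4, standard deviations 2.2 and 1.5, sample sizes 120 and 100); `math.erfc` gives the two-tailed normal tail probability without a z table.

```python
# Sketch: reproduce the two-sample z statistic and two-tailed p-value
import math

x1bar, x2bar = 4.1, 3.4
s1, s2 = 2.2, 1.5
n1, n2 = 120, 100

z = (x1bar - x2bar) / math.sqrt(s1**2 / n1 + s2**2 / n2)
p_value = math.erfc(z / math.sqrt(2))   # two-tailed: 2 * P(Z > z)
print(round(z, 2), round(p_value, 4))
```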
ASW 10.45
Solution
\[ \begin{aligned} & \bar{p}_1=9 / 142=.0634 \\ & \bar{p}_2=5 / 268=.0187 \\ & \bar{p}=\frac{n_1 \bar{p}_1+n_2 \bar{p}_2}{n_1+n_2}=\frac{9+5}{142+268}=.0341 \\ & z=\frac{\bar{p}_1-\bar{p}_2}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}=\frac{.0634-.0187}{\sqrt{.0341(1-.0341)\left(\frac{1}{142}+\frac{1}{268}\right)}}=2.37 \\ & p \text {-value }=2(1.0000-.9911)=.0178 \end{aligned} \]
\(p\)-value \(\leq .02\), reject \(H_0\). There is a significant difference in drug resistance between the two states. Alabama has the higher drug-resistance rate.
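The pooled two-proportion test above can be replicated in a few lines. This sketch uses only the counts stated in the solution (9 of 142 and 5 of 268):

```python
# Sketch: pooled two-proportion z test matching the computation above
import math

n1, n2 = 142, 268
p1, p2 = 9 / n1, 5 / n2
p_pool = (9 + 5) / (n1 + n2)            # pooled proportion under H0

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = math.erfc(z / math.sqrt(2))   # two-tailed: 2 * P(Z > z)
print(round(z, 2), round(p_value, 4))
```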
ASW 11.23
Solution
\[ \begin{aligned} & \frac{(19)(900)}{30.144} \leq \sigma^2 \leq \frac{(19)(900)}{10.117} \\ & 567 \leq \sigma^2 \leq 1,690 \end{aligned} \]
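The interval endpoints follow from \((n-1)s^2/\chi^2\) with the tabled critical values for \(df=19\). A sketch of the arithmetic (the critical values 30.144 and 10.117 are taken directly from the solution, i.e., \(\chi^2_{.025}\) and \(\chi^2_{.975}\) with 19 degrees of freedom):

```python
# Sketch: 95% confidence interval for sigma^2 from s^2 = 900, n = 20
df, s2 = 19, 900
chi2_upper, chi2_lower = 30.144, 10.117  # tabled chi-square critical values, df = 19

lower = df * s2 / chi2_upper
upper = df * s2 / chi2_lower
print(round(lower), round(upper))        # interval endpoints for sigma^2
```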
ASW 11.29
Solution
\[ \begin{gathered} s^2=\frac{\Sigma\left(x_i-\bar{x}\right)^2}{n-1}=\frac{101.56}{9-1}=12.69 \\ H_0: \sigma^2=10 \\ H_{\mathrm{a}}: \sigma^2 \neq 10 \\ \chi^2=\frac{(n-1) s^2}{\sigma^2}=\frac{(9-1)(12.69)}{10}=10.16 \end{gathered} \]
Degrees of freedom \(=n-1=8\). Using the \(\chi^2\) table, the area in the tail is greater than .10, so the two-tailed \(p\)-value is greater than .20. The exact two-tailed \(p\)-value corresponding to \(\chi^2=10.16\) is .5080. Because the \(p\)-value \(>.10\), do not reject \(H_0\).
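The test statistic can be verified from the sums given in the solution (\(\Sigma(x_i-\bar{x})^2 = 101.56\), \(n = 9\)):

```python
# Sketch: chi-square statistic for H0: sigma^2 = 10
ss, n = 101.56, 9      # sum of squared deviations and sample size from the solution
sigma2_0 = 10          # hypothesized variance

s2 = ss / (n - 1)                 # sample variance
chi2 = (n - 1) * s2 / sigma2_0    # chi-square test statistic, df = n - 1
print(round(s2, 2), round(chi2, 2))
```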
ASW 14.55
Solution
No. Regression or correlation analysis can never prove that two variables are causally related.
ASW 14.1
Solution
There appears to be a positive linear relationship between \(x\) and \(y\).
Many different straight lines can be drawn to provide a linear approximation of the relationship between \(x\) and \(y\); in part d we will determine the equation of a straight line that “best” represents the relationship according to the least squares criterion.
\[ \bar{x}=\frac{\Sigma x_i}{n}=\frac{15}{5}=3 \quad \bar{y}=\frac{\Sigma y_i}{n}=\frac{40}{5}=8 \]
\[ \Sigma\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)=26 \quad \Sigma\left(x_i-\bar{x}\right)^2=10 \]
\[ b_1=\frac{\Sigma\left(x_i-\bar{x}\right)\left(y_i-\bar{y}\right)}{\Sigma\left(x_i-\bar{x}\right)^2}=\frac{26}{10}=2.6 \]
\[ b_0=\bar{y}-b_1 \bar{x}=8-(2.6)(3)=0.2 \]
\[ \hat{y}=0.2+2.6 x \]
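The slope and intercept follow mechanically from the summary quantities computed above (\(\bar{x}=3\), \(\bar{y}=8\), \(\Sigma(x_i-\bar{x})(y_i-\bar{y})=26\), \(\Sigma(x_i-\bar{x})^2=10\)); the raw data points themselves are in the textbook problem and are not reproduced here.

```python
# Sketch: least squares slope and intercept from the summary statistics above
xbar, ybar = 3, 8
sxy, sxx = 26, 10   # sum of cross-products and sum of squared deviations

b1 = sxy / sxx              # slope: b1 = 2.6
b0 = ybar - b1 * xbar       # intercept: b0 = 0.2 (up to floating-point rounding)
print(b0, b1)
```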
ASW 14.47
Solution
\[ \hat{y}=29.4+1.55 x \]
\[ \mathrm{MSR}=\mathrm{SSR} / 1=691.72 \]
\[ \mathrm{MSE}=\mathrm{SSE} /(n-2)=310.28 / 5=62.056 \]
\[ F=\mathrm{MSR} / \mathrm{MSE}=691.72 / 62.056=11.15 \]
Using the \(F\) table (1 degree of freedom in the numerator and 5 in the denominator), the \(p\)-value is between .01 and .025.
Using Excel, the \(p\)-value corresponding to \(F=11.15\) is .0206 . Because \(p\)-value \(\leq \alpha=.05\), we conclude that the two variables are related.
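The ANOVA arithmetic above can be checked directly from the sums of squares given in the solution (SSR = 691.72, SSE = 310.28, \(n = 7\)):

```python
# Sketch: F statistic for the significance of the regression
ssr, sse, n = 691.72, 310.28, 7

msr = ssr / 1          # one independent variable, so df = 1
mse = sse / (n - 2)    # df = n - 2 = 5
f = msr / mse
print(round(mse, 4), round(f, 2))
```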
(Scatterplot and residual plot not reproduced.)
The residual plot leads us to question the assumption of a linear relationship between x and y. Even though the relationship is significant at the .05 level of significance, it would be extremely dangerous to extrapolate beyond the range of the data.
In this problem, we will simulate a simple linear regression model with one independent variable and one dependent variable. We will then add outliers to the data and see how the OLS estimates change.
To do this, open a do-file and write:
Estimate the OLS regression of \(Y\) on \(X\). What are the estimated coefficients? Why?
Add an outlier to the data by making the first row of \(Y\) equal to 100:
Solution
The estimated intercept and slope should be close to 1 and 2, respectively, because there is no omitted-variable bias or endogeneity in the simulated model.
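The exercise's do-file is written in Stata and its contents are not reproduced above, so the sketch below is an illustrative Python analogue under the assumption that the data are generated as \(y = 1 + 2x + \epsilon\) with standard normal errors. It fits OLS on the clean data, then sets the first value of \(y\) to 100 and refits, showing how sensitive OLS is to a single outlier.

```python
# Sketch (assumed DGP: y = 1 + 2x + e, e ~ N(0,1)); the exercise's actual
# Stata do-file is not reproduced here.
import random

random.seed(1)
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]

def ols(x, y):
    """Simple one-regressor OLS; returns (intercept, slope)."""
    m = len(x)
    xbar, ybar = sum(x) / m, sum(y) / m
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

b0, b1 = ols(x, y)             # close to the true values 1 and 2

y_out = [100.0] + y[1:]        # contaminate the first observation
b0_out, b1_out = ols(x, y_out)

sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sse_out = sum((yi - b0_out - b1_out * xi) ** 2 for xi, yi in zip(x, y_out))
print(b0, b1, b0_out, b1_out)  # the contaminated fit shifts toward the outlier
```

Because OLS minimizes *squared* residuals, a single extreme observation receives enormous weight, which is why the refitted line and the sum of squared errors change so sharply.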