The purpose of this in-class lab is to better understand omitted variable bias and multicollinearity. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL5_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(skimr)
library(car)
Also install the car package by typing the following in the console:
# install.packages("car", repos='http://cran.us.r-project.org')
and then add the following line to the preamble of your script (it is already included in the preamble above):
library(car)
The car package allows us to easily compute useful statistics for diagnosing multicollinearity.
We’ll use a new data set on wages, called wage2.
df <- as_tibble(wage2)
skim(wage2)
Name | wage2 |
---|---|
Number of rows | 935 |
Number of columns | 17 |
Column type frequency: numeric | 17 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 1.00 | 957.95 | 404.36 | 115.00 | 669.00 | 905.00 | 1160.00 | 3078.00 | ▅▇▂▁▁ |
hours | 0 | 1.00 | 43.93 | 7.22 | 20.00 | 40.00 | 40.00 | 48.00 | 80.00 | ▁▇▃▁▁ |
IQ | 0 | 1.00 | 101.28 | 15.05 | 50.00 | 92.00 | 102.00 | 112.00 | 145.00 | ▁▃▇▆▁ |
KWW | 0 | 1.00 | 35.74 | 7.64 | 12.00 | 31.00 | 37.00 | 41.00 | 56.00 | ▁▃▇▇▁ |
educ | 0 | 1.00 | 13.47 | 2.20 | 9.00 | 12.00 | 12.00 | 16.00 | 18.00 | ▁▇▃▃▂ |
exper | 0 | 1.00 | 11.56 | 4.37 | 1.00 | 8.00 | 11.00 | 15.00 | 23.00 | ▂▆▇▅▁ |
tenure | 0 | 1.00 | 7.23 | 5.08 | 0.00 | 3.00 | 7.00 | 11.00 | 22.00 | ▇▅▆▂▁ |
age | 0 | 1.00 | 33.08 | 3.11 | 28.00 | 30.00 | 33.00 | 36.00 | 38.00 | ▇▆▅▅▆ |
married | 0 | 1.00 | 0.89 | 0.31 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
black | 0 | 1.00 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
south | 0 | 1.00 | 0.34 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
urban | 0 | 1.00 | 0.72 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
sibs | 0 | 1.00 | 2.94 | 2.31 | 0.00 | 1.00 | 2.00 | 4.00 | 14.00 | ▇▆▂▁▁ |
brthord | 83 | 0.91 | 2.28 | 1.60 | 1.00 | 1.00 | 2.00 | 3.00 | 10.00 | ▇▂▁▁▁ |
meduc | 78 | 0.92 | 10.68 | 2.85 | 0.00 | 8.00 | 12.00 | 12.00 | 18.00 | ▁▁▅▇▁ |
feduc | 194 | 0.79 | 10.22 | 3.30 | 0.00 | 8.00 | 10.00 | 12.00 | 18.00 | ▁▃▆▇▂ |
lwage | 0 | 1.00 | 6.78 | 0.42 | 4.74 | 6.51 | 6.81 | 7.06 | 8.03 | ▁▁▆▇▁ |
Check out what’s in the data by typing
glimpse(df)
## Observations: 935
## Variables: 17
## $ wage <int> 769, 808, 825, 650, 562, 1400, 600, 1081, 1154, 1000, 930, ...
## $ hours <int> 40, 50, 40, 40, 40, 40, 40, 40, 45, 40, 43, 38, 45, 38, 40,...
## $ IQ <int> 93, 119, 108, 96, 74, 116, 91, 114, 111, 95, 132, 102, 125,...
## $ KWW <int> 35, 41, 46, 32, 27, 43, 24, 50, 37, 44, 44, 45, 40, 24, 47,...
## $ educ <int> 12, 18, 14, 12, 11, 16, 10, 18, 15, 12, 18, 14, 15, 16, 16,...
## $ exper <int> 11, 11, 11, 13, 14, 14, 13, 8, 13, 16, 8, 9, 4, 7, 9, 17, 6...
## $ tenure <int> 2, 16, 9, 7, 5, 2, 0, 14, 1, 16, 13, 11, 3, 2, 9, 2, 9, 10,...
## $ age <int> 31, 37, 33, 32, 34, 35, 30, 38, 36, 36, 38, 33, 30, 28, 34,...
## $ married <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ black <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ south <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ urban <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ sibs <int> 1, 1, 1, 4, 10, 1, 1, 2, 2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3...
## $ brthord <int> 2, NA, 2, 3, 6, 2, 2, 3, 3, 1, 1, 2, NA, 1, 1, 2, 3, 3, 1, ...
## $ meduc <int> 8, 14, 14, 12, 6, 8, 8, 8, 14, 12, 13, 16, 12, 10, 12, 6, 1...
## $ feduc <int> 8, 14, 14, 12, 11, NA, 8, NA, 5, 11, 14, NA, 12, 10, 12, 8,...
## $ lwage <dbl> 6.645091, 6.694562, 6.715384, 6.476973, 6.331502, 7.244227,...
Think of the following regression model:
\[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 IQ + u \]
where \(wage\) is a person’s monthly earnings (in dollars).
We want to verify the property in @wooldridge that
\[ \widetilde{\beta}_1 = \hat{\beta}_1+\hat{\beta}_2 \widetilde{\delta}_1 \]
where \(\widetilde{\beta}_1\) comes from a regression of \(\log(wage)\) on \(educ\), \(\widetilde{\delta}_1\) comes from a regression of \(IQ\) on \(educ\), and the \(\hat{\beta}\)’s come from the full regression (in the equation above).
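One way to see the intuition for this property (a sketch of the algebra along the lines of @wooldridge, not something you need to turn in): plug the fitted auxiliary regression \(\widehat{IQ} = \widetilde{\delta}_0 + \widetilde{\delta}_1 educ\) into the fitted full regression:
\[ \hat{\beta}_0 + \hat{\beta}_1 educ + \hat{\beta}_2\left(\widetilde{\delta}_0 + \widetilde{\delta}_1 educ\right) = \left(\hat{\beta}_0 + \hat{\beta}_2\widetilde{\delta}_0\right) + \left(\hat{\beta}_1 + \hat{\beta}_2\widetilde{\delta}_1\right)educ \]
The right-hand side turns out to be exactly the fitted line from the simple regression of \(\log(wage)\) on \(educ\), so its slope must equal \(\widetilde{\beta}_1\).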
First, run a regression of IQ on educ to obtain \(\widetilde{\delta}_1\):
est1 <- lm(IQ ~ educ, data=df)
summary(est1)
##
## Call:
## lm(formula = IQ ~ educ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.228 -7.262 0.907 8.772 37.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.6872 2.6229 20.47 <2e-16 ***
## educ 3.5338 0.1922 18.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 933 degrees of freedom
## Multiple R-squared: 0.2659, Adjusted R-squared: 0.2652
## F-statistic: 338 on 1 and 933 DF, p-value: < 2.2e-16
broom::tidy(est1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53.7 2.62 20.5 3.36e-77
## 2 educ 3.53 0.192 18.4 1.16e-64
Now run a regression of log wage on educ to obtain \(\widetilde{\beta}_1\). Note: you’ll need to create the log wage variable first. If you can’t remember how to do that, refer back to previous labs.
est2 <- lm(logwage ~ educ, data=df)
tidy(est2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.97 0.0814 73.4 0.
## 2 educ 0.0598 0.00596 10.0 1.42e-22
Now run the full regression of log wage on educ and IQ to obtain \(\hat{\beta}_1\) and \(\hat{\beta}_2\). Verify that \(\widetilde{\beta}_1 = \hat{\beta}_1+\hat{\beta}_2 \widetilde{\delta}_1\).
est3 <- lm(logwage ~ educ + IQ, data=df)
tidy(est3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.66 0.0962 58.8 7.80e-316
## 2 educ 0.0391 0.00684 5.72 1.43e- 8
## 3 IQ 0.00586 0.000998 5.88 5.87e- 9
est2$coefficients["educ"] == est3$coefficients["educ"] +
est3$coefficients["IQ"]*est1$coefficients["educ"]
## educ
## TRUE
(The last line returns TRUE if the equality holds and FALSE if it doesn’t.)
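Note that == compares the two numbers exactly, which happens to work here; if rounding error ever made it return FALSE, a tolerance-based check such as all.equal() is a safer alternative (a quick sketch):
all.equal(unname(est2$coefficients["educ"]),
          unname(est3$coefficients["educ"] + est3$coefficients["IQ"]*est1$coefficients["educ"]))
You can also eyeball the identity with the rounded estimates above: \(0.0391 + 0.00586 \times 3.53 \approx 0.0598\), which matches \(\widetilde{\beta}_1\) from the simple regression.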
Is \(\widetilde{\beta}_1\) larger or smaller than \(\hat{\beta}_1\)? What does this mean in terms of omitted variable bias?
Now let’s see how to compute diagnostics of multicollinearity. Recall from @wooldridge that multicollinearity can better be thought of as “a problem with small sample sizes.” Let’s use the meapsingle data set from the wooldridge package. We are interested in the variable pctsgle, which gives the percentage of single-parent families residing in the same ZIP code as the school. The outcome variable is math4, the percentage of students at the school who passed the 4th grade state math test.
Load the data and run a regression of math4 on pctsgle. Interpret the slope coefficient of this regression. Does the effect seem large?
df <- as_tibble(meapsingle)
est <- lm(math4 ~ pctsgle, data=df) # slope on pctsgle: -0.833 (SE 0.071)
summary(est)
##
## Call:
## lm(formula = math4 ~ pctsgle, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.791 -8.310 1.600 8.092 50.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.77043 1.59680 60.60 <2e-16 ***
## pctsgle -0.83288 0.07068 -11.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.48 on 227 degrees of freedom
## Multiple R-squared: 0.3795, Adjusted R-squared: 0.3768
## F-statistic: 138.9 on 1 and 227 DF, p-value: < 2.2e-16
Now consider the same model, but with lmedinc and free as additional regressors. lmedinc is the log median household income in the ZIP code, and free is the percentage of students who qualify for free or reduced-price lunch. Do you think there might be a strong correlation between lmedinc and free? Compute the correlation. Does it have the sign you would expect? Do you think it’s close enough to 1 in absolute value that it would violate the “no perfect collinearity” assumption?
cor(df$lmedinc,df$free)
## [1] -0.7469703
Now run the model with pctsgle, lmedinc, and free as regressors.
est <- lm(math4 ~ pctsgle + lmedinc + free, data=df) # slope on pctsgle shrinks to -0.200 (SE 0.159)
summary(est)
##
## Call:
## lm(formula = math4 ~ pctsgle + lmedinc + free, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.919 -7.195 0.931 7.313 50.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.72322 58.47814 0.884 0.377
## pctsgle -0.19965 0.15872 -1.258 0.210
## lmedinc 3.56013 5.04170 0.706 0.481
## free -0.39642 0.07035 -5.635 5.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.7 on 225 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4526
## F-statistic: 63.85 on 3 and 225 DF, p-value: < 2.2e-16
Comment on the value of the pctsgle coefficient compared to the first regression you ran. What can you say about lmedinc and free as confounding variables?
A commonly used diagnostic of multicollinearity is the variance inflation factor (VIF). We can compute it with the vif() function from the car package. Let’s compute the VIFs from our estimates of the previous equation:
vif(est)
## pctsgle lmedinc free
## 5.740981 4.118812 3.188079
VIFs of 10 or more are typically thought to be problematic, because \(VIF_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) is the R-squared from regressing \(x_j\) on the other regressors; a VIF above 10 therefore means \(R_j^2>0.9\). See p. 86 of @wooldridge.
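To see where these numbers come from, here is a quick sketch (not required for the lab) that reproduces the VIF for pctsgle by hand from its auxiliary regression:
aux <- lm(pctsgle ~ lmedinc + free, data=df) # regress x_j on the other regressors
1/(1 - summary(aux)$r.squared) # should match vif(est)["pctsgle"] above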
Multicollinearity is typically only a problem in small samples. Recall from @wooldridge that, under homoskedasticity, \(Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j(1-R_j^2)}\). As the sample size increases, \(R_j^2\) might decrease, and the total variation in \(x_j\) (\(SST_j\)) definitely increases, so the variance of \(\hat{\beta}_j\) shrinks even if the regressors remain highly correlated. So multicollinearity is typically not a problem we worry about much.
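If you want to see this in action, here is a small simulation sketch (not part of the lab; the data-generating process is invented purely for illustration): two regressors with a correlation of about 0.9, where the standard error on x1 keeps shrinking as the sample grows even though the correlation does not change.
set.seed(1234)
sim_se <- function(n) {
  x1 <- rnorm(n)
  x2 <- 0.9*x1 + sqrt(1 - 0.9^2)*rnorm(n) # corr(x1, x2) is about 0.9 at any n
  y  <- 1 + 0.5*x1 + 0.5*x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}
sapply(c(50, 500, 5000), sim_se) # the SE on x1 falls roughly with sqrt(n)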
Wooldridge, Jeffrey M. 2015. *Introductory Econometrics: A Modern Approach*. 6th ed. Cengage Learning.