The purpose of this in-class lab is to better understand omitted variable bias and multicollinearity. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL5_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(skimr)
library(car)
Also install the car package by typing the following in the console:
# install.packages("car", repos='http://cran.us.r-project.org')
and then add the following line to the preamble of your script (it is already included in the preamble above):
library(car)
The car package allows us to easily compute useful statistics for diagnosing multicollinearity.
We’ll use a new data set on wages, called wage2.
df <- as_tibble(wage2)
skim(wage2)
Name | wage2 |
---|---|
Number of rows | 935 |
Number of columns | 17 |
Column type frequency: numeric | 17 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 1.00 | 957.95 | 404.36 | 115.00 | 669.00 | 905.00 | 1160.00 | 3078.00 | ▅▇▂▁▁ |
hours | 0 | 1.00 | 43.93 | 7.22 | 20.00 | 40.00 | 40.00 | 48.00 | 80.00 | ▁▇▃▁▁ |
IQ | 0 | 1.00 | 101.28 | 15.05 | 50.00 | 92.00 | 102.00 | 112.00 | 145.00 | ▁▃▇▆▁ |
KWW | 0 | 1.00 | 35.74 | 7.64 | 12.00 | 31.00 | 37.00 | 41.00 | 56.00 | ▁▃▇▇▁ |
educ | 0 | 1.00 | 13.47 | 2.20 | 9.00 | 12.00 | 12.00 | 16.00 | 18.00 | ▁▇▃▃▂ |
exper | 0 | 1.00 | 11.56 | 4.37 | 1.00 | 8.00 | 11.00 | 15.00 | 23.00 | ▂▆▇▅▁ |
tenure | 0 | 1.00 | 7.23 | 5.08 | 0.00 | 3.00 | 7.00 | 11.00 | 22.00 | ▇▅▆▂▁ |
age | 0 | 1.00 | 33.08 | 3.11 | 28.00 | 30.00 | 33.00 | 36.00 | 38.00 | ▇▆▅▅▆ |
married | 0 | 1.00 | 0.89 | 0.31 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▁▁▇ |
black | 0 | 1.00 | 0.13 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
south | 0 | 1.00 | 0.34 | 0.47 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
urban | 0 | 1.00 | 0.72 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
sibs | 0 | 1.00 | 2.94 | 2.31 | 0.00 | 1.00 | 2.00 | 4.00 | 14.00 | ▇▆▂▁▁ |
brthord | 83 | 0.91 | 2.28 | 1.60 | 1.00 | 1.00 | 2.00 | 3.00 | 10.00 | ▇▂▁▁▁ |
meduc | 78 | 0.92 | 10.68 | 2.85 | 0.00 | 8.00 | 12.00 | 12.00 | 18.00 | ▁▁▅▇▁ |
feduc | 194 | 0.79 | 10.22 | 3.30 | 0.00 | 8.00 | 10.00 | 12.00 | 18.00 | ▁▃▆▇▂ |
lwage | 0 | 1.00 | 6.78 | 0.42 | 4.74 | 6.51 | 6.81 | 7.06 | 8.03 | ▁▁▆▇▁ |
Check out what’s in the data by typing
glimpse(df)
## Observations: 935
## Variables: 17
## $ wage <int> 769, 808, 825, 650, 562, 1400, 600, 1081, 1154, 1000, 930, ...
## $ hours <int> 40, 50, 40, 40, 40, 40, 40, 40, 45, 40, 43, 38, 45, 38, 40,...
## $ IQ <int> 93, 119, 108, 96, 74, 116, 91, 114, 111, 95, 132, 102, 125,...
## $ KWW <int> 35, 41, 46, 32, 27, 43, 24, 50, 37, 44, 44, 45, 40, 24, 47,...
## $ educ <int> 12, 18, 14, 12, 11, 16, 10, 18, 15, 12, 18, 14, 15, 16, 16,...
## $ exper <int> 11, 11, 11, 13, 14, 14, 13, 8, 13, 16, 8, 9, 4, 7, 9, 17, 6...
## $ tenure <int> 2, 16, 9, 7, 5, 2, 0, 14, 1, 16, 13, 11, 3, 2, 9, 2, 9, 10,...
## $ age <int> 31, 37, 33, 32, 34, 35, 30, 38, 36, 36, 38, 33, 30, 28, 34,...
## $ married <int> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ black <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ south <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ urban <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,...
## $ sibs <int> 1, 1, 1, 4, 10, 1, 1, 2, 2, 1, 1, 1, 2, 3, 1, 1, 3, 2, 3, 3...
## $ brthord <int> 2, NA, 2, 3, 6, 2, 2, 3, 3, 1, 1, 2, NA, 1, 1, 2, 3, 3, 1, ...
## $ meduc <int> 8, 14, 14, 12, 6, 8, 8, 8, 14, 12, 13, 16, 12, 10, 12, 6, 1...
## $ feduc <int> 8, 14, 14, 12, 11, NA, 8, NA, 5, 11, 14, NA, 12, 10, 12, 8,...
## $ lwage <dbl> 6.645091, 6.694562, 6.715384, 6.476973, 6.331502, 7.244227,...
Think of the following regression model:
\[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 IQ + u \]
where \(wage\) is a person’s monthly earnings (in dollars).
We want to verify the property in @wooldridge that
\[ \widetilde{\beta}_1 = \hat{\beta}_1+\hat{\beta}_2 \widetilde{\delta}_1 \]
where \(\widetilde{\beta}_1\) comes from a regression of \(\log(wage)\) on \(educ\), \(\widetilde{\delta}_1\) comes from a regression of \(IQ\) on \(educ\), and the \(\hat{\beta}\)’s come from the full regression (in the equation above).
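One way to see the intuition for this property (a sketch of the algebra along the lines of @wooldridge, not something you need to turn in): plug the fitted auxiliary regression \(\widehat{IQ} = \widetilde{\delta}_0 + \widetilde{\delta}_1 educ\) into the fitted full regression:
\[ \hat{\beta}_0 + \hat{\beta}_1 educ + \hat{\beta}_2\left(\widetilde{\delta}_0 + \widetilde{\delta}_1 educ\right) = \left(\hat{\beta}_0 + \hat{\beta}_2\widetilde{\delta}_0\right) + \left(\hat{\beta}_1 + \hat{\beta}_2\widetilde{\delta}_1\right)educ \]
The right-hand side turns out to be exactly the fitted line from the simple regression of \(\log(wage)\) on \(educ\), so its slope must equal \(\widetilde{\beta}_1\).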
First, run a regression of IQ on educ to obtain \(\widetilde{\delta}_1\):
est1 <- lm(IQ ~ educ, data=df)
summary(est1)
##
## Call:
## lm(formula = IQ ~ educ, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.228 -7.262 0.907 8.772 37.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.6872 2.6229 20.47 <2e-16 ***
## educ 3.5338 0.1922 18.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 933 degrees of freedom
## Multiple R-squared: 0.2659, Adjusted R-squared: 0.2652
## F-statistic: 338 on 1 and 933 DF, p-value: < 2.2e-16
broom::tidy(est1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 53.7 2.62 20.5 3.36e-77
## 2 educ 3.53 0.192 18.4 1.16e-64
Now run a regression of log wage on educ to obtain \(\widetilde{\beta}_1\). Note: you’ll need to create the log wage variable first. If you can’t remember how to do that, refer back to previous labs.
est2 <- lm(logwage ~ educ, data=df)
tidy(est2)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.97 0.0814 73.4 0.
## 2 educ 0.0598 0.00596 10.0 1.42e-22
Now run the full regression of log wage on educ and IQ to obtain \(\hat{\beta}_1\) and \(\hat{\beta}_2\). Verify that \(\widetilde{\beta}_1 = \hat{\beta}_1+\hat{\beta}_2 \widetilde{\delta}_1\).
est3 <- lm(logwage ~ educ + IQ, data=df)
tidy(est3)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.66 0.0962 58.8 7.80e-316
## 2 educ 0.0391 0.00684 5.72 1.43e- 8
## 3 IQ 0.00586 0.000998 5.88 5.87e- 9
est2$coefficients["educ"] == est3$coefficients["educ"] +
est3$coefficients["IQ"]*est1$coefficients["educ"]
## educ
## TRUE
(The last line returns TRUE if the equality holds and FALSE if it doesn’t.)
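Note that == compares the two numbers exactly, which happens to work here; if rounding error ever made it return FALSE, a tolerance-based check such as all.equal() is a safer alternative (a quick sketch):
all.equal(unname(est2$coefficients["educ"]),
          unname(est3$coefficients["educ"] + est3$coefficients["IQ"]*est1$coefficients["educ"]))
You can also eyeball the identity with the rounded estimates above: \(0.0391 + 0.00586 \times 3.53 \approx 0.0598\), which matches \(\widetilde{\beta}_1\) from the simple regression.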
Is \(\widetilde{\beta}_1\) larger or smaller than \(\hat{\beta}_1\)? What does this mean in terms of omitted variable bias?
Now let’s see how to compute diagnostics of multicollinearity. Recall from @wooldridge that multicollinearity can better be thought of as “a problem with small sample sizes.” Let’s use the meapsingle data set from the wooldridge package. We are interested in the variable pctsgle, which gives the percentage of single-parent families residing in the same ZIP code as the school. The outcome variable is math4, the percentage of students at the school who passed the 4th grade state math test.
Load the data and run a regression of math4 on pctsgle. Interpret the slope coefficient of this regression. Does the effect seem large?
df <- as_tibble(meapsingle)
est <- lm(math4 ~ pctsgle, data=df) # slope on pctsgle: -0.833 (SE 0.071)
summary(est)
##
## Call:
## lm(formula = math4 ~ pctsgle, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.791 -8.310 1.600 8.092 50.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.77043 1.59680 60.60 <2e-16 ***
## pctsgle -0.83288 0.07068 -11.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.48 on 227 degrees of freedom
## Multiple R-squared: 0.3795, Adjusted R-squared: 0.3768
## F-statistic: 138.9 on 1 and 227 DF, p-value: < 2.2e-16
Now consider the same model, but with lmedinc and free as additional regressors. lmedinc is the log median household income in the ZIP code, and free is the percentage of students who qualify for free or reduced-price lunch. Do you think there might be a strong correlation between lmedinc and free? Compute the correlation. Does it have the sign you would expect? Do you think it’s close enough to 1 in absolute value that it would violate the “no perfect collinearity” assumption?
cor(df$lmedinc,df$free)
## [1] -0.7469703
Now run the model with pctsgle, lmedinc, and free as regressors.
est <- lm(math4 ~ pctsgle + lmedinc + free, data=df) # slope on pctsgle shrinks to -0.200 (SE 0.159)
summary(est)
##
## Call:
## lm(formula = math4 ~ pctsgle + lmedinc + free, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.919 -7.195 0.931 7.313 50.152
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.72322 58.47814 0.884 0.377
## pctsgle -0.19965 0.15872 -1.258 0.210
## lmedinc 3.56013 5.04170 0.706 0.481
## free -0.39642 0.07035 -5.635 5.2e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.7 on 225 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4526
## F-statistic: 63.85 on 3 and 225 DF, p-value: < 2.2e-16
Comment on the value of the pctsgle coefficient compared to the first regression you ran. What can you say about lmedinc and free as confounding variables?
A commonly used diagnostic of multicollinearity is the variance inflation factor (VIF). We can compute it with the vif() function from the car package. Let’s compute the VIFs from our estimates of the previous equation:
vif(est)
## pctsgle lmedinc free
## 5.740981 4.118812 3.188079
VIFs of 10 or more are typically thought to be problematic, because \(VIF_j = \frac{1}{1-R_j^2}\), where \(R_j^2\) is the R-squared from regressing \(x_j\) on the other regressors; a VIF above 10 therefore means \(R_j^2>0.9\). See p. 86 of @wooldridge.
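To see where these numbers come from, here is a quick sketch (not required for the lab) that reproduces the VIF for pctsgle by hand from its auxiliary regression:
aux <- lm(pctsgle ~ lmedinc + free, data=df) # regress x_j on the other regressors
1/(1 - summary(aux)$r.squared) # should match vif(est)["pctsgle"] above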
Multicollinearity is typically only a problem in small samples. Recall from @wooldridge that, under homoskedasticity, \(Var(\hat{\beta}_j) = \frac{\sigma^2}{SST_j(1-R_j^2)}\). As the sample size increases, \(R_j^2\) might decrease, and the total variation in \(x_j\) (\(SST_j\)) definitely increases, so the variance of \(\hat{\beta}_j\) shrinks even if the regressors remain highly correlated. So multicollinearity is typically not a problem we worry about much.
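If you want to see this in action, here is a small simulation sketch (not part of the lab; the data-generating process is invented purely for illustration): two regressors with a correlation of about 0.9, where the standard error on x1 keeps shrinking as the sample grows even though the correlation does not change.
set.seed(1234)
sim_se <- function(n) {
  x1 <- rnorm(n)
  x2 <- 0.9*x1 + sqrt(1 - 0.9^2)*rnorm(n) # corr(x1, x2) is about 0.9 at any n
  y  <- 1 + 0.5*x1 + 0.5*x2 + rnorm(n)
  summary(lm(y ~ x1 + x2))$coefficients["x1", "Std. Error"]
}
sapply(c(50, 500, 5000), sim_se) # the SE on x1 falls roughly with sqrt(n)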
Wooldridge, Jeffrey M. 2015. *Introductory Econometrics: A Modern Approach*. 6th ed. Cengage Learning.