The purpose of this in-class lab is to use R to practice two-stage least squares (2SLS) estimation. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.

14.1 For starters

Open up a new R script (named ICL14_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:

# Add names of group members HERE
library(tidyverse)
library(wooldridge)
library(broom)
library(AER)
library(magrittr)
library(stargazer)

14.1.1 Load the data

We’re going to use the mroz data on working women, which comes with the wooldridge package.

df <- as_tibble(mroz)
df %>% glimpse()
## Observations: 753
## Variables: 22
## $ inlf     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ hours    <int> 1610, 1656, 1980, 456, 1568, 2032, 1440, 1020, 1458, 1600,...
## $ kidslt6  <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ kidsge6  <int> 0, 2, 3, 3, 2, 0, 2, 0, 2, 2, 1, 1, 2, 2, 1, 3, 2, 5, 0, 4...
## $ age      <int> 32, 30, 35, 34, 31, 54, 37, 54, 48, 39, 33, 42, 30, 43, 43...
## $ educ     <int> 12, 12, 12, 12, 14, 12, 16, 12, 12, 12, 12, 11, 12, 12, 10...
## $ wage     <dbl> 3.3540, 1.3889, 4.5455, 1.0965, 4.5918, 4.7421, 8.3333, 7....
## $ repwage  <dbl> 2.65, 2.65, 4.04, 3.25, 3.60, 4.70, 5.95, 9.98, 0.00, 4.15...
## $ hushrs   <int> 2708, 2310, 3072, 1920, 2000, 1040, 2670, 4120, 1995, 2100...
## $ husage   <int> 34, 30, 40, 53, 32, 57, 37, 53, 52, 43, 34, 47, 33, 46, 45...
## $ huseduc  <int> 12, 9, 12, 10, 12, 11, 12, 8, 4, 12, 12, 14, 16, 12, 17, 1...
## $ huswage  <dbl> 4.0288, 8.4416, 3.5807, 3.5417, 10.0000, 6.7106, 3.4277, 2...
## $ faminc   <dbl> 16310, 21800, 21040, 7300, 27300, 19495, 21152, 18900, 204...
## $ mtr      <dbl> 0.7215, 0.6615, 0.6915, 0.7815, 0.6215, 0.6915, 0.6915, 0....
## $ motheduc <int> 12, 7, 12, 7, 12, 14, 14, 3, 7, 7, 12, 14, 16, 10, 7, 16, ...
## $ fatheduc <int> 7, 7, 7, 7, 14, 7, 7, 3, 7, 7, 3, 7, 16, 10, 7, 10, 7, 12,...
## $ unem     <dbl> 5.0, 11.0, 5.0, 5.0, 9.5, 7.5, 5.0, 5.0, 3.0, 5.0, 5.0, 5....
## $ city     <int> 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0...
## $ exper    <int> 14, 5, 15, 6, 7, 33, 11, 35, 24, 21, 15, 14, 0, 14, 6, 9, ...
## $ nwifeinc <dbl> 10.910060, 19.499981, 12.039910, 6.799996, 20.100058, 9.85...
## $ lwage    <dbl> 1.21015370, 0.32851210, 1.51413774, 0.09212332, 1.52427220...
## $ expersq  <int> 196, 25, 225, 36, 49, 1089, 121, 1225, 576, 441, 225, 196,...

14.1.2 Summary statistics

Like last time, let’s use stargazer to get a quick view of our data:

df %>% as.data.frame() %>% stargazer(type="text")
## 
## ===================================================================
## Statistic  N     Mean     St. Dev.   Min   Pctl(25) Pctl(75)  Max  
## -------------------------------------------------------------------
## inlf      753   0.568      0.496      0       0        1       1   
## hours     753  740.576    871.314     0       0      1,516   4,950 
## kidslt6   753   0.238      0.524      0       0        0       3   
## kidsge6   753   1.353      1.320      0       0        2       8   
## age       753   42.538     8.073      30      36       49      60  
## educ      753   12.287     2.280      5       12       13      17  
## wage      428   4.178      3.310    0.128   2.263    4.971   25.000
## repwage   753   1.850      2.420    0.000   0.000    3.580   9.980 
## hushrs    753 2,267.271   595.567    175    1,928    2,553   5,010 
## husage    753   45.121     8.059      30      38       52      60  
## huseduc   753   12.491     3.021      3       11       15      17  
## huswage   753   7.482      4.231    0.412   4.788    9.167   40.509
## faminc    753 23,080.600 12,190.200 1,500   15,428   28,200  96,000
## mtr       753   0.679      0.083    0.442   0.622    0.721   0.942 
## motheduc  753   9.251      3.367      0       7        12      17  
## fatheduc  753   8.809      3.572      0       7        12      17  
## unem      753   8.624      3.115      3      7.5       11      14  
## city      753   0.643      0.480      0       0        1       1   
## exper     753   10.631     8.069      0       4        15      45  
## nwifeinc  753   20.129     11.635   -0.029  13.025   24.466  96.000
## lwage     428   1.190      0.723    -2.054  0.817    1.604   3.219 
## expersq   753  178.039    249.631     0       16      225    2,025 
## -------------------------------------------------------------------
  1. Is it a problem that wage and lwage have 428 observations, but all of the other variables have 753 observations?

14.1.3 Drop missing wages

Using the filter() and is.na() functions, drop the observations with missing wages.

df %<>% 
  dplyr::filter(!is.na(wage))
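
As a quick sanity check, the filtered data should match the wage count reported in the summary table above (a sketch; 428 is the N that stargazer reported for wage and lwage):

```r
# After dropping missing wages, the sample shrinks from 753 to 428
nrow(df)
## [1] 428
```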

14.2 The model

We want to estimate the return to education for women who are working, using mother’s and father’s education as instruments:

\[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u \]

where \(wage\) is the hourly rate of pay, \(educ\) is years of education, and \(exper\) is labor market experience (in years).

14.2.1 First stage regression

Let’s estimate the first stage regression, which is a regression of the endogenous variable (\(educ\)) on the instrument(s) (\(motheduc\) and \(fatheduc\)) and the exogenous explanatory variables (\(exper\) and \(exper^2\)).1

Run this regression. Call the estimation object est.stage1.

est.stage1 <- lm(educ ~ motheduc + fatheduc + exper + I(exper^2), data=df)
summary(est.stage1)
## 
## Call:
## lm(formula = educ ~ motheduc + fatheduc + exper + I(exper^2), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8057 -1.0520 -0.0371  1.0258  6.3787 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.102640   0.426561  21.340  < 2e-16 ***
## motheduc     0.157597   0.035894   4.391 1.43e-05 ***
## fatheduc     0.189548   0.033756   5.615 3.56e-08 ***
## exper        0.045225   0.040251   1.124    0.262    
## I(exper^2)  -0.001009   0.001203  -0.839    0.402    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.039 on 423 degrees of freedom
## Multiple R-squared:  0.2115, Adjusted R-squared:  0.204 
## F-statistic: 28.36 on 4 and 423 DF,  p-value: < 2.2e-16
  1. Double check that \(motheduc\) and \(fatheduc\) are jointly significant with an F-stat larger than 10:
linearHypothesis(est.stage1, c("motheduc","fatheduc"))
## Linear hypothesis test
## 
## Hypothesis:
## motheduc = 0
## fatheduc = 0
## 
## Model 1: restricted model
## Model 2: educ ~ motheduc + fatheduc + exper + I(exper^2)
## 
##   Res.Df    RSS Df Sum of Sq    F    Pr(>F)    
## 1    425 2219.2                                
## 2    423 1758.6  2    460.64 55.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.2.2 Second stage regression

In the second stage, we estimate the log wage equation above, but this time we include \(\widehat{educ}\) on the right hand side instead of \(educ\), where \(\widehat{educ}\) are the fitted values from the first stage.

In R, we can easily access the fitted values by typing fitted(est.stage1).

Let’s estimate the second stage regression:

est.stage2 <- lm(log(wage) ~ fitted(est.stage1) + exper + I(exper^2), data=df)
summary(est.stage2)
## 
## Call:
## lm(formula = log(wage) ~ fitted(est.stage1) + exper + I(exper^2), 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1631 -0.3539  0.0326  0.3818  2.3727 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         0.0481003  0.4197565   0.115  0.90882   
## fitted(est.stage1)  0.0613966  0.0329624   1.863  0.06321 . 
## exper               0.0441704  0.0140844   3.136  0.00183 **
## I(exper^2)         -0.0008990  0.0004212  -2.134  0.03338 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7075 on 424 degrees of freedom
## Multiple R-squared:  0.04978,    Adjusted R-squared:  0.04306 
## F-statistic: 7.405 on 3 and 424 DF,  p-value: 7.615e-05

14.2.3 Both stages at once

The standard errors from the above second stage regression will be incorrect.2 Instead, we should estimate both stages at the same time. We can do this with the ivreg() function, just like in the previous lab.

est.2sls <- ivreg(log(wage) ~ educ + exper + I(exper^2) | motheduc + fatheduc + exper + I(exper^2), data=df)
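
If you pass diagnostics = TRUE to summary(), ivreg will also report a weak-instruments F-test, a Wu–Hausman endogeneity test, and a Sargan overidentification test (a sketch; the exact layout of the output depends on your version of the AER package):

```r
# diagnostics = TRUE appends weak-instrument, Wu-Hausman, and Sargan tests
# to the usual coefficient table
summary(est.2sls, diagnostics = TRUE)
```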
  1. Estimate the OLS model (where \(educ\) is not instrumented). Then compare the output for all three models (OLS, 2SLS “by hand”, 2SLS “automatic”).
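One way to estimate the OLS comparison model (a sketch; call the object est.ols so the stargazer call can find it):

```r
# Plain OLS: educ enters directly, without instrumenting
est.ols <- lm(log(wage) ~ educ + exper + I(exper^2), data=df)
```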
stargazer(est.ols,
          est.stage2,
          est.2sls, type="text")
## 
## ==============================================================
##                                      Dependent variable:      
##                                -------------------------------
##                                           log(wage)           
##                                       OLS         instrumental
##                                                     variable  
##                                   (1)      (2)        (3)     
## --------------------------------------------------------------
## educ                           0.107***              0.061*   
##                                 (0.014)             (0.031)   
##                                                               
## fitted(est.stage1)                        0.061*              
##                                          (0.033)              
##                                                               
## exper                          0.042***  0.044***   0.044***  
##                                 (0.013)  (0.014)    (0.013)   
##                                                               
## I(exper2)                      -0.001**  -0.001**   -0.001**  
##                                (0.0004)  (0.0004)   (0.0004)  
##                                                               
## Constant                       -0.522***  0.048      0.048    
##                                 (0.199)  (0.420)    (0.400)   
##                                                               
## --------------------------------------------------------------
## Observations                      428      428        428     
## R2                               0.157    0.050      0.136    
## Adjusted R2                      0.151    0.043      0.130    
## Residual Std. Error (df = 424)   0.666    0.707      0.675    
## F Statistic (df = 3; 424)      26.286*** 7.405***             
## ==============================================================
## Note:                              *p<0.1; **p<0.05; ***p<0.01
  1. Comment on the IV estimates. Do they make sense, given the direction of bias we would expect in the OLS estimate of the returns to education? Is the exogeneity condition on \(motheduc\) and \(fatheduc\) plausible?

  1. Note that you can easily include the quadratic in experience as I(exper^2) without having to create this variable in a mutate() statement.
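
If you prefer an explicit variable, a mutate()-based alternative looks like this (a sketch; mroz in fact already ships with an expersq column):

```r
# Equivalent first stage using an explicitly created square term
df <- df %>% mutate(expersq = exper^2)
est.alt <- lm(educ ~ motheduc + fatheduc + exper + expersq, data = df)
```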

  2. The reason is that the error term in the second stage regression includes the residuals from the first stage, but the OLS standard errors fail to take this into account.