The purpose of this in-class lab is to use R to practice with two-stage least squares (2SLS) estimation. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL14_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(wooldridge)
library(broom)
library(AER)
library(magrittr)
library(stargazer)
We’re going to use data on working women (the mroz dataset from the wooldridge package).
df <- as_tibble(mroz)
df %>% glimpse()
## Observations: 753
## Variables: 22
## $ inlf <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ hours <int> 1610, 1656, 1980, 456, 1568, 2032, 1440, 1020, 1458, 1600,...
## $ kidslt6 <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ kidsge6 <int> 0, 2, 3, 3, 2, 0, 2, 0, 2, 2, 1, 1, 2, 2, 1, 3, 2, 5, 0, 4...
## $ age <int> 32, 30, 35, 34, 31, 54, 37, 54, 48, 39, 33, 42, 30, 43, 43...
## $ educ <int> 12, 12, 12, 12, 14, 12, 16, 12, 12, 12, 12, 11, 12, 12, 10...
## $ wage <dbl> 3.3540, 1.3889, 4.5455, 1.0965, 4.5918, 4.7421, 8.3333, 7....
## $ repwage <dbl> 2.65, 2.65, 4.04, 3.25, 3.60, 4.70, 5.95, 9.98, 0.00, 4.15...
## $ hushrs <int> 2708, 2310, 3072, 1920, 2000, 1040, 2670, 4120, 1995, 2100...
## $ husage <int> 34, 30, 40, 53, 32, 57, 37, 53, 52, 43, 34, 47, 33, 46, 45...
## $ huseduc <int> 12, 9, 12, 10, 12, 11, 12, 8, 4, 12, 12, 14, 16, 12, 17, 1...
## $ huswage <dbl> 4.0288, 8.4416, 3.5807, 3.5417, 10.0000, 6.7106, 3.4277, 2...
## $ faminc <dbl> 16310, 21800, 21040, 7300, 27300, 19495, 21152, 18900, 204...
## $ mtr <dbl> 0.7215, 0.6615, 0.6915, 0.7815, 0.6215, 0.6915, 0.6915, 0....
## $ motheduc <int> 12, 7, 12, 7, 12, 14, 14, 3, 7, 7, 12, 14, 16, 10, 7, 16, ...
## $ fatheduc <int> 7, 7, 7, 7, 14, 7, 7, 3, 7, 7, 3, 7, 16, 10, 7, 10, 7, 12,...
## $ unem <dbl> 5.0, 11.0, 5.0, 5.0, 9.5, 7.5, 5.0, 5.0, 3.0, 5.0, 5.0, 5....
## $ city <int> 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0...
## $ exper <int> 14, 5, 15, 6, 7, 33, 11, 35, 24, 21, 15, 14, 0, 14, 6, 9, ...
## $ nwifeinc <dbl> 10.910060, 19.499981, 12.039910, 6.799996, 20.100058, 9.85...
## $ lwage <dbl> 1.21015370, 0.32851210, 1.51413774, 0.09212332, 1.52427220...
## $ expersq <int> 196, 25, 225, 36, 49, 1089, 121, 1225, 576, 441, 225, 196,...
Like last time, let’s use stargazer to get a quick view of our data:
df %>% as.data.frame %>% stargazer(type="text")
##
## ===================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------
## inlf 753 0.568 0.496 0 0 1 1
## hours 753 740.576 871.314 0 0 1,516 4,950
## kidslt6 753 0.238 0.524 0 0 0 3
## kidsge6 753 1.353 1.320 0 0 2 8
## age 753 42.538 8.073 30 36 49 60
## educ 753 12.287 2.280 5 12 13 17
## wage 428 4.178 3.310 0.128 2.263 4.971 25.000
## repwage 753 1.850 2.420 0.000 0.000 3.580 9.980
## hushrs 753 2,267.271 595.567 175 1,928 2,553 5,010
## husage 753 45.121 8.059 30 38 52 60
## huseduc 753 12.491 3.021 3 11 15 17
## huswage 753 7.482 4.231 0.412 4.788 9.167 40.509
## faminc 753 23,080.600 12,190.200 1,500 15,428 28,200 96,000
## mtr 753 0.679 0.083 0.442 0.622 0.721 0.942
## motheduc 753 9.251 3.367 0 7 12 17
## fatheduc 753 8.809 3.572 0 7 12 17
## unem 753 8.624 3.115 3 7.5 11 14
## city 753 0.643 0.480 0 0 1 1
## exper 753 10.631 8.069 0 4 15 45
## nwifeinc 753 20.129 11.635 -0.029 13.025 24.466 96.000
## lwage 428 1.190 0.723 -2.054 0.817 1.604 3.219
## expersq 753 178.039 249.631 0 16 225 2,025
## -------------------------------------------------------------------
Why do wage and lwage have only 428 observations, while all of the other variables have 753? Using the filter() and is.na() functions, drop the observations with missing wages.
# keep only women with observed wages
df %<>%
  dplyr::filter(!is.na(wage))
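To confirm the filter worked, check that the remaining row count matches the 428 wage observations reported in the summary table above:
df %>% nrow()
## [1] 428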
We want to estimate the return to education for women who are working, using mother’s and father’s education as instruments:
\[ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u \]
where \(wage\) is the hourly rate of pay, \(educ\) is years of education, and \(exper\) is labor market experience (in years).
Let’s estimate the first stage regression, which is a regression of the endogenous variable (\(educ\)) on the instrument(s) (\(motheduc\) and \(fatheduc\)) and the exogenous explanatory variables (\(exper\) and \(exper^2\)).1
Run this regression. Call the estimation object est.stage1.
est.stage1 <- lm(educ ~ motheduc + fatheduc + exper + I(exper^2), data=df)
summary(est.stage1)
##
## Call:
## lm(formula = educ ~ motheduc + fatheduc + exper + I(exper^2),
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8057 -1.0520 -0.0371 1.0258 6.3787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.102640 0.426561 21.340 < 2e-16 ***
## motheduc 0.157597 0.035894 4.391 1.43e-05 ***
## fatheduc 0.189548 0.033756 5.615 3.56e-08 ***
## exper 0.045225 0.040251 1.124 0.262
## I(exper^2) -0.001009 0.001203 -0.839 0.402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.039 on 423 degrees of freedom
## Multiple R-squared: 0.2115, Adjusted R-squared: 0.204
## F-statistic: 28.36 on 4 and 423 DF, p-value: < 2.2e-16
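Before proceeding, we should verify that the instruments are relevant, i.e. that motheduc and fatheduc are jointly significant in the first stage. We can test this with the linearHypothesis() function (from the car package, which AER loads):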
linearHypothesis(est.stage1,c("motheduc","fatheduc"))
## Linear hypothesis test
##
## Hypothesis:
## motheduc = 0
## fatheduc = 0
##
## Model 1: restricted model
## Model 2: educ ~ motheduc + fatheduc + exper + I(exper^2)
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 425 2219.2
## 2 423 1758.6 2 460.64 55.4 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
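The instruments easily pass this test: the F statistic of 55.4 is far above the common rule-of-thumb threshold of 10, so weak instruments are not a concern here.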
In the second stage, we estimate the log wage equation above, but this time we include \(\widehat{educ}\) on the right hand side instead of \(educ\), where \(\widehat{educ}\) are the fitted values from the first stage.
In R, we can easily access the fitted values by typing fitted(est.stage1).
Let’s estimate the second stage regression:
est.stage2 <- lm(log(wage) ~ fitted(est.stage1) + exper + I(exper^2), data=df)
summary(est.stage2)
##
## Call:
## lm(formula = log(wage) ~ fitted(est.stage1) + exper + I(exper^2),
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1631 -0.3539 0.0326 0.3818 2.3727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0481003 0.4197565 0.115 0.90882
## fitted(est.stage1) 0.0613966 0.0329624 1.863 0.06321 .
## exper 0.0441704 0.0140844 3.136 0.00183 **
## I(exper^2) -0.0008990 0.0004212 -2.134 0.03338 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7075 on 424 degrees of freedom
## Multiple R-squared: 0.04978, Adjusted R-squared: 0.04306
## F-statistic: 7.405 on 3 and 424 DF, p-value: 7.615e-05
The standard errors from the above second stage regression will be incorrect.2 Instead, we should estimate both stages at the same time. We can do this with the ivreg() function, just like in the previous lab.
est.2sls <- ivreg(log(wage) ~ educ + exper + I(exper^2) | motheduc + fatheduc + exper + I(exper^2), data=df)
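The comparison table below also needs an OLS benchmark (the est.ols object in the stargazer call). Assuming it uses the same specification as the structural equation above, a minimal sketch:
# OLS benchmark: log(wage) on educ, exper, and exper^2
est.ols <- lm(log(wage) ~ educ + exper + I(exper^2), data=df)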
stargazer(est.ols,
est.stage2,
est.2sls, type="text")
##
## ==============================================================
## Dependent variable:
## -------------------------------
## log(wage)
## OLS instrumental
## variable
## (1) (2) (3)
## --------------------------------------------------------------
## educ 0.107*** 0.061*
## (0.014) (0.031)
##
## fitted(est.stage1) 0.061*
## (0.033)
##
## exper 0.042*** 0.044*** 0.044***
## (0.013) (0.014) (0.013)
##
## I(exper2) -0.001** -0.001** -0.001**
## (0.0004) (0.0004) (0.0004)
##
## Constant -0.522*** 0.048 0.048
## (0.199) (0.420) (0.400)
##
## --------------------------------------------------------------
## Observations 428 428 428
## R2 0.157 0.050 0.136
## Adjusted R2 0.151 0.043 0.130
## Residual Std. Error (df = 424) 0.666 0.707 0.675
## F Statistic (df = 3; 424) 26.286*** 7.405***
## ==============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
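The 2SLS point estimates in column (3) are identical to the manual second-stage estimates in column (2), but the standard errors differ; the ivreg() standard errors are the correct ones. Note also how the estimated return to education falls from about 10.7% per year under OLS to about 6.1% under 2SLS. If you want 2SLS diagnostics (a weak-instruments test, a Wu-Hausman endogeneity test, and a Sargan overidentification test), the summary() method for ivreg objects can report them:
summary(est.2sls, diagnostics=TRUE)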
Note that you can easily include the quadratic in experience as I(exper^2) without having to create this variable in a mutate() statement.↩
The reason is that the error term in the second stage regression includes the residuals from the first stage, but the OLS standard errors fail to take this into account.↩