The purpose of this in-class lab13 is to use R to practice with instrumental variables estimation. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
You may need to install the package AER
. (It may have already been installed when you previously installed car
and zoo
.)
Open up a new R script (named ICL13_XYZ.R
, where XYZ
are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(wooldridge)
library(broom)
library(AER)
library(magrittr)
library(stargazer)
We’re going to use data on fertility of Botswanian women.
df <- as_tibble(fertil2)
df %>% glimpse()
## Observations: 4,361
## Variables: 27
## $ mnthborn <int> 5, 1, 7, 11, 5, 8, 7, 9, 12, 9, 6, 10, 12, 2, 1, 6, 1, 8, ...
## $ yearborn <int> 64, 56, 58, 45, 45, 52, 51, 70, 53, 39, 46, 59, 42, 40, 53...
## $ age <int> 24, 32, 30, 42, 43, 36, 37, 18, 34, 49, 42, 29, 45, 48, 35...
## $ electric <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ radio <int> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ tv <int> 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ bicycle <int> 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0...
## $ educ <int> 12, 13, 5, 4, 11, 7, 16, 10, 5, 4, 15, 7, 0, 4, 12, 7, 7, ...
## $ ceb <int> 0, 3, 1, 3, 2, 1, 4, 0, 1, 0, 3, 3, 4, 10, 3, 0, 4, 2, 0, ...
## $ agefbrth <int> NA, 25, 27, 17, 24, 26, 20, NA, 19, NA, 25, 23, 18, 19, 23...
## $ children <int> 0, 3, 1, 2, 2, 1, 4, 0, 1, 0, 3, 3, 2, 8, 3, 0, 4, 2, 0, 1...
## $ knowmeth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ usemeth <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1...
## $ monthfm <int> NA, 11, 6, 1, 3, 11, 5, NA, 7, 11, 6, 1, 1, 10, 1, NA, NA,...
## $ yearfm <int> NA, 80, 83, 61, 66, 76, 78, NA, 72, 61, 70, 84, 66, 66, 74...
## $ agefm <int> NA, 24, 24, 15, 20, 24, 26, NA, 18, 22, 24, 24, 23, 26, 21...
## $ idlnchld <int> 2, 3, 5, 3, 2, 4, 4, 4, 4, 4, 3, 6, 6, 4, 3, 4, 5, 1, 2, 3...
## $ heduc <int> NA, 12, 7, 11, 14, 9, 17, NA, 3, 1, 16, 7, NA, 3, 16, NA, ...
## $ agesq <int> 576, 1024, 900, 1764, 1849, 1296, 1369, 324, 1156, 2401, 1...
## $ urban <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ urb_educ <int> 12, 13, 5, 4, 11, 7, 16, 10, 5, 4, 15, 7, 0, 4, 12, 7, 7, ...
## $ spirit <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1...
## $ protest <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0...
## $ catholic <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ frsthalf <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0...
## $ educ0 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ evermarr <int> 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1...
We can easily compute summary statistics of our data by using the stargazer
package:
df %>% as.data.frame %>% stargazer(type="text")
##
## ================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## ----------------------------------------------------------------
## mnthborn 4,361 6.331 3.323 1 3 9 12
## yearborn 4,361 60.434 8.683 38 55 68 73
## age 4,361 27.405 8.685 15 20 33 49
## electric 4,358 0.140 0.347 0.000 0.000 0.000 1.000
## radio 4,359 0.702 0.458 0.000 0.000 1.000 1.000
## tv 4,359 0.093 0.290 0.000 0.000 0.000 1.000
## bicycle 4,358 0.276 0.447 0.000 0.000 1.000 1.000
## educ 4,361 5.856 3.927 0 3 8 20
## ceb 4,361 2.442 2.407 0 1 4 13
## agefbrth 3,273 19.011 3.092 10.000 17.000 20.000 38.000
## children 4,361 2.268 2.222 0 0 4 13
## knowmeth 4,354 0.963 0.188 0.000 1.000 1.000 1.000
## usemeth 4,290 0.578 0.494 0.000 0.000 1.000 1.000
## monthfm 2,079 6.270 3.620 1.000 3.000 9.000 12.000
## yearfm 2,079 76.912 7.760 50.000 72.000 83.000 88.000
## agefm 2,079 20.686 5.002 10.000 17.000 23.000 46.000
## idlnchld 4,241 4.616 2.219 0.000 3.000 6.000 20.000
## heduc 1,956 5.145 4.803 0.000 0.000 8.000 20.000
## agesq 4,361 826.460 526.923 225 400 1,089 2,401
## urban 4,361 0.517 0.500 0 0 1 1
## urb_educ 4,361 3.469 4.294 0 0 7 20
## spirit 4,361 0.422 0.494 0 0 1 1
## protest 4,361 0.228 0.419 0 0 0 1
## catholic 4,361 0.102 0.303 0 0 0 1
## frsthalf 4,361 0.540 0.498 0 0 1 1
## educ0 4,361 0.208 0.406 0 0 0 1
## evermarr 4,361 0.477 0.500 0 0 1 1
## ----------------------------------------------------------------
or:
df %>% as.data.frame() %>% skimr::skim()
Name | Piped data |
Number of rows | 4361 |
Number of columns | 27 |
_______________________ | |
Column type frequency: | |
numeric | 27 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
mnthborn | 0 | 1.00 | 6.33 | 3.32 | 1 | 3 | 6 | 9 | 12 | ▇▅▇▆▆ |
yearborn | 0 | 1.00 | 60.43 | 8.68 | 38 | 55 | 62 | 68 | 73 | ▂▃▆▇▇ |
age | 0 | 1.00 | 27.41 | 8.69 | 15 | 20 | 26 | 33 | 49 | ▇▇▆▃▂ |
electric | 3 | 1.00 | 0.14 | 0.35 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
radio | 2 | 1.00 | 0.70 | 0.46 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
tv | 2 | 1.00 | 0.09 | 0.29 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
bicycle | 3 | 1.00 | 0.28 | 0.45 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
educ | 0 | 1.00 | 5.86 | 3.93 | 0 | 3 | 7 | 8 | 20 | ▆▇▅▁▁ |
ceb | 0 | 1.00 | 2.44 | 2.41 | 0 | 1 | 2 | 4 | 13 | ▇▃▁▁▁ |
agefbrth | 1088 | 0.75 | 19.01 | 3.09 | 10 | 17 | 19 | 20 | 38 | ▁▇▂▁▁ |
children | 0 | 1.00 | 2.27 | 2.22 | 0 | 0 | 2 | 4 | 13 | ▇▃▁▁▁ |
knowmeth | 7 | 1.00 | 0.96 | 0.19 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
usemeth | 71 | 0.98 | 0.58 | 0.49 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
monthfm | 2282 | 0.48 | 6.27 | 3.62 | 1 | 3 | 6 | 9 | 12 | ▇▃▅▅▇ |
yearfm | 2282 | 0.48 | 76.91 | 7.76 | 50 | 72 | 78 | 83 | 88 | ▁▂▃▇▇ |
agefm | 2282 | 0.48 | 20.69 | 5.00 | 10 | 17 | 20 | 23 | 46 | ▅▇▂▁▁ |
idlnchld | 120 | 0.97 | 4.62 | 2.22 | 0 | 3 | 4 | 6 | 20 | ▇▅▁▁▁ |
heduc | 2405 | 0.45 | 5.14 | 4.80 | 0 | 0 | 6 | 8 | 20 | ▇▅▃▁▁ |
agesq | 0 | 1.00 | 826.46 | 526.92 | 225 | 400 | 676 | 1089 | 2401 | ▇▅▂▁▁ |
urban | 0 | 1.00 | 0.52 | 0.50 | 0 | 0 | 1 | 1 | 1 | ▇▁▁▁▇ |
urb_educ | 0 | 1.00 | 3.47 | 4.29 | 0 | 0 | 0 | 7 | 20 | ▇▃▂▁▁ |
spirit | 0 | 1.00 | 0.42 | 0.49 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▆ |
protest | 0 | 1.00 | 0.23 | 0.42 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
catholic | 0 | 1.00 | 0.10 | 0.30 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
frsthalf | 0 | 1.00 | 0.54 | 0.50 | 0 | 0 | 1 | 1 | 1 | ▇▁▁▁▇ |
educ0 | 0 | 1.00 | 0.21 | 0.41 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
evermarr | 0 | 1.00 | 0.48 | 0.50 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▇ |
Suppose we want to see if education causes lower fertility (as can be seen when comparing more- and less-educated countries):
\[ children = \beta_0 + \beta_1 educ + \beta_2 age + \beta_3 age^2 + u \]
where \(children\) is the number of children born to the woman, \(educ\) is years of education, and \(age\) is age (in years).
est.ols.rob <- estimatr::lm_robust(children ~ educ + age + I(age^2), data=df)
est.ols <- lm(children ~ educ + age + I(age^2), data=df)
summary(est.ols)
##
## Call:
## lm(formula = children ~ educ + age + I(age^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8351 -0.7135 -0.0054 0.7141 7.5055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1383066 0.2405942 -17.200 <2e-16 ***
## educ -0.0905755 0.0059207 -15.298 <2e-16 ***
## age 0.3324486 0.0165495 20.088 <2e-16 ***
## I(age^2) -0.0026308 0.0002726 -9.651 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.46 on 4357 degrees of freedom
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.5684
## F-statistic: 1915 on 3 and 4357 DF, p-value: < 2.2e-16
summary(est.ols.rob)
##
## Call:
## estimatr::lm_robust(formula = children ~ educ + age + I(age^2),
## data = df)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) -4.138307 0.2438139 -16.973 1.282e-62 -4.616306 -3.66031 4357
## educ -0.090575 0.0060502 -14.971 1.901e-49 -0.102437 -0.07871 4357
## age 0.332449 0.0192233 17.294 7.309e-65 0.294761 0.37014 4357
## I(age^2) -0.002631 0.0003523 -7.468 9.800e-14 -0.003321 -0.00194 4357
##
## Multiple R-squared: 0.5687 , Adjusted R-squared: 0.5684
## F-statistic: 1922 on 3 and 4357 DF, p-value: < 2.2e-16
(Note: include I(age^2)
puts the quadratic term in automatically without us having to use mutate()
to create a new variable called age.sq
.)
We can also use stargazer to examine the output. It puts the standard errors of each variable in parentheses under the estimated coefficient.
stargazer(est.ols, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## children
## -----------------------------------------------
## educ -0.091***
## (0.006)
##
## age 0.332***
## (0.017)
##
## I(age2) -0.003***
## (0.0003)
##
## Constant -4.138***
## (0.241)
##
## -----------------------------------------------
## Observations 4,361
## R2 0.569
## Adjusted R2 0.568
## Residual Std. Error 1.460 (df = 4357)
## F Statistic 1,915.196*** (df = 3; 4357)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We know that education is endogenous (i.e. people choose the level of education that maximizes their utility). A possible instrument for education is \(firsthalf\), which is a dummy equal to 1 if the woman was born in the first half of the calendar year, and 0 otherwise.
Let’s create this variable:
df %<>%
dplyr::mutate(firsthalf = mnthborn<7)
We will assume that \(firsthalf\) is uncorrelated with \(u\).
est.iv1
.est.iv1 <- lm(educ ~ firsthalf, data=df)
stargazer(est.iv1, type="text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## educ
## -----------------------------------------------
## firsthalf -0.938***
## (0.118)
##
## Constant 6.363***
## (0.087)
##
## -----------------------------------------------
## Observations 4,361
## R2 0.014
## Adjusted R2 0.014
## Residual Std. Error 3.900 (df = 4359)
## F Statistic 62.620*** (df = 1; 4359)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Now let’s do the IV regression:
est.iv <- ivreg(children ~ educ + age + I(age^2) | firsthalf + age + I(age^2), data=df)
summary(est.iv)
##
## Call:
## ivreg(formula = children ~ educ + age + I(age^2) | firsthalf +
## age + I(age^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.05272 -0.71481 0.06224 0.76236 7.23693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.3878054 0.5481502 -6.180 6.98e-10 ***
## educ -0.1714989 0.0531796 -3.225 0.00127 **
## age 0.3236052 0.0178596 18.119 < 2e-16 ***
## I(age^2) -0.0026723 0.0002797 -9.555 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.491 on 4357 degrees of freedom
## Multiple R-Squared: 0.5502, Adjusted R-squared: 0.5499
## Wald test: 1765 on 3 and 4357 DF, p-value: < 2.2e-16
summary(est.ols)
##
## Call:
## lm(formula = children ~ educ + age + I(age^2), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8351 -0.7135 -0.0054 0.7141 7.5055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.1383066 0.2405942 -17.200 <2e-16 ***
## educ -0.0905755 0.0059207 -15.298 <2e-16 ***
## age 0.3324486 0.0165495 20.088 <2e-16 ***
## I(age^2) -0.0026308 0.0002726 -9.651 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.46 on 4357 degrees of freedom
## Multiple R-squared: 0.5687, Adjusted R-squared: 0.5684
## F-statistic: 1915 on 3 and 4357 DF, p-value: < 2.2e-16
The variables on the right hand side of the |
are the instruments (including the \(x\)’s that we assume to be exogenous, like \(age\)). The endogenous \(x\) is the first one after the ~
.
Now we can compare the output for each of the models:
stargazer(est.ols,est.iv1,est.iv, type="text")
##
## ==========================================================================================
## Dependent variable:
## ----------------------------------------------------------------------
## children educ children
## OLS OLS instrumental
## variable
## (1) (2) (3)
## ------------------------------------------------------------------------------------------
## educ -0.091*** -0.171***
## (0.006) (0.053)
##
## age 0.332*** 0.324***
## (0.017) (0.018)
##
## I(age2) -0.003*** -0.003***
## (0.0003) (0.0003)
##
## firsthalf -0.938***
## (0.118)
##
## Constant -4.138*** 6.363*** -3.388***
## (0.241) (0.087) (0.548)
##
## ------------------------------------------------------------------------------------------
## Observations 4,361 4,361 4,361
## R2 0.569 0.014 0.550
## Adjusted R2 0.568 0.014 0.550
## Residual Std. Error 1.460 (df = 4357) 3.900 (df = 4359) 1.491 (df = 4357)
## F Statistic 1,915.196*** (df = 3; 4357) 62.620*** (df = 1; 4359)
## ==========================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01