13工具变量估计

The purpose of this in-class lab13 is to use R to practice with instrumental variables estimation. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.

13.1 For starters

You may need to install the package AER. (It may have already been installed when you previously installed car and zoo.)

Open up a new R script (named ICL13_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:

# Add names of group members HERE
library(tidyverse)
library(wooldridge)
library(broom)
library(AER)
library(magrittr)
library(stargazer)

13.1.1 Load the data

We’re going to use data on fertility of Botswanian women.

df <- as_tibble(fertil2)
df %>% glimpse()

## Observations: 4,361
## Variables: 27
## $ mnthborn <int> 5, 1, 7, 11, 5, 8, 7, 9, 12, 9, 6, 10, 12, 2, 1, 6, 1, 8, ...
## $ yearborn <int> 64, 56, 58, 45, 45, 52, 51, 70, 53, 39, 46, 59, 42, 40, 53...
## $ age      <int> 24, 32, 30, 42, 43, 36, 37, 18, 34, 49, 42, 29, 45, 48, 35...
## $ electric <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ radio    <int> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ tv       <int> 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1...
## $ bicycle  <int> 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0...
## $ educ     <int> 12, 13, 5, 4, 11, 7, 16, 10, 5, 4, 15, 7, 0, 4, 12, 7, 7, ...
## $ ceb      <int> 0, 3, 1, 3, 2, 1, 4, 0, 1, 0, 3, 3, 4, 10, 3, 0, 4, 2, 0, ...
## $ agefbrth <int> NA, 25, 27, 17, 24, 26, 20, NA, 19, NA, 25, 23, 18, 19, 23...
## $ children <int> 0, 3, 1, 2, 2, 1, 4, 0, 1, 0, 3, 3, 2, 8, 3, 0, 4, 2, 0, 1...
## $ knowmeth <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ usemeth  <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1...
## $ monthfm  <int> NA, 11, 6, 1, 3, 11, 5, NA, 7, 11, 6, 1, 1, 10, 1, NA, NA,...
## $ yearfm   <int> NA, 80, 83, 61, 66, 76, 78, NA, 72, 61, 70, 84, 66, 66, 74...
## $ agefm    <int> NA, 24, 24, 15, 20, 24, 26, NA, 18, 22, 24, 24, 23, 26, 21...
## $ idlnchld <int> 2, 3, 5, 3, 2, 4, 4, 4, 4, 4, 3, 6, 6, 4, 3, 4, 5, 1, 2, 3...
## $ heduc    <int> NA, 12, 7, 11, 14, 9, 17, NA, 3, 1, 16, 7, NA, 3, 16, NA, ...
## $ agesq    <int> 576, 1024, 900, 1764, 1849, 1296, 1369, 324, 1156, 2401, 1...
## $ urban    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ urb_educ <int> 12, 13, 5, 4, 11, 7, 16, 10, 5, 4, 15, 7, 0, 4, 12, 7, 7, ...
## $ spirit   <int> 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1...
## $ protest  <int> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0...
## $ catholic <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ frsthalf <int> 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0...
## $ educ0    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
## $ evermarr <int> 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1...

13.1.2 Summary statistics

We can easily compute summary statistics of our data by using the stargazer package:

df %>% as.data.frame %>% stargazer(type="text")

## 
## ================================================================
## Statistic   N    Mean   St. Dev.  Min   Pctl(25) Pctl(75)  Max  
## ----------------------------------------------------------------
## mnthborn  4,361  6.331   3.323     1       3        9       12  
## yearborn  4,361 60.434   8.683     38      55       68      73  
## age       4,361 27.405   8.685     15      20       33      49  
## electric  4,358  0.140   0.347   0.000   0.000    0.000   1.000 
## radio     4,359  0.702   0.458   0.000   0.000    1.000   1.000 
## tv        4,359  0.093   0.290   0.000   0.000    0.000   1.000 
## bicycle   4,358  0.276   0.447   0.000   0.000    1.000   1.000 
## educ      4,361  5.856   3.927     0       3        8       20  
## ceb       4,361  2.442   2.407     0       1        4       13  
## agefbrth  3,273 19.011   3.092   10.000  17.000   20.000  38.000
## children  4,361  2.268   2.222     0       0        4       13  
## knowmeth  4,354  0.963   0.188   0.000   1.000    1.000   1.000 
## usemeth   4,290  0.578   0.494   0.000   0.000    1.000   1.000 
## monthfm   2,079  6.270   3.620   1.000   3.000    9.000   12.000
## yearfm    2,079 76.912   7.760   50.000  72.000   83.000  88.000
## agefm     2,079 20.686   5.002   10.000  17.000   23.000  46.000
## idlnchld  4,241  4.616   2.219   0.000   3.000    6.000   20.000
## heduc     1,956  5.145   4.803   0.000   0.000    8.000   20.000
## agesq     4,361 826.460 526.923   225     400     1,089   2,401 
## urban     4,361  0.517   0.500     0       0        1       1   
## urb_educ  4,361  3.469   4.294     0       0        7       20  
## spirit    4,361  0.422   0.494     0       0        1       1   
## protest   4,361  0.228   0.419     0       0        0       1   
## catholic  4,361  0.102   0.303     0       0        0       1   
## frsthalf  4,361  0.540   0.498     0       0        1       1   
## educ0     4,361  0.208   0.406     0       0        0       1   
## evermarr  4,361  0.477   0.500     0       0        1       1   
## ----------------------------------------------------------------

or:

df %>% as.data.frame() %>% skimr::skim()

Data summary
Name	Piped data
Number of rows	4361
Number of columns	27
_______________________
Column type frequency:
numeric	27
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
mnthborn	0	1.00	6.33	3.32	1	3	6	9	12	▇▅▇▆▆
yearborn	0	1.00	60.43	8.68	38	55	62	68	73	▂▃▆▇▇
age	0	1.00	27.41	8.69	15	20	26	33	49	▇▇▆▃▂
electric	3	1.00	0.14	0.35	0	0	0	0	1	▇▁▁▁▁
radio	2	1.00	0.70	0.46	0	0	1	1	1	▃▁▁▁▇
tv	2	1.00	0.09	0.29	0	0	0	0	1	▇▁▁▁▁
bicycle	3	1.00	0.28	0.45	0	0	0	1	1	▇▁▁▁▃
educ	0	1.00	5.86	3.93	0	3	7	8	20	▆▇▅▁▁
ceb	0	1.00	2.44	2.41	0	1	2	4	13	▇▃▁▁▁
agefbrth	1088	0.75	19.01	3.09	10	17	19	20	38	▁▇▂▁▁
children	0	1.00	2.27	2.22	0	0	2	4	13	▇▃▁▁▁
knowmeth	7	1.00	0.96	0.19	0	1	1	1	1	▁▁▁▁▇
usemeth	71	0.98	0.58	0.49	0	0	1	1	1	▆▁▁▁▇
monthfm	2282	0.48	6.27	3.62	1	3	6	9	12	▇▃▅▅▇
yearfm	2282	0.48	76.91	7.76	50	72	78	83	88	▁▂▃▇▇
agefm	2282	0.48	20.69	5.00	10	17	20	23	46	▅▇▂▁▁
idlnchld	120	0.97	4.62	2.22	0	3	4	6	20	▇▅▁▁▁
heduc	2405	0.45	5.14	4.80	0	0	6	8	20	▇▅▃▁▁
agesq	0	1.00	826.46	526.92	225	400	676	1089	2401	▇▅▂▁▁
urban	0	1.00	0.52	0.50	0	0	1	1	1	▇▁▁▁▇
urb_educ	0	1.00	3.47	4.29	0	0	0	7	20	▇▃▂▁▁
spirit	0	1.00	0.42	0.49	0	0	0	1	1	▇▁▁▁▆
protest	0	1.00	0.23	0.42	0	0	0	0	1	▇▁▁▁▂
catholic	0	1.00	0.10	0.30	0	0	0	0	1	▇▁▁▁▁
frsthalf	0	1.00	0.54	0.50	0	0	1	1	1	▇▁▁▁▇
educ0	0	1.00	0.21	0.41	0	0	0	0	1	▇▁▁▁▂
evermarr	0	1.00	0.48	0.50	0	0	0	1	1	▇▁▁▁▇

What do you think is going on when you see varying numbers of observations across the different variables?

13.2 Determinants of fertility

Suppose we want to see if education causes lower fertility (as can be seen when comparing more- and less-educated countries):

\[ children = \beta_0 + \beta_1 educ + \beta_2 age + \beta_3 age^2 + u \]

where \(children\) is the number of children born to the woman, \(educ\) is years of education, and \(age\) is age (in years).

Interpret the estimates of the regression:

est.ols.rob <- estimatr::lm_robust(children ~ educ + age + I(age^2), data=df)
est.ols <- lm(children ~ educ + age + I(age^2), data=df)

summary(est.ols)

## 
## Call:
## lm(formula = children ~ educ + age + I(age^2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8351 -0.7135 -0.0054  0.7141  7.5055 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.1383066  0.2405942 -17.200   <2e-16 ***
## educ        -0.0905755  0.0059207 -15.298   <2e-16 ***
## age          0.3324486  0.0165495  20.088   <2e-16 ***
## I(age^2)    -0.0026308  0.0002726  -9.651   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.46 on 4357 degrees of freedom
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5684 
## F-statistic:  1915 on 3 and 4357 DF,  p-value: < 2.2e-16

summary(est.ols.rob)

## 
## Call:
## estimatr::lm_robust(formula = children ~ educ + age + I(age^2), 
##     data = df)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)  CI Lower CI Upper   DF
## (Intercept) -4.138307  0.2438139 -16.973 1.282e-62 -4.616306 -3.66031 4357
## educ        -0.090575  0.0060502 -14.971 1.901e-49 -0.102437 -0.07871 4357
## age          0.332449  0.0192233  17.294 7.309e-65  0.294761  0.37014 4357
## I(age^2)    -0.002631  0.0003523  -7.468 9.800e-14 -0.003321 -0.00194 4357
## 
## Multiple R-squared:  0.5687 ,    Adjusted R-squared:  0.5684 
## F-statistic:  1922 on 3 and 4357 DF,  p-value: < 2.2e-16

(Note: include I(age^2) puts the quadratic term in automatically without us having to use mutate() to create a new variable called age.sq.)

We can also use stargazer to examine the output. It puts the standard errors of each variable in parentheses under the estimated coefficient.

stargazer(est.ols, type="text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                              children          
## -----------------------------------------------
## educ                         -0.091***         
##                               (0.006)          
##                                                
## age                          0.332***          
##                               (0.017)          
##                                                
## I(age2)                      -0.003***         
##                              (0.0003)          
##                                                
## Constant                     -4.138***         
##                               (0.241)          
##                                                
## -----------------------------------------------
## Observations                   4,361           
## R2                             0.569           
## Adjusted R2                    0.568           
## Residual Std. Error      1.460 (df = 4357)     
## F Statistic         1,915.196*** (df = 3; 4357)
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

13.2.1 Instrumenting for endogenous education

We know that education is endogenous (i.e. people choose the level of education that maximizes their utility). A possible instrument for education is \(firsthalf\), which is a dummy equal to 1 if the woman was born in the first half of the calendar year, and 0 otherwise.

Let’s create this variable:

df %<>% 
  dplyr::mutate(firsthalf = mnthborn<7)

We will assume that \(firsthalf\) is uncorrelated with \(u\).

Check that \(firsthalf\) is correlated with \(educ\) by running a regression. Call the output est.iv1.

est.iv1 <- lm(educ ~ firsthalf, data=df)
stargazer(est.iv1, type="text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                educ            
## -----------------------------------------------
## firsthalf                    -0.938***         
##                               (0.118)          
##                                                
## Constant                     6.363***          
##                               (0.087)          
##                                                
## -----------------------------------------------
## Observations                   4,361           
## R2                             0.014           
## Adjusted R2                    0.014           
## Residual Std. Error      3.900 (df = 4359)     
## F Statistic          62.620*** (df = 1; 4359)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

13.2.2 IV estimation

Now let’s do the IV regression:

est.iv <- ivreg(children ~ educ + age + I(age^2) | firsthalf + age + I(age^2), data=df)

summary(est.iv)

## 
## Call:
## ivreg(formula = children ~ educ + age + I(age^2) | firsthalf + 
##     age + I(age^2), data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -6.05272 -0.71481  0.06224  0.76236  7.23693 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.3878054  0.5481502  -6.180 6.98e-10 ***
## educ        -0.1714989  0.0531796  -3.225  0.00127 ** 
## age          0.3236052  0.0178596  18.119  < 2e-16 ***
## I(age^2)    -0.0026723  0.0002797  -9.555  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.491 on 4357 degrees of freedom
## Multiple R-Squared: 0.5502,  Adjusted R-squared: 0.5499 
## Wald test:  1765 on 3 and 4357 DF,  p-value: < 2.2e-16

summary(est.ols)

## 
## Call:
## lm(formula = children ~ educ + age + I(age^2), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8351 -0.7135 -0.0054  0.7141  7.5055 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.1383066  0.2405942 -17.200   <2e-16 ***
## educ        -0.0905755  0.0059207 -15.298   <2e-16 ***
## age          0.3324486  0.0165495  20.088   <2e-16 ***
## I(age^2)    -0.0026308  0.0002726  -9.651   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.46 on 4357 degrees of freedom
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5684 
## F-statistic:  1915 on 3 and 4357 DF,  p-value: < 2.2e-16

The variables on the right hand side of the | are the instruments (including the \(x\)’s that we assume to be exogenous, like \(age\)). The endogenous \(x\) is the first one after the ~.

Now we can compare the output for each of the models:

stargazer(est.ols,est.iv1,est.iv, type="text")

## 
## ==========================================================================================
##                                              Dependent variable:                          
##                     ----------------------------------------------------------------------
##                              children                     educ               children     
##                                 OLS                       OLS              instrumental   
##                                                                              variable     
##                                 (1)                       (2)                   (3)       
## ------------------------------------------------------------------------------------------
## educ                         -0.091***                                       -0.171***    
##                               (0.006)                                         (0.053)     
##                                                                                           
## age                          0.332***                                        0.324***     
##                               (0.017)                                         (0.018)     
##                                                                                           
## I(age2)                      -0.003***                                       -0.003***    
##                              (0.0003)                                        (0.0003)     
##                                                                                           
## firsthalf                                              -0.938***                          
##                                                         (0.118)                           
##                                                                                           
## Constant                     -4.138***                  6.363***             -3.388***    
##                               (0.241)                   (0.087)               (0.548)     
##                                                                                           
## ------------------------------------------------------------------------------------------
## Observations                   4,361                     4,361                 4,361      
## R2                             0.569                     0.014                 0.550      
## Adjusted R2                    0.568                     0.014                 0.550      
## Residual Std. Error      1.460 (df = 4357)         3.900 (df = 4359)     1.491 (df = 4357)
## F Statistic         1,915.196*** (df = 3; 4357) 62.620*** (df = 1; 4359)                  
## ==========================================================================================
## Note:                                                          *p<0.1; **p<0.05; ***p<0.01

Comment on the IV estimates. Do they make sense? Discuss why the IV standard error is so much larger than the OLS standard error.