The purpose of this in-class Lab 9 is to use R to practice testing for the presence of heteroskedasticity in regression models, using the Breusch-Pagan and White tests. We will also use the Lagrange Multiplier (LM) test. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
First, install the lmtest package.
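If you haven’t installed it before, run this once in the console (not in your script):
install.packages("lmtest")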
Open up a new R script (named ICL9_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(car)
library(lmtest)
library(estimatr)
library(magrittr)
We’ll use a data set on college GPA, called gpa3. The data set contains a sample of 732 students.
df <- as_tibble(gpa3)
Check out what’s in the data by typing
glimpse(df)
## Observations: 732
## Variables: 23
## $ term <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2...
## $ sat <int> 920, 920, 780, 780, 810, 810, 1080, 1080, 960, 960, 790, 7...
## $ tothrs <int> 31, 43, 28, 43, 0, 14, 0, 17, 91, 106, 0, 12, 25, 41, 30, ...
## $ cumgpa <dbl> 2.25, 2.04, 2.03, 2.09, 0.00, 1.78, 0.00, 2.00, 2.35, 2.41...
## $ season <int> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0...
## $ frstsem <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ crsgpa <dbl> 2.6464, 2.5077, 2.8679, 2.8839, 2.7634, 2.5600, 2.5149, 2....
## $ verbmath <dbl> 0.48387, 0.48387, 0.81395, 0.81395, 0.88372, 0.88372, 0.80...
## $ trmgpa <dbl> 1.50, 2.25, 2.20, 1.60, 1.60, 1.29, 2.00, 2.73, 2.80, 2.60...
## $ hssize <int> 10, 10, 123, 123, 119, 119, 318, 318, 383, 383, 44, 44, 34...
## $ hsrank <int> 4, 4, 102, 102, 42, 42, 31, 31, 66, 66, 27, 27, 36, 36, 15...
## $ id <dbl> 22, 22, 35, 35, 36, 36, 156, 156, 246, 246, 264, 264, 326,...
## $ spring <int> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1...
## $ female <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0...
## $ black <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1...
## $ white <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0...
## $ ctrmgpa <dbl> NA, 0.75000000, NA, -0.60000002, NA, -0.31000006, NA, 0.73...
## $ ctothrs <int> NA, 12, NA, 15, NA, 14, NA, 17, NA, 15, NA, 12, NA, 16, NA...
## $ ccrsgpa <dbl> NA, -0.13870001, NA, 0.01600003, NA, -0.20340014, NA, -0.0...
## $ ccrspop <dbl> NA, -62.25000, NA, -73.25000, NA, 72.32996, NA, -320.19995...
## $ cseason <int> NA, 1, NA, 1, NA, 1, NA, -1, NA, -1, NA, -1, NA, 0, NA, -1...
## $ hsperc <dbl> 40.000000, 40.000000, 82.926826, 82.926826, 35.294117, 35....
## $ football <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
The main variables we’re interested in are SAT score, high school percentile, credit hours, gender, and race. We also only want to look at observations from the Spring semester.
Use a filter() statement to drop observations not in the Spring semester:
df %<>%
dplyr::filter(spring==1)
Use a select() statement to keep only the variables that will be used:
df %<>% select(cumgpa,
sat,
hsperc,
tothrs,
female,
black,
white)
Look at the data to make sure the code worked as expected. You should now have 366 observations and 7 variables.
df %>% DT::datatable() # interactive view of the final variables (requires the DT package)
df %>% glimpse()
## Observations: 366
## Variables: 7
## $ cumgpa <dbl> 2.04, 2.09, 1.78, 2.00, 2.41, 1.16, 1.87, 2.09, 2.17, 1.90, ...
## $ sat <int> 920, 780, 810, 1080, 960, 790, 820, 820, 1040, 730, 780, 830...
## $ hsperc <dbl> 40.000000, 82.926826, 35.294117, 9.748427, 17.232376, 61.363...
## $ tothrs <int> 43, 43, 14, 17, 106, 12, 41, 42, 17, 50, 71, 74, 132, 47, 10...
## $ female <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ black <int> 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, ...
## $ white <int> 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, ...
Estimate the following regression model:
\[ cumgpa = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs + \beta_4 female + \beta_5 black + \beta_6 white + u \]
est <- lm(cumgpa ~ sat + hsperc + tothrs + female + black + white, data=df)
tidy(est)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.47 0.230 6.40 4.94e-10
## 2 sat 0.00114 0.000179 6.39 5.20e-10
## 3 hsperc -0.00857 0.00124 -6.91 2.27e-11
## 4 tothrs 0.00250 0.000731 3.43 6.85e- 4
## 5 female 0.303 0.0590 5.14 4.50e- 7
## 6 black -0.128 0.147 -0.870 3.85e- 1
## 7 white -0.0587 0.141 -0.416 6.77e- 1
summary(est)
##
## Call:
## lm(formula = cumgpa ~ sat + hsperc + tothrs + female + black +
## white, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54320 -0.29104 -0.02252 0.28348 1.24872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4700648 0.2298031 6.397 4.94e-10 ***
## sat 0.0011407 0.0001786 6.389 5.20e-10 ***
## hsperc -0.0085664 0.0012404 -6.906 2.27e-11 ***
## tothrs 0.0025040 0.0007310 3.426 0.000685 ***
## female 0.3034333 0.0590203 5.141 4.50e-07 ***
## black -0.1282837 0.1473701 -0.870 0.384616
## white -0.0587217 0.1409896 -0.416 0.677295
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4693 on 359 degrees of freedom
## Multiple R-squared: 0.4006, Adjusted R-squared: 0.3905
## F-statistic: 39.98 on 6 and 359 DF, p-value: < 2.2e-16
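Before turning to formal tests, it can be useful to eyeball the residuals. Here is a minimal sketch (using the est object from above) that plots the residuals against the fitted values; a fan or funnel shape is a visual hint of heteroskedasticity:
df %>%
  mutate(fitted = fitted(est), resid = resid(est)) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point()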
To conduct the Breusch-Pagan test for heteroskedasticity, we use the bptest() function from the lmtest package:
bptest(est)
##
## studentized Breusch-Pagan test
##
## data: est
## BP = 44.557, df = 6, p-value = 5.732e-08
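The tiny p-value leads us to reject the null hypothesis of homoskedasticity. To see what bptest() is doing under the hood, here is a sketch of the (studentized) BP test by hand: regress the squared residuals on the regressors and compute n times the R-squared of that auxiliary regression:
BPreg <- lm(resid(est)^2 ~ sat + hsperc + tothrs + female + black + white, data=df)
BP <- nobs(BPreg)*glance(BPreg)$r.squared
1-pchisq(BP, 6) # should match the bptest() output above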
To do the White test (in its special form based on the fitted values), we simply modify the arguments of the bptest() function:
bptest(est, ~ fitted(est) + I(fitted(est)^2) )
##
## studentized Breusch-Pagan test
##
## data: est
## BP = 1.2645, df = 2, p-value = 0.5314
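Note that the special form fails to reject homoskedasticity, even though the BP test rejected it. The same by-hand logic applies here; a sketch, again assuming the est object from above:
Wreg <- lm(resid(est)^2 ~ fitted(est) + I(fitted(est)^2))
W <- nobs(Wreg)*glance(Wreg)$r.squared
1-pchisq(W, 2) # should match the bptest() output above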
Now let’s obtain standard errors from the above regression that are robust to heteroskedasticity. To do so, we use the lm_robust() function from the estimatr package. This function works like regular lm() but reports a refined version (HC2) of White’s heteroskedasticity-robust standard errors.
est.rob <- lm_robust(cumgpa ~ sat + hsperc + tothrs + female + black + white, data=df)
summary(est.rob)
##
## Call:
## lm_robust(formula = cumgpa ~ sat + hsperc + tothrs + female +
## black + white, data = df)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.470065 0.2238293 6.5678 1.794e-10 1.0298834 1.910246 359
## sat 0.001141 0.0001925 5.9268 7.249e-09 0.0007622 0.001519 359
## hsperc -0.008566 0.0014237 -6.0172 4.380e-09 -0.0113661 -0.005767 359
## tothrs 0.002504 0.0007413 3.3776 8.112e-04 0.0010461 0.003962 359
## female 0.303433 0.0592945 5.1174 5.058e-07 0.1868250 0.420042 359
## black -0.128284 0.1230101 -1.0429 2.977e-01 -0.3701947 0.113627 359
## white -0.058722 0.1152462 -0.5095 6.107e-01 -0.2853642 0.167921 359
##
## Multiple R-squared: 0.4006 , Adjusted R-squared: 0.3905
## F-statistic: 39.17 on 6 and 359 DF, p-value: < 2.2e-16
tidy(est)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.47 0.230 6.40 4.94e-10
## 2 sat 0.00114 0.000179 6.39 5.20e-10
## 3 hsperc -0.00857 0.00124 -6.91 2.27e-11
## 4 tothrs 0.00250 0.000731 3.43 6.85e- 4
## 5 female 0.303 0.0590 5.14 4.50e- 7
## 6 black -0.128 0.147 -0.870 3.85e- 1
## 7 white -0.0587 0.141 -0.416 6.77e- 1
tidy(est.rob)
## term estimate std.error statistic p.value conf.low
## 1 (Intercept) 1.470064766 0.2238293185 6.5677936 1.794270e-10 1.0298833856
## 2 sat 0.001140728 0.0001924686 5.9268256 7.248729e-09 0.0007622202
## 3 hsperc -0.008566358 0.0014236525 -6.0171689 4.379958e-09 -0.0113661039
## 4 tothrs 0.002503998 0.0007413439 3.3776471 8.112185e-04 0.0010460757
## 5 female 0.303433294 0.0592945229 5.1173916 5.057872e-07 0.1868250448
## 6 black -0.128283685 0.1230101308 -1.0428709 2.977098e-01 -0.3701946623
## 7 white -0.058721727 0.1152461964 -0.5095329 6.106918e-01 -0.2853641981
## conf.high df outcome
## 1 1.910246147 359 cumgpa
## 2 0.001519235 359 cumgpa
## 3 -0.005766611 359 cumgpa
## 4 0.003961921 359 cumgpa
## 5 0.420041544 359 cumgpa
## 6 0.113627293 359 cumgpa
## 7 0.167920744 359 cumgpa
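To put the conventional and robust standard errors side by side, here is a small sketch built from the two tidy() outputs above:
tibble(term = tidy(est)$term,
       se.conventional = tidy(est)$std.error,
       se.robust = tidy(est.rob)$std.error)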
Now look at the robust version of the overall F-test. Is its conclusion changed?
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.401 0.391 0.469 40.0 3.41e-37 7 -239. 494. 525.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
linearHypothesis(est.rob, c('sat','hsperc','tothrs','female','black','white'))
## Linear hypothesis test
##
## Hypothesis:
## sat = 0
## hsperc = 0
## tothrs = 0
## female = 0
## black = 0
## white = 0
##
## Model 1: restricted model
## Model 2: cumgpa ~ sat + hsperc + tothrs + female + black + white
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 365
## 2 359 6 235 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The LM test is an alternative to the overall F test that is reported in glance(est). To perform the LM test, we need to do the following:
# Restricted model (intercept only)
restr <- lm(cumgpa ~ 1, data=df)
# Regress the restricted-model residuals on all of the regressors
LMreg <- lm(resid(restr) ~ sat + hsperc + tothrs + female + black + white, data=df)
# LM statistic = n * R-squared of the auxiliary regression
LM <- nobs(LMreg)*glance(LMreg)$r.squared
# p-value from the chi-squared distribution with 6 degrees of freedom
pval <- 1-pchisq(LM,6)
pval
## [1] 0
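The reported p-value of 0 is a rounding artifact: 1-pchisq() loses precision when the chi-squared CDF is very close to 1. Asking pchisq() directly for the upper tail is more precise:
pchisq(LM, 6, lower.tail=FALSE) # same test, better numerical precision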