The purpose of this in-class Lab 9 is to use R to practice testing for the presence of heteroskedasticity in regression models, using the Breusch-Pagan and White tests. We will also use the Lagrange Multiplier (LM) test. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
First, install the lmtest package.
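If you haven’t installed it before, run this once in the console (not in your script):
install.packages("lmtest")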
Open up a new R script (named ICL9_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(car)
library(lmtest)
library(estimatr)
library(magrittr)
We’ll use a data set on college GPA, called gpa3. The data set contains a sample of 732 students.
df <- as_tibble(gpa3)
Check out what’s in the data by typing
glimpse(df)
## Observations: 732
## Variables: 23
## $ term <int> 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2...
## $ sat <int> 920, 920, 780, 780, 810, 810, 1080, 1080, 960, 960, 790, 7...
## $ tothrs <int> 31, 43, 28, 43, 0, 14, 0, 17, 91, 106, 0, 12, 25, 41, 30, ...
## $ cumgpa <dbl> 2.25, 2.04, 2.03, 2.09, 0.00, 1.78, 0.00, 2.00, 2.35, 2.41...
## $ season <int> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0...
## $ frstsem <int> 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ crsgpa <dbl> 2.6464, 2.5077, 2.8679, 2.8839, 2.7634, 2.5600, 2.5149, 2....
## $ verbmath <dbl> 0.48387, 0.48387, 0.81395, 0.81395, 0.88372, 0.88372, 0.80...
## $ trmgpa <dbl> 1.50, 2.25, 2.20, 1.60, 1.60, 1.29, 2.00, 2.73, 2.80, 2.60...
## $ hssize <int> 10, 10, 123, 123, 119, 119, 318, 318, 383, 383, 44, 44, 34...
## $ hsrank <int> 4, 4, 102, 102, 42, 42, 31, 31, 66, 66, 27, 27, 36, 36, 15...
## $ id <dbl> 22, 22, 35, 35, 36, 36, 156, 156, 246, 246, 264, 264, 326,...
## $ spring <int> 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1...
## $ female <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0...
## $ black <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1...
## $ white <int> 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0...
## $ ctrmgpa <dbl> NA, 0.75000000, NA, -0.60000002, NA, -0.31000006, NA, 0.73...
## $ ctothrs <int> NA, 12, NA, 15, NA, 14, NA, 17, NA, 15, NA, 12, NA, 16, NA...
## $ ccrsgpa <dbl> NA, -0.13870001, NA, 0.01600003, NA, -0.20340014, NA, -0.0...
## $ ccrspop <dbl> NA, -62.25000, NA, -73.25000, NA, 72.32996, NA, -320.19995...
## $ cseason <int> NA, 1, NA, 1, NA, 1, NA, -1, NA, -1, NA, -1, NA, 0, NA, -1...
## $ hsperc <dbl> 40.000000, 40.000000, 82.926826, 82.926826, 35.294117, 35....
## $ football <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1...
The main variables we’re interested in are SAT score, high school percentile, credit hours, gender, and race. We also only want to look at observations from the Spring semester.
Use a filter() statement to drop observations not in the Spring semester:
df %<>%
dplyr::filter(spring==1)
Use a select() statement to keep only the variables that will be used:
df %<>% select(cumgpa,
sat,
hsperc,
tothrs,
female,
black,
white)
Look at the data to make sure the code worked as expected. You should now have 366 observations and 7 variables.
df %>% DT::datatable() # interactive view of the final variables (requires the DT package)
df %>% glimpse()
## Observations: 366
## Variables: 7
## $ cumgpa <dbl> 2.04, 2.09, 1.78, 2.00, 2.41, 1.16, 1.87, 2.09, 2.17, 1.90, ...
## $ sat <int> 920, 780, 810, 1080, 960, 790, 820, 820, 1040, 730, 780, 830...
## $ hsperc <dbl> 40.000000, 82.926826, 35.294117, 9.748427, 17.232376, 61.363...
## $ tothrs <int> 43, 43, 14, 17, 106, 12, 41, 42, 17, 50, 71, 74, 132, 47, 10...
## $ female <int> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...
## $ black <int> 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, ...
## $ white <int> 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, ...
Estimate the following regression model:
\[ cumgpa = \beta_0 + \beta_1 sat + \beta_2 hsperc + \beta_3 tothrs + \beta_4 female + \beta_5 black + \beta_6 white + u \]
est <- lm(cumgpa ~ sat + hsperc + tothrs + female + black + white, data=df)
tidy(est)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.47 0.230 6.40 4.94e-10
## 2 sat 0.00114 0.000179 6.39 5.20e-10
## 3 hsperc -0.00857 0.00124 -6.91 2.27e-11
## 4 tothrs 0.00250 0.000731 3.43 6.85e- 4
## 5 female 0.303 0.0590 5.14 4.50e- 7
## 6 black -0.128 0.147 -0.870 3.85e- 1
## 7 white -0.0587 0.141 -0.416 6.77e- 1
summary(est)
##
## Call:
## lm(formula = cumgpa ~ sat + hsperc + tothrs + female + black +
## white, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54320 -0.29104 -0.02252 0.28348 1.24872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4700648 0.2298031 6.397 4.94e-10 ***
## sat 0.0011407 0.0001786 6.389 5.20e-10 ***
## hsperc -0.0085664 0.0012404 -6.906 2.27e-11 ***
## tothrs 0.0025040 0.0007310 3.426 0.000685 ***
## female 0.3034333 0.0590203 5.141 4.50e-07 ***
## black -0.1282837 0.1473701 -0.870 0.384616
## white -0.0587217 0.1409896 -0.416 0.677295
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4693 on 359 degrees of freedom
## Multiple R-squared: 0.4006, Adjusted R-squared: 0.3905
## F-statistic: 39.98 on 6 and 359 DF, p-value: < 2.2e-16
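Before turning to formal tests, it can be useful to eyeball the residuals. Here is a minimal sketch (using the est object from above) that plots the residuals against the fitted values; a fan or funnel shape is a visual hint of heteroskedasticity:
df %>%
  mutate(fitted = fitted(est), resid = resid(est)) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point()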
To conduct the Breusch-Pagan test for heteroskedasticity, we use the bptest() function from the lmtest package:
bptest(est)
##
## studentized Breusch-Pagan test
##
## data: est
## BP = 44.557, df = 6, p-value = 5.732e-08
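The tiny p-value leads us to reject the null hypothesis of homoskedasticity. To see what bptest() is doing under the hood, here is a sketch of the (studentized) BP test by hand: regress the squared residuals on the regressors and compute n times the R-squared of that auxiliary regression:
BPreg <- lm(resid(est)^2 ~ sat + hsperc + tothrs + female + black + white, data=df)
BP <- nobs(BPreg)*glance(BPreg)$r.squared
1-pchisq(BP, 6) # should match the bptest() output above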
To do the White test (in its special form based on the fitted values), we simply modify the arguments of the bptest() function:
bptest(est, ~ fitted(est) + I(fitted(est)^2) )
##
## studentized Breusch-Pagan test
##
## data: est
## BP = 1.2645, df = 2, p-value = 0.5314
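Note that the special form fails to reject homoskedasticity, even though the BP test rejected it. The same by-hand logic applies here; a sketch, again assuming the est object from above:
Wreg <- lm(resid(est)^2 ~ fitted(est) + I(fitted(est)^2))
W <- nobs(Wreg)*glance(Wreg)$r.squared
1-pchisq(W, 2) # should match the bptest() output above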
Now let’s obtain standard errors from the above regression that are robust to heteroskedasticity. To do so, we use the lm_robust() function from the estimatr package. This function works like regular lm() but reports a refined version (HC2) of White’s heteroskedasticity-robust standard errors.
est.rob <- lm_robust(cumgpa ~ sat + hsperc + tothrs + female + black + white, data=df)
summary(est.rob)
##
## Call:
## lm_robust(formula = cumgpa ~ sat + hsperc + tothrs + female +
## black + white, data = df)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 1.470065 0.2238293 6.5678 1.794e-10 1.0298834 1.910246 359
## sat 0.001141 0.0001925 5.9268 7.249e-09 0.0007622 0.001519 359
## hsperc -0.008566 0.0014237 -6.0172 4.380e-09 -0.0113661 -0.005767 359
## tothrs 0.002504 0.0007413 3.3776 8.112e-04 0.0010461 0.003962 359
## female 0.303433 0.0592945 5.1174 5.058e-07 0.1868250 0.420042 359
## black -0.128284 0.1230101 -1.0429 2.977e-01 -0.3701947 0.113627 359
## white -0.058722 0.1152462 -0.5095 6.107e-01 -0.2853642 0.167921 359
##
## Multiple R-squared: 0.4006 , Adjusted R-squared: 0.3905
## F-statistic: 39.17 on 6 and 359 DF, p-value: < 2.2e-16
tidy(est)
## # A tibble: 7 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.47 0.230 6.40 4.94e-10
## 2 sat 0.00114 0.000179 6.39 5.20e-10
## 3 hsperc -0.00857 0.00124 -6.91 2.27e-11
## 4 tothrs 0.00250 0.000731 3.43 6.85e- 4
## 5 female 0.303 0.0590 5.14 4.50e- 7
## 6 black -0.128 0.147 -0.870 3.85e- 1
## 7 white -0.0587 0.141 -0.416 6.77e- 1
tidy(est.rob)
## term estimate std.error statistic p.value conf.low
## 1 (Intercept) 1.470064766 0.2238293185 6.5677936 1.794270e-10 1.0298833856
## 2 sat 0.001140728 0.0001924686 5.9268256 7.248729e-09 0.0007622202
## 3 hsperc -0.008566358 0.0014236525 -6.0171689 4.379958e-09 -0.0113661039
## 4 tothrs 0.002503998 0.0007413439 3.3776471 8.112185e-04 0.0010460757
## 5 female 0.303433294 0.0592945229 5.1173916 5.057872e-07 0.1868250448
## 6 black -0.128283685 0.1230101308 -1.0428709 2.977098e-01 -0.3701946623
## 7 white -0.058721727 0.1152461964 -0.5095329 6.106918e-01 -0.2853641981
## conf.high df outcome
## 1 1.910246147 359 cumgpa
## 2 0.001519235 359 cumgpa
## 3 -0.005766611 359 cumgpa
## 4 0.003961921 359 cumgpa
## 5 0.420041544 359 cumgpa
## 6 0.113627293 359 cumgpa
## 7 0.167920744 359 cumgpa
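To put the conventional and robust standard errors side by side, here is a small sketch built from the two tidy() outputs above:
tibble(term = tidy(est)$term,
       se.conventional = tidy(est)$std.error,
       se.robust = tidy(est.rob)$std.error)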
Now look at the robust version of the overall F-test. Is its conclusion changed?
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.401 0.391 0.469 40.0 3.41e-37 7 -239. 494. 525.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
linearHypothesis(est.rob, c('sat','hsperc','tothrs','female','black','white'))
## Linear hypothesis test
##
## Hypothesis:
## sat = 0
## hsperc = 0
## tothrs = 0
## female = 0
## black = 0
## white = 0
##
## Model 1: restricted model
## Model 2: cumgpa ~ sat + hsperc + tothrs + female + black + white
##
## Res.Df Df Chisq Pr(>Chisq)
## 1 365
## 2 359 6 235 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The LM test is an alternative to the overall F test that is reported in glance(est). To perform the LM test, we need to do the following:
# Restricted model (intercept only)
restr <- lm(cumgpa ~ 1, data=df)
# Regress the restricted-model residuals on all of the regressors
LMreg <- lm(resid(restr) ~ sat + hsperc + tothrs + female + black + white, data=df)
# LM statistic = n * R-squared of the auxiliary regression
LM <- nobs(LMreg)*glance(LMreg)$r.squared
# p-value from the chi-squared distribution with 6 degrees of freedom
pval <- 1-pchisq(LM,6)
pval
## [1] 0
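The reported p-value of 0 is a rounding artifact: 1-pchisq() loses precision when the chi-squared CDF is very close to 1. Asking pchisq() directly for the upper tail is more precise:
pchisq(LM, 6, lower.tail=FALSE) # same test, better numerical precision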