The purpose of this in-class lab is to use R to practice computing weighted statistics and appropriately correcting for clustering in standard errors. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.

10.1 For starters

First, install the NHANES and clubSandwich packages. You won’t need the wooldridge package for this lab.
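
If you haven’t installed these packages before, you can do so once from the console:

install.packages(c("NHANES","clubSandwich"))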

Open up a new R script (named ICL10_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:

# Add names of group members HERE
library(tidyverse)
library(broom)
library(car)
library(lmtest)
library(magrittr)
library(NHANES)
library(clubSandwich)

10.1.1 Load the data

We’ll use a well-known health data set called the National Health and Nutrition Examination Survey (NHANES). The data set contains 78 variables detailing the demographics and health status of 20,293 Americans.

The NHANES is not a random sample of the US population. Instead, the survey oversamples certain demographic groups so that it can obtain more precise measurements of their health status.

df <- as_tibble(NHANESraw)

Check out what’s in the data by typing View(df) in the console. (Using glimpse() in this case is probably not a good idea because there are so many variables, but you should feel free to test it out and see if you find it useful.)

The main variables we’re interested in are: BMI, SleepHrsNight, Age, Education, Gender, Race1, and WTINT2YR (a variable indicating each person’s sampling weight). We also only want to look at observations from the 2009-2010 survey wave.
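
For a more manageable peek than glimpse() on all 78 variables, one option (a minimal sketch) is to select() just the variables listed above and glimpse() those:

df %>% select(BMI, SleepHrsNight, Age, Education, Gender, Race1, WTINT2YR) %>% glimpse()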

10.1.2 Restrict observations to the 2009-2010 survey wave and age 19+

Use a filter() statement to keep observations where SurveyYr equals 2009_10 and Age is at least 19. Because SurveyYr is a factor, the code is a bit tricky, so I’ll put it below for your reference:1

df %<>% dplyr::filter(as.character(SurveyYr)=='2009_10',Age>=19)
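
Aside: to check the exact spelling of the wave labels that as.character() is matched against, you can list the factor’s levels before filtering:

levels(NHANESraw$SurveyYr)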

10.1.3 Get rid of variables you won’t use

Use a select() statement to keep only the variables that will be used (refer to the list above; try writing the code yourself before peeking below):

df %<>% select(BMI, SleepHrsNight, Age, Education, Gender, Race1, WTINT2YR)

10.1.4 Drop missing values

Finally, get rid of observations with missing BMI, Education, or SleepHrsNight:

df %<>% dplyr::filter(!is.na(Education),!is.na(BMI),!is.na(SleepHrsNight))
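
Equivalently (a sketch using tidyr, which is attached with the tidyverse), you could replace the filter() above with drop_na():

df %<>% tidyr::drop_na(Education, BMI, SleepHrsNight)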

Look at the data to make sure the code worked as expected. You should now have 5,971 observations and 7 variables.

df %>% glimpse()
## Observations: 5,971
## Variables: 7
## $ BMI           <dbl> 32.22, 42.39, 32.61, 30.57, 26.04, 27.62, 39.90, 28.2...
## $ SleepHrsNight <int> 4, 4, 4, 8, 6, 9, 10, 7, 8, 8, 6, 7, 7, 5, 4, 4, 7, 4...
## $ Age           <int> 34, 60, 26, 49, 80, 80, 42, 66, 45, 28, 44, 66, 49, 5...
## $ Education     <fct> High School, High School, 9 - 11th Grade, Some Colleg...
## $ Gender        <fct> male, female, male, female, male, male, female, male,...
## $ Race1         <fct> White, Black, Mexican, White, White, White, Black, Me...
## $ WTINT2YR      <dbl> 80100.544, 20090.339, 22537.827, 74212.270, 11998.401...
df %>% DataExplorer::introduce() %>% as.data.frame()
##   rows columns discrete_columns continuous_columns all_missing_columns
## 1 5971       7                3                  4                   0
##   total_missing_values complete_rows total_observations memory_usage
## 1                    0          5971              41797       219288

10.2 Computing weighted summary stats

Suppose you are interested in the average BMI of the US population. You might be tempted to type

mean(df$BMI)
## [1] 29.16829

but you know that NHANES is not a random sample, and that there are sampling weights included in the data.

To compute the population average, you use the sampling weights like so:

weighted.mean(df$BMI, w=df$WTINT2YR)
## [1] 28.7547
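
Under the hood, weighted.mean() simply computes sum(w*x)/sum(w); you can verify this by hand:

with(df, sum(WTINT2YR * BMI) / sum(WTINT2YR))
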
  1. How different are your two answers? What is the relevant population, given that you deleted so many observations on the way to estimating the population mean?

10.2.1 Regression-adjusted summary stats

Suppose now you are interested in the male-female BMI differential, correcting for other factors (like race and amount of sleep). The easiest way to do this is to estimate a regression model

\[ BMI = \beta_0 + \beta_1 male + \beta_2 race + \beta_3 sleep + u \]

est.unweighted <- lm(BMI ~ Gender + Race1 + SleepHrsNight, data=df)
summary(est.unweighted)
## 
## Call:
## lm(formula = BMI ~ Gender + Race1 + SleepHrsNight, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.800  -4.528  -0.973   3.245  53.566 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.72125    0.45506  71.906  < 2e-16 ***
## Gendermale    -0.79484    0.17513  -4.538 5.78e-06 ***
## Race1Hispanic -1.77041    0.34303  -5.161 2.53e-07 ***
## Race1Mexican  -0.97648    0.29135  -3.352 0.000808 ***
## Race1White    -2.00713    0.24328  -8.250  < 2e-16 ***
## Race1Other    -4.27006    0.42436 -10.062  < 2e-16 ***
## SleepHrsNight -0.23617    0.06124  -3.857 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.759 on 5964 degrees of freedom
## Multiple R-squared:  0.02813,    Adjusted R-squared:  0.02716 
## F-statistic: 28.78 on 6 and 5964 DF,  p-value: < 2.2e-16
  1. Interpret your estimate of the Gendermale coefficient.

Now add weights so that your estimate lines up with the true difference in BMI in the population:

est.weighted <- lm(BMI ~ Gender + Race1 + SleepHrsNight, weights=WTINT2YR, data=df)
summary(est.weighted)
## 
## Call:
## lm(formula = BMI ~ Gender + Race1 + SleepHrsNight, data = df, 
##     weights = WTINT2YR)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -4154.8  -746.2  -130.6   537.9  9993.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.63860    0.49800  65.540  < 2e-16 ***
## Gendermale    -0.08867    0.17210  -0.515 0.606407    
## Race1Hispanic -2.37087    0.46075  -5.146 2.75e-07 ***
## Race1Mexican  -1.47100    0.38975  -3.774 0.000162 ***
## Race1White    -2.46040    0.27671  -8.892  < 2e-16 ***
## Race1Other    -4.82287    0.41138 -11.724  < 2e-16 ***
## SleepHrsNight -0.22995    0.06471  -3.554 0.000383 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1249 on 5964 degrees of freedom
## Multiple R-squared:  0.02812,    Adjusted R-squared:  0.02714 
## F-statistic: 28.76 on 6 and 5964 DF,  p-value: < 2.2e-16
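
Before answering, it may help to line the two Gendermale estimates up side by side (a minimal sketch using broom’s tidy()):

bind_rows(unweighted = tidy(est.unweighted),
          weighted   = tidy(est.weighted),
          .id = "model") %>%
  dplyr::filter(term == "Gendermale")
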
  1. How does your answer change when you supply weights? What do you now conclude about the population gender differential in BMI?

  2. What do you notice about differentials in race and amount of sleep?
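
To connect the regression-adjusted differential back to raw averages, a quick sketch computes the weighted mean BMI by gender:

df %>%
  group_by(Gender) %>%
  summarize(avg.BMI = weighted.mean(BMI, w = WTINT2YR))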

10.3 Inference with Cluster-Robust Standard Errors

Now let’s move to a different data set and regression model and obtain standard errors that are robust to clustering (i.e., to correlation across observations within a group), not just to heteroskedasticity. To do so, we use the coef_test() function from the clubSandwich package.

10.3.1 Load new data

First load the data, which is a CSV file from my website:

df.auto <- read_csv('https://tyleransom.github.io/teaching/MetricsLabs/auto.csv')
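
It’s worth a quick look to confirm that the variables used below (price, weight, foreign, and manufacturer) came through as expected:

df.auto %>% glimpse()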

The data set contains specifications for 74 different makes of automobiles. Estimate the following regression model:

\[ \log(price) = \beta_0 + \beta_1 weight + \beta_2 foreign + u \]

df.auto %<>% mutate(log.price = log(price), 
                    foreign = as.factor(foreign))

est.auto <- lm(log.price ~ weight + foreign, data=df.auto)

Regular standard errors:

tidy(est.auto)
## # A tibble: 3 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    7.09     0.170         41.7  1.07e-51
## 2 weight         0.000461 0.0000500      9.21 9.44e-14
## 3 foreignForeign 0.535    0.0844         6.34 1.85e- 8

Now use the heteroskedasticity-robust SEs from the last lab:

est.auto.robust <- estimatr::lm_robust(log.price ~ weight + foreign, data=df.auto)

summary(est.auto.robust)
## 
## Call:
## estimatr::lm_robust(formula = log.price ~ weight + foreign, data = df.auto)
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                 Estimate Std. Error t value  Pr(>|t|) CI Lower  CI Upper DF
## (Intercept)    7.0908575  2.035e-01  34.837 2.215e-46 6.684997 7.4967180 71
## weight         0.0004606  6.096e-05   7.556 1.112e-10 0.000339 0.0005821 71
## foreignForeign 0.5352669  8.733e-02   6.129 4.435e-08 0.361136 0.7093979 71
## 
## Multiple R-squared:  0.548 , Adjusted R-squared:  0.5353 
## F-statistic:  29.2 on 2 and 71 DF,  p-value: 5.582e-10

or:

tidy(coeftest(est.auto, vcov=hccm))
## # A tibble: 3 x 5
##   term           estimate std.error statistic  p.value
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    7.09     0.212         33.5  3.08e-45
## 2 weight         0.000461 0.0000632      7.28 3.54e-10
## 3 foreignForeign 0.535    0.0906         5.91 1.10e- 7

Now use the cluster-robust SEs:

coef_test(est.auto, vcov="CR1", cluster=df.auto$manufacturer)
##            Coef. Estimate       SE t-stat d.f. p-val (Satt) Sig.
## 1    (Intercept) 7.090857 2.68e-01  26.46 10.5       <0.001  ***
## 2         weight 0.000461 8.34e-05   5.52 10.8       <0.001  ***
## 3 foreignForeign 0.535267 1.05e-01   5.10 15.2       <0.001  ***

Notice that the SEs on each of the coefficients get bigger with each additional robustness option. The reason for this is that price is correlated within auto manufacturer (due to branding effects).
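
To see that progression in one place, a minimal sketch collects the three sets of standard errors side by side, using hccm() from car (HC3 by default) and vcovCR() from clubSandwich:

# classical (iid), heteroskedasticity-robust, and cluster-robust SEs
se.iid <- sqrt(diag(vcov(est.auto)))
se.hc  <- sqrt(diag(hccm(est.auto)))
se.cr1 <- sqrt(diag(vcovCR(est.auto, cluster = df.auto$manufacturer, type = "CR1")))
cbind(se.iid, se.hc, se.cr1)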

Finally, you can do an F-test as follows:

Wald_test(est.auto, c("weight","foreignForeign"), 
          vcov="CR1", 
          cluster=df.auto$manufacturer)
##  Test    F d.f.  p.val
##   HTZ 15.7 13.4 <0.001
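
Note: newer versions of clubSandwich expect the constraints to be built with a helper such as constrain_zero(). If the call above errors on your installed version, this sketch of the newer syntax should be equivalent:

Wald_test(est.auto,
          constraints = constrain_zero(c("weight","foreignForeign")),
          vcov = "CR1",
          cluster = df.auto$manufacturer)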



  1. The trick is to convert the factor to a string variable so that you are able to match the label of the factor. Similarly, if the labels of the factor are integers, you should use as.numeric(as.character(SurveyYr))==2009 in the filter() statement, since as.numeric() applied directly to a factor returns the underlying level codes rather than the labels.