The purpose of this in-class lab is to use R to practice computing weighted statistics and appropriately correcting for clustering in standard errors. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
First, install the NHANES and clubSandwich packages. You won’t need the wooldridge package for this lab.
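If you haven’t installed these packages before, you can do so once from the console (not in your script); a minimal sketch:

# One-time installation from the console
install.packages(c("NHANES", "clubSandwich"))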
Open up a new R script (named ICL10_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(car)
library(lmtest)
library(magrittr)
library(NHANES)
library(clubSandwich)
We’ll use a well-known health data set called the National Health and Nutrition Examination Survey (NHANES). The data set contains 78 variables detailing the demographics and health status of 20,293 Americans.
The NHANES is not a random sample of the US population. Instead, the survey oversamples certain demographic groups so that it can obtain more precise measurements of their health status. Load the raw data (NHANESraw) into a tibble:
df <- as_tibble(NHANESraw)
Check out what’s in the data by typing View(df) in the console. (Using glimpse() in this case is probably not a good idea because there are so many variables, but you should feel free to test it out and see if you find it useful.)
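If you want a lighter-weight look than View() or glimpse(), base R’s dim() and names() show just the dimensions and variable names (purely optional):

# Dimensions and variable names only, without printing any data
dim(df)
names(df)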
The main variables we’re interested in are: BMI, SleepHrsNight, Age, Education, Gender, Race1, and WTINT2YR (a variable indicating each person’s sampling weight). We also only want to look at adults (age 19 and older) in the 2009-2010 survey wave.
Use a filter() statement to keep observations where SurveyYr equals 2009_10. Because SurveyYr is a factor, the code is a bit tricky, so I’ll put it below for your reference:1
df %<>% dplyr::filter(as.character(SurveyYr)=='2009_10',Age>=19)
Use a select() statement to keep only the variables that will be used (refer to the list above); one way to do it is below:
df %<>% select(BMI, SleepHrsNight, Age, Education, Gender, Race1, WTINT2YR)
Finally, get rid of observations with missing BMI, Education, or Sleep:
df %<>% dplyr::filter(!is.na(Education),!is.na(BMI),!is.na(SleepHrsNight))
Look at the data to make sure the code worked as expected. You should now have 5,971 observations and 7 variables.
df %>% glimpse()
## Observations: 5,971
## Variables: 7
## $ BMI <dbl> 32.22, 42.39, 32.61, 30.57, 26.04, 27.62, 39.90, 28.2...
## $ SleepHrsNight <int> 4, 4, 4, 8, 6, 9, 10, 7, 8, 8, 6, 7, 7, 5, 4, 4, 7, 4...
## $ Age <int> 34, 60, 26, 49, 80, 80, 42, 66, 45, 28, 44, 66, 49, 5...
## $ Education <fct> High School, High School, 9 - 11th Grade, Some Colleg...
## $ Gender <fct> male, female, male, female, male, male, female, male,...
## $ Race1 <fct> White, Black, Mexican, White, White, White, Black, Me...
## $ WTINT2YR <dbl> 80100.544, 20090.339, 22537.827, 74212.270, 11998.401...
df %>% DataExplorer::introduce() %>% as.data.frame()
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 5971 7 3 4 0
## total_missing_values complete_rows total_observations memory_usage
## 1 0 5971 41797 219288
Suppose you are interested in the average BMI of the US population. You might be tempted to type
mean(df$BMI)
## [1] 29.16829
but you know that NHANES is not a random sample, and that there are sampling weights included in the data.
To compute the population average, you use the sampling weights like so:
weighted.mean(df$BMI, w=df$WTINT2YR)
## [1] 28.7547
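As a preview of the next step, you could also compute the weighted average BMI separately by gender; a sketch using dplyr’s group_by() and summarize():

# Weighted average BMI by gender (illustrative check, not required)
df %>%
  group_by(Gender) %>%
  summarize(avg.BMI = weighted.mean(BMI, w = WTINT2YR))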
Suppose now you are interested in the male-female BMI differential, correcting for other factors (like race and amount of sleep). The easiest way to do this is to estimate a regression model:
\[ BMI = \beta_0 + \beta_1 male + \beta_2 race + \beta_3 sleep + u \]
est.unweighted <- lm(BMI ~ Gender + Race1 + SleepHrsNight, data=df)
summary(est.unweighted)
##
## Call:
## lm(formula = BMI ~ Gender + Race1 + SleepHrsNight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.800 -4.528 -0.973 3.245 53.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.72125 0.45506 71.906 < 2e-16 ***
## Gendermale -0.79484 0.17513 -4.538 5.78e-06 ***
## Race1Hispanic -1.77041 0.34303 -5.161 2.53e-07 ***
## Race1Mexican -0.97648 0.29135 -3.352 0.000808 ***
## Race1White -2.00713 0.24328 -8.250 < 2e-16 ***
## Race1Other -4.27006 0.42436 -10.062 < 2e-16 ***
## SleepHrsNight -0.23617 0.06124 -3.857 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.759 on 5964 degrees of freedom
## Multiple R-squared: 0.02813, Adjusted R-squared: 0.02716
## F-statistic: 28.78 on 6 and 5964 DF, p-value: < 2.2e-16
Take note of the Gendermale coefficient. Now add weights so that your estimate lines up with the true difference in BMI in the population:
est.weighted <- lm(BMI ~ Gender + Race1 + SleepHrsNight, weights=WTINT2YR, data=df)
summary(est.weighted)
##
## Call:
## lm(formula = BMI ~ Gender + Race1 + SleepHrsNight, data = df,
## weights = WTINT2YR)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -4154.8 -746.2 -130.6 537.9 9993.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.63860 0.49800 65.540 < 2e-16 ***
## Gendermale -0.08867 0.17210 -0.515 0.606407
## Race1Hispanic -2.37087 0.46075 -5.146 2.75e-07 ***
## Race1Mexican -1.47100 0.38975 -3.774 0.000162 ***
## Race1White -2.46040 0.27671 -8.892 < 2e-16 ***
## Race1Other -4.82287 0.41138 -11.724 < 2e-16 ***
## SleepHrsNight -0.22995 0.06471 -3.554 0.000383 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1249 on 5964 degrees of freedom
## Multiple R-squared: 0.02812, Adjusted R-squared: 0.02714
## F-statistic: 28.76 on 6 and 5964 DF, p-value: < 2.2e-16
How does your answer change when you supply weights? What do you now conclude about the population gender differential in BMI?
What do you notice about differentials in race and amount of sleep?
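One way to put the two Gendermale estimates side by side (an optional sketch using broom and dplyr, both loaded in the preamble):

# Compare the Gendermale coefficient across the unweighted and weighted models
bind_rows(
  tidy(est.unweighted) %>% mutate(model = "unweighted"),
  tidy(est.weighted)   %>% mutate(model = "weighted")
) %>%
  dplyr::filter(term == "Gendermale")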
Now let’s switch to a different data set and regression model and obtain standard errors that are robust to clustering (in addition to heteroskedasticity). To do so, we use the coef_test() function from the clubSandwich package.
First load the data, which is a CSV file from my website:
df.auto <- read_csv('https://tyleransom.github.io/teaching/MetricsLabs/auto.csv')
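It’s worth a quick look at the new data frame before modeling (optional):

# Quick check of the auto data
glimpse(df.auto)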
The data set contains specifications for 74 different makes of automobiles. Estimate the following regression model:
\[ \log(price) = \beta_0 + \beta_1 weight + \beta_2 foreign + u \]
df.auto %<>% mutate(log.price = log(price),
foreign = as.factor(foreign))
est.auto <- lm(log.price ~ weight + foreign, data=df.auto)
Regular standard errors:
tidy(est.auto)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.09 0.170 41.7 1.07e-51
## 2 weight 0.000461 0.0000500 9.21 9.44e-14
## 3 foreignForeign 0.535 0.0844 6.34 1.85e- 8
Now use the heteroskedasticity-robust SEs from last lab:
est.auto.robust <- estimatr::lm_robust(log.price ~ weight + foreign, data=df.auto)
summary(est.auto.robust)
##
## Call:
## estimatr::lm_robust(formula = log.price ~ weight + foreign, data = df.auto)
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 7.0908575 2.035e-01 34.837 2.215e-46 6.684997 7.4967180 71
## weight 0.0004606 6.096e-05 7.556 1.112e-10 0.000339 0.0005821 71
## foreignForeign 0.5352669 8.733e-02 6.129 4.435e-08 0.361136 0.7093979 71
##
## Multiple R-squared: 0.548 , Adjusted R-squared: 0.5353
## F-statistic: 29.2 on 2 and 71 DF, p-value: 5.582e-10
or:
tidy(coeftest(est.auto, vcov=hccm))
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.09 0.212 33.5 3.08e-45
## 2 weight 0.000461 0.0000632 7.28 3.54e-10
## 3 foreignForeign 0.535 0.0906 5.91 1.10e- 7
Now use the cluster-robust SEs:
coef_test(est.auto, vcov="CR1", cluster=df.auto$manufacturer)
## Coef. Estimate SE t-stat d.f. p-val (Satt) Sig.
## 1 (Intercept) 7.090857 2.68e-01 26.46 10.5 <0.001 ***
## 2 weight 0.000461 8.34e-05 5.52 10.8 <0.001 ***
## 3 foreignForeign 0.535267 1.05e-01 5.10 15.2 <0.001 ***
Notice that the SEs on each of the coefficients get bigger with each additional robustness option. The reason for this is that price is correlated within auto manufacturer (due to branding effects).
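To see the clustering structure that motivates this, you can count how many automobiles each manufacturer contributes (a quick optional check, using the same manufacturer column passed to coef_test() above):

# How many observations fall within each cluster (manufacturer)?
df.auto %>% count(manufacturer, sort = TRUE)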
Finally, you can do an F-test as follows:
Wald_test(est.auto, c("weight","foreignForeign"),
vcov="CR1",
cluster=df.auto$manufacturer)
## Test F d.f. p.val
## HTZ 15.7 13.4 <0.001
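For comparison, the analogous joint test without the cluster adjustment can be run with car’s linearHypothesis() (the car package is already loaded in the preamble); a sketch:

# Joint F-test of weight and foreignForeign with conventional (non-clustered) SEs
linearHypothesis(est.auto, c("weight = 0", "foreignForeign = 0"))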
The trick is to convert the factor to a character string so that you are able to match the label of the factor. Similarly, if the labels of the factor are integers, you should use as.numeric(as.character(SurveyYr))==2009 in the filter() statement, since as.numeric() applied directly to a factor returns the underlying level codes rather than the labels.↩