The purpose of this in-class lab is to practice running regressions, computing regression formulas, visualizing the Sample Regression Function, using non-linear transformations, and interpreting coefficients. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL3_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
library(tidyverse)
library(broom)
library(wooldridge)
For this lab, let’s use data on school expenditures and math test pass rates from Michigan. This is located in the meap93 data set in the wooldridge package. Each observation is a school district in Michigan.
df <- as_tibble(meap93)
glimpse(df)
## Observations: 408
## Variables: 17
## $ lnchprg <dbl> 1.4, 2.3, 2.7, 3.4, 3.4, 3.4, 3.6, 3.6, 4.2, 4.2, 4.5, 4.5...
## $ enroll <int> 1862, 11355, 7685, 1148, 1572, 2496, 3358, 11983, 3499, 50...
## $ staff <dbl> 112.6, 101.2, 114.0, 85.4, 96.1, 101.1, 87.6, 96.0, 102.8,...
## $ expend <int> 5765, 6601, 6834, 3586, 3847, 5070, 4474, 5159, 5012, 6501...
## $ salary <dbl> 37498, 48722, 44541, 31566, 29781, 36801, 37863, 40133, 36...
## $ benefits <int> 7420, 10370, 7313, 5989, 5545, 5895, 6934, 8085, 7253, 738...
## $ droprate <dbl> 2.9, 1.3, 3.5, 3.6, 0.0, 2.7, 1.4, 2.5, 1.8, 1.9, 3.0, 5.6...
## $ gradrate <dbl> 89.2, 91.4, 91.4, 86.6, 100.0, 89.2, 95.1, 87.8, 93.2, 92....
## $ math10 <dbl> 56.4, 42.7, 43.8, 25.3, 15.3, 46.0, 33.6, 40.1, 42.1, 39.8...
## $ sci11 <dbl> 67.9, 65.3, 54.3, 60.0, 65.8, 60.5, 67.4, 69.4, 71.7, 55.0...
## $ totcomp <dbl> 44918, 59092, 51854, 37555, 35326, 42696, 44797, 48218, 43...
## $ ltotcomp <dbl> 10.71259, 10.98685, 10.85619, 10.53356, 10.47237, 10.66186...
## $ lexpend <dbl> 8.659560, 8.794976, 8.829665, 8.184793, 8.255049, 8.531096...
## $ lenroll <dbl> 7.529407, 9.337414, 8.947025, 7.045776, 7.360104, 7.822445...
## $ lstaff <dbl> 4.723842, 4.617099, 4.736198, 4.447346, 4.565389, 4.616110...
## $ bensal <dbl> 0.1978772134, 0.2128401995, 0.1641858071, 0.1897294521, 0....
## $ lsalary <dbl> 10.53204, 10.79389, 10.70417, 10.35984, 10.30163, 10.51328...
Estimate the following regression model:
\[ math10 = \beta_0 + \beta_1 expend + u \]
The code to do so is:
est <- lm(math10 ~ expend, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.4 2.93 4.55 0.00000700
## 2 expend 0.00246 0.000660 3.72 0.000227
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0330 0.0306 10.3 13.8 2.27e-4 2 -1531. 3067. 3079.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = math10 ~ expend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.579 -7.175 -0.874 6.299 39.174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.336e+01 2.934e+00 4.553 7e-06 ***
## expend 2.456e-03 6.601e-04 3.720 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.33 on 406 degrees of freedom
## Multiple R-squared: 0.03296, Adjusted R-squared: 0.03058
## F-statistic: 13.84 on 1 and 406 DF, p-value: 0.0002273
You should get a coefficient of 0.00246 on expend. Interpret this coefficient. (You can type the interpretation as a comment in your .R script.) Is this number small, given the units that math10 and expend are in?
Verify that the regression coefficients in est are the same as the formulas from the book:
\[ \hat{\beta}_0 = \overline{math10} - \hat{\beta}_1 \overline{expend}, \\ \hat{\beta}_1 = \frac{\widehat{cov}(math10,expend)}{\widehat{var}(expend)} \]
You can do this by typing:
beta1 <- cov(df$math10,df$expend)/var(df$expend)
beta0 <- mean(df$math10)-beta1*mean(df$expend)
beta1
## [1] 0.002455715
beta0
## [1] 13.35923
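As a sanity check, you can compare these by-hand values against the coefficients stored in est. A minimal sketch (all.equal allows for floating-point rounding):
# the by-hand estimates should match lm()'s stored coefficients
coef(est)                                      # (Intercept) and expend
all.equal(unname(coef(est)), c(beta0, beta1))  # should return TRUE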
Often, it’s helpful to visualize the estimated regression model. Wooldridge (2015) calls this the “Sample Regression Function.” We can do this with the following code:
ggplot(df,aes(expend,math10)) +
geom_point() +
geom_smooth(method='lm',size = 2) +
scale_x_continuous(limits = c(3000,7500),breaks = seq(3000,7500,500)) +
scale_y_continuous(limits = c(0,70),breaks = seq(0,70,10))
## `geom_smooth()` using formula 'y ~ x'
Let’s consider a modified version of our model, where now we use log expenditures instead of expenditures. Why might we want to use log expenditures? Likely because we think that each additional dollar spent doesn’t have an equal effect on pass rates. That is, additional dollars spent likely have diminishing effects on pass rates. (See also: the Law of Diminishing Marginal Returns)
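You can see the diminishing-returns logic directly in the log function: equal dollar increments translate into smaller and smaller increases in log dollars. A quick illustration (the spending levels below are arbitrary, chosen only for demonstration):
# equal $1,000 increments shrink in log terms as spending rises
log(5000) - log(4000)  # ~0.223
log(7000) - log(6000)  # ~0.154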
Create the log expenditures variable using mutate():
df <- df %>%
mutate(logexpend = log(expend))
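As an aside, meap93 already ships with a log-expenditure variable (lexpend, visible in the glimpse() output above), so an optional sanity check is to confirm that your new variable agrees with it:
# optional check: our logexpend should equal the built-in lexpend
all.equal(df$logexpend, df$lexpend)  # should return TRUE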
Now estimate your model again and re-do the visualization:
est <- lm(math10 ~ logexpend, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -69.3 26.5 -2.61 0.00929
## 2 logexpend 11.2 3.17 3.52 0.000475
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0297 0.0273 10.3 12.4 4.75e-4 2 -1531. 3069. 3081.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = math10 ~ logexpend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.343 -7.100 -0.914 6.148 39.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -69.341 26.530 -2.614 0.009290 **
## logexpend 11.164 3.169 3.523 0.000475 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 406 degrees of freedom
## Multiple R-squared: 0.02966, Adjusted R-squared: 0.02727
## F-statistic: 12.41 on 1 and 406 DF, p-value: 0.0004752
ggplot(df,aes(logexpend,math10)) +
geom_point() +
geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
What is the interpretation of \(\beta_1\) in this new model? (Add it as a comment in your .R script.)
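One way to check the interpretation you write down is to compute the implied effects directly. This sketch uses the standard level-log approximation (a 1% increase in x changes y by about \(\beta_1/100\)); the 10% scenario is just an illustrative choice:
# level-log model: implied effects of spending increases on math10
coef(est)["logexpend"] / 100        # ~0.11 percentage points per 1% more spending
coef(est)["logexpend"] * log(1.10)  # ~1.06 percentage points for a 10% increase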
Finally, we can look at the standard errors, t-statistics, and p-values associated with our regression parameters \(\beta_0\) and \(\beta_1\). The p.value reported in tidy(est) tests the following hypothesis:
\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]
Does increased school spending significantly increase the math test pass rate?
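You can reconstruct the reported test statistic yourself: the t-statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with \(n-2\) degrees of freedom. A minimal sketch using the logexpend model estimated above:
# reproduce the slope's t-statistic and two-sided p-value by hand
b1    <- coef(est)["logexpend"]
se1   <- tidy(est)$std.error[2]
tstat <- b1 / se1                            # ~3.52, matches tidy(est)
2 * pt(-abs(tstat), df = df.residual(est))   # ~0.000475, matches tidy(est)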
If you have extra time, try computing the standard errors by hand, according to the formulas in the textbook. To do so, we need the following quantities: sig (the estimated standard deviation of \(u\)), n (our regression’s sample size), SSTx (\(n-1\) times the variance of logexpend), and sumx2 (the sum of the squares of logexpend):
n <- dim(df)[1]
n
## [1] 408
sig <- sqrt(sum(est$residuals^2)/(n-2) ) # or, more simply, glance(est)$sigma
sig
## [1] 10.34953
SSTx <- (n-1)*var(df$logexpend)
sumx2 <- sum(df$logexpend^2)
The standard error of the intercept is computed with the following formula:
sqrt((sig^2*(1/n)*sumx2)/SSTx)
## [1] 26.53013
And the standard error of the slope coefficient (logexpend in this case) is:
sqrt(sig^2/SSTx)
## [1] 3.169011
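Both hand-computed standard errors should line up with what broom reported earlier. One way to confirm:
# compare the by-hand standard errors to broom's
tidy(est)$std.error  # should return ~26.53 (intercept) and ~3.169 (logexpend)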
Wooldridge, Jeffrey M. 2015. Introductory Econometrics: A Modern Approach. 6th ed. Cengage Learning.