The purpose of this in-class lab is to practice running regressions, computing regression formulas, visualizing the Sample Regression Function, using non-linear transformations, and interpreting coefficients. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL3_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
library(tidyverse)
library(broom)
library(wooldridge)
For this lab, let’s use data on school expenditures and math test pass rates from Michigan. This is located in the meap93 data set in the wooldridge package. Each observation is a school district in Michigan.
df <- as_tibble(meap93)
glimpse(df)
## Observations: 408
## Variables: 17
## $ lnchprg <dbl> 1.4, 2.3, 2.7, 3.4, 3.4, 3.4, 3.6, 3.6, 4.2, 4.2, 4.5, 4.5...
## $ enroll <int> 1862, 11355, 7685, 1148, 1572, 2496, 3358, 11983, 3499, 50...
## $ staff <dbl> 112.6, 101.2, 114.0, 85.4, 96.1, 101.1, 87.6, 96.0, 102.8,...
## $ expend <int> 5765, 6601, 6834, 3586, 3847, 5070, 4474, 5159, 5012, 6501...
## $ salary <dbl> 37498, 48722, 44541, 31566, 29781, 36801, 37863, 40133, 36...
## $ benefits <int> 7420, 10370, 7313, 5989, 5545, 5895, 6934, 8085, 7253, 738...
## $ droprate <dbl> 2.9, 1.3, 3.5, 3.6, 0.0, 2.7, 1.4, 2.5, 1.8, 1.9, 3.0, 5.6...
## $ gradrate <dbl> 89.2, 91.4, 91.4, 86.6, 100.0, 89.2, 95.1, 87.8, 93.2, 92....
## $ math10 <dbl> 56.4, 42.7, 43.8, 25.3, 15.3, 46.0, 33.6, 40.1, 42.1, 39.8...
## $ sci11 <dbl> 67.9, 65.3, 54.3, 60.0, 65.8, 60.5, 67.4, 69.4, 71.7, 55.0...
## $ totcomp <dbl> 44918, 59092, 51854, 37555, 35326, 42696, 44797, 48218, 43...
## $ ltotcomp <dbl> 10.71259, 10.98685, 10.85619, 10.53356, 10.47237, 10.66186...
## $ lexpend <dbl> 8.659560, 8.794976, 8.829665, 8.184793, 8.255049, 8.531096...
## $ lenroll <dbl> 7.529407, 9.337414, 8.947025, 7.045776, 7.360104, 7.822445...
## $ lstaff <dbl> 4.723842, 4.617099, 4.736198, 4.447346, 4.565389, 4.616110...
## $ bensal <dbl> 0.1978772134, 0.2128401995, 0.1641858071, 0.1897294521, 0....
## $ lsalary <dbl> 10.53204, 10.79389, 10.70417, 10.35984, 10.30163, 10.51328...
Estimate the following regression model:
\[ math10 = \beta_0 + \beta_1 expend + u \]
The code to do so is:
est <- lm(math10 ~ expend, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 13.4 2.93 4.55 0.00000700
## 2 expend 0.00246 0.000660 3.72 0.000227
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0330 0.0306 10.3 13.8 2.27e-4 2 -1531. 3067. 3079.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = math10 ~ expend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.579 -7.175 -0.874 6.299 39.174
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.336e+01 2.934e+00 4.553 7e-06 ***
## expend 2.456e-03 6.601e-04 3.720 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.33 on 406 degrees of freedom
## Multiple R-squared: 0.03296, Adjusted R-squared: 0.03058
## F-statistic: 13.84 on 1 and 406 DF, p-value: 0.0002273
You should get a coefficient of 0.00246 on expend. Interpret this coefficient. (You can type the interpretation as a comment in your .R script.) Is this number small, given the units that math10 and expend are in?
Verify that the regression coefficients in est are the same as the formulas from the book:
\[ \hat{\beta}_0 = \overline{math10} - \hat{\beta}_1 \overline{expend}, \\ \hat{\beta}_1 = \frac{\widehat{cov}(math10,expend)}{\widehat{var}(expend)} \]
You can do this by typing:
beta1 <- cov(df$math10,df$expend)/var(df$expend)
beta0 <- mean(df$math10)-beta1*mean(df$expend)
beta1
## [1] 0.002455715
beta0
## [1] 13.35923
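As a sanity check, you can compare these by-hand values against the coefficients stored in est. A minimal sketch (all.equal allows for floating-point rounding):
# the by-hand estimates should match lm()'s stored coefficients
coef(est)                                      # (Intercept) and expend
all.equal(unname(coef(est)), c(beta0, beta1))  # should return TRUE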
Often, it’s helpful to visualize the estimated regression model. Wooldridge (2015) calls this the “Sample Regression Function.” We can do this with the following code:
ggplot(df,aes(expend,math10)) +
geom_point() +
geom_smooth(method='lm',size = 2) +
scale_x_continuous(limits = c(3000,7500),breaks = seq(3000,7500,500)) +
scale_y_continuous(limits = c(0,70),breaks = seq(0,70,10))
## `geom_smooth()` using formula 'y ~ x'
Let’s consider a modified version of our model, where now we use log expenditures instead of expenditures. Why might we want to use log expenditures? Likely because we think that each additional dollar spent doesn’t have an equal effect on pass rates. That is, additional dollars spent likely have diminishing effects on pass rates. (See also: the Law of Diminishing Marginal Returns)
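You can see the diminishing-returns logic directly in the log function: equal dollar increments translate into smaller and smaller increases in log dollars. A quick illustration (the spending levels below are arbitrary, chosen only for demonstration):
# equal $1,000 increments shrink in log terms as spending rises
log(5000) - log(4000)  # ~0.223
log(7000) - log(6000)  # ~0.154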
Create the log expenditures variable using mutate():
df <- df %>%
mutate(logexpend = log(expend))
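As an aside, meap93 already ships with a log-expenditure variable (lexpend, visible in the glimpse() output above), so an optional sanity check is to confirm that your new variable agrees with it:
# optional check: our logexpend should equal the built-in lexpend
all.equal(df$logexpend, df$lexpend)  # should return TRUE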
Now estimate your model again and re-do the visualization:
est <- lm(math10 ~ logexpend, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -69.3 26.5 -2.61 0.00929
## 2 logexpend 11.2 3.17 3.52 0.000475
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0297 0.0273 10.3 12.4 4.75e-4 2 -1531. 3069. 3081.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = math10 ~ logexpend, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.343 -7.100 -0.914 6.148 39.093
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -69.341 26.530 -2.614 0.009290 **
## logexpend 11.164 3.169 3.523 0.000475 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.35 on 406 degrees of freedom
## Multiple R-squared: 0.02966, Adjusted R-squared: 0.02727
## F-statistic: 12.41 on 1 and 406 DF, p-value: 0.0004752
ggplot(df,aes(logexpend,math10)) +
geom_point() +
geom_smooth(method='lm')
## `geom_smooth()` using formula 'y ~ x'
What is the interpretation of \(\beta_1\) in this new model? (Add it as a comment in your .R script.)
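One way to check the interpretation you write down is to compute the implied effects directly. This sketch uses the standard level-log approximation (a 1% increase in x changes y by about \(\beta_1/100\)); the 10% scenario is just an illustrative choice:
# level-log model: implied effects of spending increases on math10
coef(est)["logexpend"] / 100        # ~0.11 percentage points per 1% more spending
coef(est)["logexpend"] * log(1.10)  # ~1.06 percentage points for a 10% increase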
Finally, we can look at the standard errors, t-statistics, and p-values associated with our regression parameters \(\beta_0\) and \(\beta_1\). The p.value reported in tidy(est) tests the following hypothesis:
\[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \]
Does increased school spending significantly increase the math test pass rate?
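You can reconstruct the reported test statistic yourself: the t-statistic is the estimate divided by its standard error, and the two-sided p-value comes from a t distribution with \(n-2\) degrees of freedom. A minimal sketch using the logexpend model estimated above:
# reproduce the slope's t-statistic and two-sided p-value by hand
b1    <- coef(est)["logexpend"]
se1   <- tidy(est)$std.error[2]
tstat <- b1 / se1                            # ~3.52, matches tidy(est)
2 * pt(-abs(tstat), df = df.residual(est))   # ~0.000475, matches tidy(est)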
If you have extra time, try computing the standard errors by hand, according to the formulas in the textbook. To do so, we need the following quantities: sig (the estimated standard deviation of \(u\)), n (our regression’s sample size), SSTx (\(n-1\) times the variance of logexpend), and sumx2 (the sum of the squares of logexpend):
n <- dim(df)[1]
n
## [1] 408
sig <- sqrt(sum(est$residuals^2)/(n-2) ) # or, more simply, glance(est)$sigma
sig
## [1] 10.34953
SSTx <- (n-1)*var(df$logexpend)
sumx2 <- sum(df$logexpend^2)
The standard error of the intercept is computed with the following formula:
sqrt((sig^2*(1/n)*sumx2)/SSTx)
## [1] 26.53013
And the standard error of the slope coefficient (logexpend in this case) is:
sqrt(sig^2/SSTx)
## [1] 3.169011
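Both hand-computed standard errors should line up with what broom reported earlier. One way to confirm:
# compare the by-hand standard errors to broom's
tidy(est)$std.error  # should return ~26.53 (intercept) and ~3.169 (logexpend)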
Wooldridge, Jeffrey M. 2015. Introductory Econometrics: A Modern Approach. 6th ed. Cengage Learning.