The purpose of this in-class lab is to further practice your regression skills. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL4_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(skimr)
For this lab, let’s use data on house prices, located in the hprice1 data set in the wooldridge package. Each observation is a house.
df <- as_tibble(hprice1)
skim(df)
| Name                   | df   |
|------------------------|------|
| Number of rows         | 88   |
| Number of columns      | 10   |
| Column type frequency: |      |
| numeric                | 10   |
| Group variables        | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 293.55 | 102.71 | 111.00 | 230.00 | 265.50 | 326.25 | 725.00 | ▃▇▂▁▁ |
| assess | 0 | 1 | 315.74 | 95.31 | 198.70 | 253.90 | 290.20 | 352.12 | 708.60 | ▇▃▁▁▁ |
| bdrms | 0 | 1 | 3.57 | 0.84 | 2.00 | 3.00 | 3.00 | 4.00 | 7.00 | ▇▆▁▁▁ |
| lotsize | 0 | 1 | 9019.86 | 10174.15 | 1000.00 | 5732.75 | 6430.00 | 8583.25 | 92681.00 | ▇▁▁▁▁ |
| sqrft | 0 | 1 | 2013.69 | 577.19 | 1171.00 | 1660.50 | 1845.00 | 2227.00 | 3880.00 | ▆▇▃▁▁ |
| colonial | 0 | 1 | 0.69 | 0.46 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| lprice | 0 | 1 | 5.63 | 0.30 | 4.71 | 5.44 | 5.58 | 5.79 | 6.59 | ▁▅▇▂▁ |
| lassess | 0 | 1 | 5.72 | 0.26 | 5.29 | 5.54 | 5.67 | 5.86 | 6.56 | ▅▇▃▂▁ |
| llotsize | 0 | 1 | 8.91 | 0.54 | 6.91 | 8.65 | 8.77 | 9.06 | 11.44 | ▁▆▇▁▁ |
| lsqrft | 0 | 1 | 7.57 | 0.26 | 7.07 | 7.41 | 7.52 | 7.71 | 8.26 | ▂▇▅▂▁ |
Check out what’s in df by typing
glimpse(df)
## Observations: 88
## Variables: 10
## $ price <dbl> 300.000, 370.000, 191.000, 195.000, 373.000, 466.275, 332....
## $ assess <dbl> 349.1, 351.5, 217.7, 231.8, 319.1, 414.5, 367.8, 300.2, 23...
## $ bdrms <int> 4, 3, 3, 3, 4, 5, 3, 3, 3, 3, 4, 5, 3, 3, 3, 4, 4, 3, 3, 4...
## $ lotsize <dbl> 6126, 9903, 5200, 4600, 6095, 8566, 9000, 6210, 6000, 2892...
## $ sqrft <int> 2438, 2076, 1374, 1448, 2514, 2754, 2067, 1731, 1767, 1890...
## $ colonial <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1...
## $ lprice <dbl> 5.703783, 5.913503, 5.252274, 5.273000, 5.921578, 6.144775...
## $ lassess <dbl> 5.855359, 5.862210, 5.383118, 5.445875, 5.765504, 6.027073...
## $ llotsize <dbl> 8.720297, 9.200593, 8.556414, 8.433811, 8.715224, 9.055556...
## $ lsqrft <dbl> 7.798934, 7.638198, 7.225482, 7.277938, 7.829630, 7.920810...
Let’s estimate the following regression model:
\[ price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u \]
where \(price\) is the house price in thousands of dollars.
The code to do so is:
est <- lm(price ~ sqrft + bdrms, data=df)
tidy(est)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.3 31.0 -0.622 5.36e- 1
## 2 sqrft 0.128 0.0138 9.29 1.39e-14
## 3 bdrms 15.2 9.48 1.60 1.13e- 1
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.632 0.623 63.0 73.0 3.57e-19 3 -488. 984. 994.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = price ~ sqrft + bdrms, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.627 -42.876 -7.051 32.589 229.003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.31500 31.04662 -0.622 0.536
## sqrft 0.12844 0.01382 9.291 1.39e-14 ***
## bdrms 15.19819 9.48352 1.603 0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.04 on 85 degrees of freedom
## Multiple R-squared: 0.6319, Adjusted R-squared: 0.6233
## F-statistic: 72.96 on 2 and 85 DF, p-value: < 2.2e-16
You should get a coefficient of 0.128 on sqrft and 15.2 on bdrms. Interpret these coefficients. (You can type the interpretation as a comment in your .R script.) Do these numbers seem reasonable?
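To make the interpretation concrete, here is a small sketch (not a required lab step) that converts the sqrft slope into a dollar figure. Remember that price is measured in thousands of dollars, so the slope must be rescaled:

```r
library(wooldridge)

est <- lm(price ~ sqrft + bdrms, data = hprice1)

# Since price is in $1000s, the sqrft slope times 100 gives the predicted
# change in price (in $1000s) for 100 extra square feet, holding bdrms fixed:
coef(est)["sqrft"] * 100   # about 12.8, i.e. roughly $12,800
```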
You should get \(R^2 = 0.632\). Based on that number, do you think this is a good model of house prices?
Check that the average of the residuals is zero:
mean(est$residuals)
## [1] -6.484712e-16
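Relatedly, OLS residuals are uncorrelated with each regressor by construction (the first-order conditions force this). A quick sketch along the same lines:

```r
library(wooldridge)

est <- lm(price ~ sqrft + bdrms, data = hprice1)

# Each covariance should be zero up to floating-point error, just like the
# mean of the residuals:
cov(est$residuals, hprice1$sqrft)
cov(est$residuals, hprice1$bdrms)
```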
The previous regression model had an estimated intercept of -19.3, meaning that a home with no bedrooms and no square footage would be expected to have a sales price of -$19,300. To fix this, let’s instead use \(\log(price)\) as the dependent variable, and let’s also add quadratic terms for sqrft and bdrms.
First, let’s use mutate() to add these new variables:
df <- df %>% mutate(logprice = log(price), sqrftSq = sqrft^2, bdrmSq = bdrms^2)
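As an aside (a sketch, not a required lab step): the same quadratic model can be fit without creating new columns by wrapping the transformed terms in I() inside the formula:

```r
library(wooldridge)

# I() tells lm() to square the variables on the fly, so the mutate() step
# above is optional:
est_alt <- lm(log(price) ~ sqrft + I(sqrft^2) + bdrms + I(bdrms^2),
              data = hprice1)
coef(est_alt)
```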
Now run the new model:
est <- lm(logprice ~ sqrft + sqrftSq + bdrms + bdrmSq, data=df)
tidy(est)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.07e+ 0 0.325 15.6 2.19e-26
## 2 sqrft 3.74e- 4 0.000247 1.51 1.34e- 1
## 3 sqrftSq 7.10e-10 0.0000000508 0.0140 9.89e- 1
## 4 bdrms -1.30e- 1 0.145 -0.898 3.72e- 1
## 5 bdrmSq 1.99e- 2 0.0178 1.12 2.67e- 1
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.595 0.576 0.198 30.5 1.28e-15 5 20.3 -28.7 -13.8
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = logprice ~ sqrft + sqrftSq + bdrms + bdrmSq, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73800 -0.11292 -0.01387 0.10889 0.62609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.074e+00 3.254e-01 15.592 <2e-16 ***
## sqrft 3.742e-04 2.474e-04 1.512 0.134
## sqrftSq 7.098e-10 5.082e-08 0.014 0.989
## bdrms -1.302e-01 1.450e-01 -0.898 0.372
## bdrmSq 1.990e-02 1.780e-02 1.118 0.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1977 on 83 degrees of freedom
## Multiple R-squared: 0.5953, Adjusted R-squared: 0.5758
## F-statistic: 30.52 on 4 and 83 DF, p-value: 1.276e-15
The new coefficients have much smaller magnitudes. Explain why that might be.
The new \(R^2 = 0.595\) which is less than \(0.632\) from before. Does that mean this model is worse?
Let’s experiment with the Frisch-Waugh Theorem, which says:
\[ \hat{\beta}_{1} = \frac{\sum_{i=1}^{N} \hat{r}_{i1}y_{i}}{\sum_{i=1}^{N} \hat{r}_{i1}^2} \]
where \(\hat{r}_{i1}\) is the residual from a regression of \(x_{1}\) on \(x_{2},\ldots,x_{k}\).
Let’s do this for the model we just ran. First, regress sqrft on the other \(X\)’s and store the residuals as a new column in df.
est <- lm(sqrft ~ sqrftSq + bdrms + bdrmSq, data=df)
df <- df %>% mutate(sqrft.resid = est$residuals)
Now, if we run a simple regression of logprice on sqrft.resid, we should get the same coefficient as that of sqrft in the original regression (= 3.74e-4).
est <- lm(logprice ~ sqrft.resid, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.63 0.0324 174. 2.33e-111
## 2 sqrft.resid 0.000374 0.000380 0.985 3.28e- 1
est %>% summary()
##
## Call:
## lm(formula = logprice ~ sqrft.resid, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8888 -0.1714 -0.0728 0.1409 1.0217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6331801 0.0323666 174.043 <2e-16 ***
## sqrft.resid 0.0003742 0.0003800 0.985 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3036 on 86 degrees of freedom
## Multiple R-squared: 0.01115, Adjusted R-squared: -0.0003501
## F-statistic: 0.9696 on 1 and 86 DF, p-value: 0.3276
We can also compute the Frisch-Waugh formula by hand:
beta1 <- sum(df$sqrft.resid*df$logprice)/sum(df$sqrft.resid^2)
print(beta1)
## [1] 0.0003741526
This indeed matches the coefficient on sqrft from the original regression.
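As a final check (a self-contained sketch, not a required lab step), we can confirm programmatically that the hand-computed Frisch-Waugh coefficient matches the one from the full multiple regression:

```r
library(wooldridge)

# Full model and the auxiliary regression of sqrft on the other regressors:
full <- lm(log(price) ~ sqrft + I(sqrft^2) + bdrms + I(bdrms^2),
           data = hprice1)
aux  <- lm(sqrft ~ I(sqrft^2) + bdrms + I(bdrms^2), data = hprice1)

# Frisch-Waugh formula: the ratio below reproduces the sqrft coefficient
# from `full` up to numerical tolerance:
beta1 <- sum(resid(aux) * log(hprice1$price)) / sum(resid(aux)^2)
all.equal(unname(coef(full)["sqrft"]), beta1)
```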