The purpose of this in-class lab is to further practice your regression skills. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL4_XYZ.R, where XYZ are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(skimr)
For this lab, let’s use data on house prices, located in the hprice1 data set in the wooldridge package. Each observation is a house.
df <- as_tibble(hprice1)
skim(df)
| Name                   | df   |
|------------------------|------|
| Number of rows         | 88   |
| Number of columns      | 10   |
| Column type frequency: |      |
| numeric                | 10   |
| Group variables        | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| price | 0 | 1 | 293.55 | 102.71 | 111.00 | 230.00 | 265.50 | 326.25 | 725.00 | ▃▇▂▁▁ |
| assess | 0 | 1 | 315.74 | 95.31 | 198.70 | 253.90 | 290.20 | 352.12 | 708.60 | ▇▃▁▁▁ |
| bdrms | 0 | 1 | 3.57 | 0.84 | 2.00 | 3.00 | 3.00 | 4.00 | 7.00 | ▇▆▁▁▁ |
| lotsize | 0 | 1 | 9019.86 | 10174.15 | 1000.00 | 5732.75 | 6430.00 | 8583.25 | 92681.00 | ▇▁▁▁▁ |
| sqrft | 0 | 1 | 2013.69 | 577.19 | 1171.00 | 1660.50 | 1845.00 | 2227.00 | 3880.00 | ▆▇▃▁▁ |
| colonial | 0 | 1 | 0.69 | 0.46 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
| lprice | 0 | 1 | 5.63 | 0.30 | 4.71 | 5.44 | 5.58 | 5.79 | 6.59 | ▁▅▇▂▁ |
| lassess | 0 | 1 | 5.72 | 0.26 | 5.29 | 5.54 | 5.67 | 5.86 | 6.56 | ▅▇▃▂▁ |
| llotsize | 0 | 1 | 8.91 | 0.54 | 6.91 | 8.65 | 8.77 | 9.06 | 11.44 | ▁▆▇▁▁ |
| lsqrft | 0 | 1 | 7.57 | 0.26 | 7.07 | 7.41 | 7.52 | 7.71 | 8.26 | ▂▇▅▂▁ |
Check out what’s in df by typing
glimpse(df)
## Observations: 88
## Variables: 10
## $ price <dbl> 300.000, 370.000, 191.000, 195.000, 373.000, 466.275, 332....
## $ assess <dbl> 349.1, 351.5, 217.7, 231.8, 319.1, 414.5, 367.8, 300.2, 23...
## $ bdrms <int> 4, 3, 3, 3, 4, 5, 3, 3, 3, 3, 4, 5, 3, 3, 3, 4, 4, 3, 3, 4...
## $ lotsize <dbl> 6126, 9903, 5200, 4600, 6095, 8566, 9000, 6210, 6000, 2892...
## $ sqrft <int> 2438, 2076, 1374, 1448, 2514, 2754, 2067, 1731, 1767, 1890...
## $ colonial <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1...
## $ lprice <dbl> 5.703783, 5.913503, 5.252274, 5.273000, 5.921578, 6.144775...
## $ lassess <dbl> 5.855359, 5.862210, 5.383118, 5.445875, 5.765504, 6.027073...
## $ llotsize <dbl> 8.720297, 9.200593, 8.556414, 8.433811, 8.715224, 9.055556...
## $ lsqrft <dbl> 7.798934, 7.638198, 7.225482, 7.277938, 7.829630, 7.920810...
Let’s estimate the following regression model:
\[ price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u \]
where \(price\) is the house price in thousands of dollars.
The code to do so is:
est <- lm(price ~ sqrft + bdrms, data=df)
tidy(est)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -19.3 31.0 -0.622 5.36e- 1
## 2 sqrft 0.128 0.0138 9.29 1.39e-14
## 3 bdrms 15.2 9.48 1.60 1.13e- 1
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.632 0.623 63.0 73.0 3.57e-19 3 -488. 984. 994.
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = price ~ sqrft + bdrms, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -127.627 -42.876 -7.051 32.589 229.003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -19.31500 31.04662 -0.622 0.536
## sqrft 0.12844 0.01382 9.291 1.39e-14 ***
## bdrms 15.19819 9.48352 1.603 0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 63.04 on 85 degrees of freedom
## Multiple R-squared: 0.6319, Adjusted R-squared: 0.6233
## F-statistic: 72.96 on 2 and 85 DF, p-value: < 2.2e-16
You should get a coefficient of 0.128 on sqrft and 15.2 on bdrms. Interpret these coefficients. (You can type the interpretation as a comment in your .R script.) Do these numbers seem reasonable?
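To make the interpretation concrete, here is a small sketch (not a required lab step) that converts the sqrft slope into a dollar figure. Remember that price is measured in thousands of dollars, so the slope must be rescaled:

```r
library(wooldridge)

est <- lm(price ~ sqrft + bdrms, data = hprice1)

# Since price is in $1000s, the sqrft slope times 100 gives the predicted
# change in price (in $1000s) for 100 extra square feet, holding bdrms fixed:
coef(est)["sqrft"] * 100   # about 12.8, i.e. roughly $12,800
```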
You should get \(R^2 = 0.632\). Based on that number, do you think this is a good model of house prices?
Check that the average of the residuals is zero:
mean(est$residuals)
## [1] -6.484712e-16
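Relatedly, OLS residuals are uncorrelated with each regressor by construction (the first-order conditions force this). A quick sketch along the same lines:

```r
library(wooldridge)

est <- lm(price ~ sqrft + bdrms, data = hprice1)

# Each covariance should be zero up to floating-point error, just like the
# mean of the residuals:
cov(est$residuals, hprice1$sqrft)
cov(est$residuals, hprice1$bdrms)
```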
The previous regression model had an estimated intercept of -19.3, meaning that a home with no bedrooms and no square footage would be expected to have a sales price of -$19,300. To fix this, let’s instead use \(\log(price)\) as the dependent variable, and let’s also add quadratic terms for sqrft and bdrms.
First, let’s use mutate() to add these new variables:
df <- df %>% mutate(logprice = log(price), sqrftSq = sqrft^2, bdrmSq = bdrms^2)
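As an aside (a sketch, not a required lab step): the same quadratic model can be fit without creating new columns by wrapping the transformed terms in I() inside the formula:

```r
library(wooldridge)

# I() tells lm() to square the variables on the fly, so the mutate() step
# above is optional:
est_alt <- lm(log(price) ~ sqrft + I(sqrft^2) + bdrms + I(bdrms^2),
              data = hprice1)
coef(est_alt)
```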
Now run the new model:
est <- lm(logprice ~ sqrft + sqrftSq + bdrms + bdrmSq, data=df)
tidy(est)
## # A tibble: 5 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.07e+ 0 0.325 15.6 2.19e-26
## 2 sqrft 3.74e- 4 0.000247 1.51 1.34e- 1
## 3 sqrftSq 7.10e-10 0.0000000508 0.0140 9.89e- 1
## 4 bdrms -1.30e- 1 0.145 -0.898 3.72e- 1
## 5 bdrmSq 1.99e- 2 0.0178 1.12 2.67e- 1
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.595 0.576 0.198 30.5 1.28e-15 5 20.3 -28.7 -13.8
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
summary(est)
##
## Call:
## lm(formula = logprice ~ sqrft + sqrftSq + bdrms + bdrmSq, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73800 -0.11292 -0.01387 0.10889 0.62609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.074e+00 3.254e-01 15.592 <2e-16 ***
## sqrft 3.742e-04 2.474e-04 1.512 0.134
## sqrftSq 7.098e-10 5.082e-08 0.014 0.989
## bdrms -1.302e-01 1.450e-01 -0.898 0.372
## bdrmSq 1.990e-02 1.780e-02 1.118 0.267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1977 on 83 degrees of freedom
## Multiple R-squared: 0.5953, Adjusted R-squared: 0.5758
## F-statistic: 30.52 on 4 and 83 DF, p-value: 1.276e-15
The new coefficients have much smaller magnitudes. Explain why that might be.
The new \(R^2 = 0.595\) which is less than \(0.632\) from before. Does that mean this model is worse?
Let’s experiment with the Frisch-Waugh Theorem, which says:
\[ \hat{\beta}_{1} = \frac{\sum_{i=1}^{N} \hat{r}_{i1}y_{i}}{\sum_{i=1}^{N} \hat{r}_{i1}^2} \]
where \(\hat{r}_{i1}\) is the residual from a regression of \(x_{1}\) on \(x_{2},\ldots,x_{k}\).
Let’s do this for the model we just ran. First, regress sqrft on the other \(X\)’s and store the residuals as a new column in df.
est <- lm(sqrft ~ sqrftSq + bdrms + bdrmSq, data=df)
df <- df %>% mutate(sqrft.resid = est$residuals)
Now, if we run a simple regression of logprice on sqrft.resid, we should get the same coefficient as that of sqrft in the original regression (= 3.74e-4).
est <- lm(logprice ~ sqrft.resid, data=df)
tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 5.63 0.0324 174. 2.33e-111
## 2 sqrft.resid 0.000374 0.000380 0.985 3.28e- 1
est %>% summary()
##
## Call:
## lm(formula = logprice ~ sqrft.resid, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8888 -0.1714 -0.0728 0.1409 1.0217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.6331801 0.0323666 174.043 <2e-16 ***
## sqrft.resid 0.0003742 0.0003800 0.985 0.328
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3036 on 86 degrees of freedom
## Multiple R-squared: 0.01115, Adjusted R-squared: -0.0003501
## F-statistic: 0.9696 on 1 and 86 DF, p-value: 0.3276
We can also compute the Frisch-Waugh formula by hand:
beta1 <- sum(df$sqrft.resid*df$logprice)/sum(df$sqrft.resid^2)
print(beta1)
## [1] 0.0003741526
This indeed matches the coefficient on sqrft from the original regression.
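As a final check (a self-contained sketch, not a required lab step), we can confirm programmatically that the hand-computed Frisch-Waugh coefficient matches the one from the full multiple regression:

```r
library(wooldridge)

# Full model and the auxiliary regression of sqrft on the other regressors:
full <- lm(log(price) ~ sqrft + I(sqrft^2) + bdrms + I(bdrms^2),
           data = hprice1)
aux  <- lm(sqrft ~ I(sqrft^2) + bdrms + I(bdrms^2), data = hprice1)

# Frisch-Waugh formula: the ratio below reproduces the sqrft coefficient
# from `full` up to numerical tolerance:
beta1 <- sum(resid(aux) * log(hprice1$price)) / sum(resid(aux)^2)
all.equal(unname(coef(full)["sqrft"]), beta1)
```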