The purpose of this in-class lab6 is to practice using dummy variables in R. The lab6 should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL6_XYZ.R
, where XYZ
are your initials) and add the usual “preamble” to the top:
# Add names of group members HERE
library(tidyverse)
library(broom)
library(wooldridge)
library(skimr)
library(car)
Also install the package magrittr
by typing in the console:
# install.packages("magrittr", repos='http://cran.us.r-project.org')
and then add to the preamble of your script
library(magrittr)
The magrittr
package contains extra features for writing even more expressive code.
We’ll use a new data set on extramarital affiars, called affairs
.
df <- as_tibble(affairs)
df %>% skim()
Name | Piped data |
Number of rows | 601 |
Number of columns | 19 |
_______________________ | |
Column type frequency: | |
numeric | 19 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 1059.72 | 914.90 | 4.00 | 528 | 1009 | 1453 | 9029 | ▇▁▁▁▁ |
male | 0 | 1 | 0.48 | 0.50 | 0.00 | 0 | 0 | 1 | 1 | ▇▁▁▁▇ |
age | 0 | 1 | 32.49 | 9.29 | 17.50 | 27 | 32 | 37 | 57 | ▃▇▂▂▁ |
yrsmarr | 0 | 1 | 8.18 | 5.57 | 0.12 | 4 | 7 | 15 | 15 | ▆▅▃▃▇ |
kids | 0 | 1 | 0.72 | 0.45 | 0.00 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
relig | 0 | 1 | 3.12 | 1.17 | 1.00 | 2 | 3 | 4 | 5 | ▂▇▆▇▃ |
educ | 0 | 1 | 16.17 | 2.40 | 9.00 | 14 | 16 | 18 | 20 | ▁▂▆▇▇ |
occup | 0 | 1 | 4.19 | 1.82 | 1.00 | 3 | 5 | 6 | 7 | ▅▂▂▇▆ |
ratemarr | 0 | 1 | 3.93 | 1.10 | 1.00 | 3 | 4 | 5 | 5 | ▁▂▃▇▇ |
naffairs | 0 | 1 | 1.46 | 3.30 | 0.00 | 0 | 0 | 0 | 12 | ▇▁▁▁▁ |
affair | 0 | 1 | 0.25 | 0.43 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
vryhap | 0 | 1 | 0.39 | 0.49 | 0.00 | 0 | 0 | 1 | 1 | ▇▁▁▁▅ |
hapavg | 0 | 1 | 0.32 | 0.47 | 0.00 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
avgmarr | 0 | 1 | 0.15 | 0.36 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
unhap | 0 | 1 | 0.11 | 0.31 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
vryrel | 0 | 1 | 0.12 | 0.32 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▁ |
smerel | 0 | 1 | 0.32 | 0.47 | 0.00 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
slghtrel | 0 | 1 | 0.21 | 0.41 | 0.00 | 0 | 0 | 0 | 1 | ▇▁▁▁▂ |
notrel | 0 | 1 | 0.27 | 0.45 | 0.00 | 0 | 0 | 1 | 1 | ▇▁▁▁▃ |
Check out what’s in the data by typing
glimpse(df)
## Observations: 601
## Variables: 19
## $ id <int> 4, 5, 6, 11, 12, 16, 23, 29, 43, 44, 45, 47, 49, 50, 53, 5...
## $ male <int> 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0...
## $ age <dbl> 37, 27, 27, 32, 27, 57, 22, 32, 37, 22, 57, 32, 22, 37, 32...
## $ yrsmarr <dbl> 10.000, 4.000, 1.500, 15.000, 4.000, 15.000, 0.750, 1.500,...
## $ kids <int> 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0...
## $ relig <int> 3, 4, 3, 1, 3, 5, 2, 2, 5, 2, 2, 4, 4, 2, 3, 4, 5, 4, 2, 2...
## $ educ <int> 18, 14, 18, 12, 17, 18, 17, 17, 18, 12, 14, 16, 14, 20, 17...
## $ occup <int> 7, 6, 4, 1, 1, 6, 6, 5, 6, 1, 4, 1, 4, 7, 5, 6, 6, 5, 1, 5...
## $ ratemarr <int> 4, 4, 4, 4, 5, 5, 3, 5, 2, 3, 4, 2, 5, 2, 2, 4, 4, 5, 5, 4...
## $ naffairs <int> 0, 0, 3, 0, 3, 0, 0, 0, 7, 0, 0, 0, 0, 0, 12, 0, 0, 1, 1, ...
## $ affair <int> 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0...
## $ vryhap <int> 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0...
## $ hapavg <int> 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1...
## $ avgmarr <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ unhap <int> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0...
## $ vryrel <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0...
## $ smerel <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0...
## $ slghtrel <int> 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0...
## $ notrel <int> 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1...
You’ll notice that there are a number of variables that only take on 0/1 values: male
, kids
, affair
, hapavg
, vryrel
, etc. There are also variables that take on a few different values: relig
, occup
, and ratemarr
.
Let’s convert our 0/1 numeric variable male
to a factor with levels "male"
and "female"
:
df %<>% mutate(male = factor(male),
male = fct_recode(male, yes = "1", no = "0"))
The %<>%
operator is shorthand for df <- df %>% mutate(...)
. In other words, %<>%
pipes forwards and then pipes everything backwards.
The first part of the mutate()
function converts the 0/1 values to categories named "0"
and "1"
. The second part gives the categories more descriptive labels ("male"
and "female"
).
Let’s repeat this for some of the other variables: ratemarr
, relig
, kids
, and affair
:
df %<>% mutate(ratemarr = factor(ratemarr),
ratemarr = fct_recode(ratemarr,
very_happy = "5",
happy = "4",
average = "3",
unhappy = "2",
very_unhappy = "1")) %>%
mutate(relig = factor(relig),
relig = fct_recode(relig,
very_relig = "5",
relig = "4",
average = "3",
not_relig = "2",
not_at_all_relig = "1")) %>%
mutate(kids = factor(kids),
kids = fct_recode(kids,
yes = "1",
no = "0")) %>%
mutate(affair = factor(affair),
affair = fct_recode(affair,
yes = "1",
no = "0"))
df %>% DT::datatable()
where we used multiple pipe operators (%>%
) to do each factor recode in a separate step. (You could have also put them all into one giant mutate()
statement.)
Do another glimpse(df)
to make sure the code worked in the way I told you it would.
You can look at the frequency of factor variables using the table()
function:
table(df$ratemarr)
##
## very_unhappy unhappy average happy very_happy
## 16 66 93 194 232
table(df$relig)
##
## not_at_all_relig not_relig average relig
## 48 164 129 190
## very_relig
## 70
table(df$ratemarr,df$kids)
##
## no yes
## very_unhappy 3 13
## unhappy 8 58
## average 24 69
## happy 40 154
## very_happy 96 136
You can also use the prop.table()
function to get shares within-row (margin=1
) or within-column (margin=2
):
table(df$ratemarr) %>% prop.table()
##
## very_unhappy unhappy average happy very_happy
## 0.0266223 0.1098170 0.1547421 0.3227953 0.3860233
table(df$ratemarr,df$kids) %>% prop.table(margin=1)
##
## no yes
## very_unhappy 0.1875000 0.8125000
## unhappy 0.1212121 0.8787879
## average 0.2580645 0.7419355
## happy 0.2061856 0.7938144
## very_happy 0.4137931 0.5862069
table(df$ratemarr,df$kids) %>% prop.table(margin=2)
##
## no yes
## very_unhappy 0.01754386 0.03023256
## unhappy 0.04678363 0.13488372
## average 0.14035088 0.16046512
## happy 0.23391813 0.35813953
## very_happy 0.56140351 0.31627907
You can also create a histogram of a factor variable in ggplot()
as follows:
ggplot(df,aes(x=ratemarr)) +
geom_bar()
This helps you visualize what share of the data falls into which category.
Let’s run a regression with naffairs
as the dependent variable and male
, yrsmarr
, kids
, and ratemarr
as the covariates.
df %>%
dplyr::select(male,yrsmarr,kids,ratemarr) %>%
map(table)
## $male
##
## no yes
## 315 286
##
## $yrsmarr
##
## 0.125 0.416999995708466 0.75 1.5
## 11 10 31 88
## 4 7 10 15
## 105 82 70 204
##
## $kids
##
## no yes
## 171 430
##
## $ratemarr
##
## very_unhappy unhappy average happy very_happy
## 16 66 93 194 232
est1 <- lm(naffairs ~ male + yrsmarr + kids + ratemarr, data=df)
summary(est1)
##
## Call:
## lm(formula = naffairs ~ male + yrsmarr + kids + ratemarr, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3742 -1.4138 -0.7915 -0.3155 11.8963
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.93389 0.84458 3.474 0.000551 ***
## maleyes 0.08284 0.25688 0.322 0.747209
## yrsmarr 0.08612 0.02846 3.026 0.002586 **
## kidsyes -0.21174 0.34867 -0.607 0.543906
## ratemarrunhappy 0.27746 0.87673 0.316 0.751755
## ratemarraverage -2.13596 0.85369 -2.502 0.012617 *
## ratemarrhappy -2.27518 0.82104 -2.771 0.005762 **
## ratemarrvery_happy -2.68300 0.82231 -3.263 0.001167 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.127 on 593 degrees of freedom
## Multiple R-squared: 0.1121, Adjusted R-squared: 0.1016
## F-statistic: 10.7 on 7 and 593 DF, p-value: 1.017e-12
Interpret the coefficient on ratemarrvery_happy
.
Let’s run the same regression as before, but this time use affair
as the dependent variable. What happens when you run the following code?
est2 <- lm(affair ~ male + yrsmarr + kids + ratemarr, data=df)
est2
##
## Call:
## lm(formula = affair ~ male + yrsmarr + kids + ratemarr, data = df)
##
## Coefficients:
## (Intercept) maleyes yrsmarr kidsyes
## 1.393596 0.039327 0.004154 0.051283
## ratemarrunhappy ratemarraverage ratemarrhappy ratemarrvery_happy
## 0.001378 -0.197250 -0.244766 -0.321248
R doesn’t want you to run a LPM because R was designed by statisticians who focus more on the “cons” of LPMs than on the “pros.”
To run the LPM, adjust the code by using as.numeric(affair)
as the dependent variable. Interpret the coefficients on ratemarraverage
and kidsyes
.
Finally, let’s run a more flexible model where we allow the effect of fathers and mothers to be different. The way to do this in lm()
is as follows:
est3 <- lm(as.numeric(affair) ~ male*kids + yrsmarr + ratemarr, data=df)
print(tidy(est3))
## # A tibble: 9 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.39 0.116 12.0 7.09e-30
## 2 maleyes 0.0549 0.0652 0.842 4.00e- 1
## 3 kidsyes 0.0616 0.0594 1.04 3.00e- 1
## 4 yrsmarr 0.00407 0.00382 1.06 2.88e- 1
## 5 ratemarrunhappy 0.000797 0.117 0.00680 9.95e- 1
## 6 ratemarraverage -0.197 0.114 -1.73 8.45e- 2
## 7 ratemarrhappy -0.244 0.110 -2.22 2.66e- 2
## 8 ratemarrvery_happy -0.321 0.110 -2.91 3.71e- 3
## 9 maleyes:kidsyes -0.0216 0.0770 -0.281 7.79e- 1
summary(est3)
##
## Call:
## lm(formula = as.numeric(affair) ~ male * kids + yrsmarr + ratemarr,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54352 -0.26529 -0.17741 -0.06676 0.93070
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3868488 0.1155082 12.006 < 2e-16 ***
## maleyes 0.0549119 0.0651912 0.842 0.39995
## kidsyes 0.0616286 0.0594000 1.038 0.29992
## yrsmarr 0.0040658 0.0038206 1.064 0.28769
## ratemarrunhappy 0.0007974 0.1173111 0.007 0.99458
## ratemarraverage -0.1973730 0.1142113 -1.728 0.08448 .
## ratemarrhappy -0.2441751 0.1098618 -2.223 0.02662 *
## ratemarrvery_happy -0.3205937 0.1100368 -2.914 0.00371 **
## maleyes:kidsyes -0.0216487 0.0769522 -0.281 0.77856
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4183 on 592 degrees of freedom
## Multiple R-squared: 0.07977, Adjusted R-squared: 0.06733
## F-statistic: 6.415 on 8 and 592 DF, p-value: 5.186e-08
The coefficient on the interaction term is labeled maleyes:kidsyes
. Do fathers have a differential rate of extramarital affairs compared to mothers?