The purpose of this lab2 is to practice using R to conduct hypothesis tests and run a basic OLS regression. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL2_XYZ.R
, where XYZ
are your initials) and add the following to the top:
library(tidyverse)
library(skimr)
library(broom)
library(wooldridge)
Load the dataset audit
from the wooldridge
package, like so:
df <- as_tibble(audit)
df %>% glimpse()
## Observations: 241
## Variables: 3
## $ w <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ b <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
The audit
data set contains three variables: w
, b
, and y
. The variables b
and w
respectively denote whether the black or white member of a pair of resumes was offered a job by the same employer. y
is simply the difference between the two, i.e. y=b-w
.
We want to test the following hypothesis:
\[ H_{0}: \mu=0 \\ H_{a}: \mu<0 \] where \(\mu = \theta_{B}-\theta_{W}\), i.e. the difference in respective job offer rates for blacks and whites.
t.test()
function in RTo conduct a t-test in R, simply provide the appropriate information to the t.test
function.
How do you know what the “appropriate information” is?
?t.test
and hit enter.Usage
it says t.test(x, ...)
.
x
. The ...
signals that we can provide it more than just x
.Arguments
it explains what x
is: “a (non-empty) numeric vector of data values”
t.test()
t.test()
. For example:
alternative
is "two.sided"
by defaultmu
is 0 by defaultNow let’s do the hypothesis test written above. Add the following code to your script:
t.test(df$y,alternative="less")
##
## One Sample t-test
##
## data: df$y
## t = -4.2768, df = 240, p-value = 1.369e-05
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
## -Inf -0.08151529
## sample estimates:
## mean of x
## -0.1327801
R automatically computes for us the t-statistic using the formula
\[ \frac{\overline{y} - \mu}{SE_{\bar{y}}} \]
All we had to give R was the sample of data (y
, in our case) and the null value (0, which is the t.test
default)!
t.test()
Now that we’ve conducted the t-test, how do we know the result of our hypothesis test? If you run your script, you should see something like
> t.test(df$y,alternative="less")
One Sample t-test
data: df$y
t = -4.2768, df = 240, p-value = 1.369e-05
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
-Inf -0.08151529
sample estimates:
mean of x
-0.1327801
R reports the value of the t-statistic, how many degrees of freedom, and the p-value associated with the test. R does not report the critical value, but the p-value provides the same information.
In this case, our p-value is approximatley 0.00001369, which is much lower than 0.05 (our significance level). Thus, we reject \(H_0\).
Now suppose instead we want to test if job offer rates of blacks are different from those of whites. We want to test the following hypothesis:
\[ H_0: \theta_{b}=\theta_{w}; \\ H_a: \theta_{b}\neq\theta_{w} \]
This hypothesis test considers the case where there might be reverse discrimination (e.g. through affirmative action policies).
The code to conduct this test is similar to the code we used previously. (Add the following code to your script:)
t.test(df$b,df$w,alternative="two.sided",paired=TRUE)
##
## Paired t-test
##
## data: df$b and df$w
## t = -4.2768, df = 240, p-value = 2.739e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1939385 -0.0716217
## sample estimates:
## mean of the differences
## -0.1327801
You’ll notice that the t-statistic is the exact same (-4.2768) for both of the tests. But the p-value for the two-tailed test is twice as large (0.00002739). This is because the two-tailed test must allow for the possibility of either direction of the \(\neq\) sign. (In other words, that the job offer rate for blacks could be higher or lower than for whites.)
Let’s load a new data set and run an OLS regression. This data set contains year-by-year statistics about counties in the US. It has counts on number of various crimes committed, as well as demographic characteristics about the county.
df <- as_tibble(countymurders)
df %>% skim()
Name | Piped data |
Number of rows | 37349 |
Number of columns | 20 |
_______________________ | |
Column type frequency: | |
numeric | 20 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
arrests | 504 | 0.99 | 6.78 | 50.13 | 0.00 | 0.00 | 1.00 | 3.00 | 2391.00 | ▇▁▁▁▁ |
countyid | 0 | 1.00 | 32921.93 | 15528.35 | 1001.00 | 20105.00 | 36065.00 | 48049.00 | 56045.00 | ▃▅▅▆▇ |
density | 0 | 1.00 | 252.24 | 1663.77 | 0.05 | 17.68 | 44.24 | 106.60 | 54058.77 | ▇▁▁▁▁ |
popul | 0 | 1.00 | 89343.54 | 271854.50 | 85.00 | 13144.00 | 28792.00 | 66480.00 | 9127751.00 | ▇▁▁▁▁ |
perc1019 | 0 | 1.00 | 15.58 | 1.97 | 7.08 | 14.32 | 15.42 | 16.73 | 30.48 | ▁▇▃▁▁ |
perc2029 | 0 | 1.00 | 14.58 | 3.70 | 5.62 | 12.30 | 14.27 | 16.20 | 40.52 | ▃▇▁▁▁ |
percblack | 0 | 1.00 | 7.82 | 13.29 | 0.00 | 0.20 | 1.45 | 8.74 | 86.28 | ▇▁▁▁▁ |
percmale | 0 | 1.00 | 43.35 | 3.72 | 35.15 | 40.90 | 41.81 | 45.87 | 78.04 | ▇▃▁▁▁ |
rpcincmaint | 3 | 1.00 | 165.45 | 97.49 | 5.49 | 96.25 | 145.15 | 209.88 | 1306.50 | ▇▁▁▁▁ |
rpcpersinc | 3 | 1.00 | 11272.29 | 2680.75 | 3477.76 | 9598.40 | 10907.50 | 12425.63 | 41094.22 | ▇▇▁▁▁ |
rpcunemins | 3 | 1.00 | 70.56 | 52.91 | 0.00 | 35.20 | 57.10 | 89.96 | 642.73 | ▇▁▁▁▁ |
year | 0 | 1.00 | 1988.00 | 4.90 | 1980.00 | 1984.00 | 1988.00 | 1992.00 | 1996.00 | ▇▆▆▆▇ |
murders | 0 | 1.00 | 7.29 | 47.22 | 0.00 | 0.00 | 1.00 | 3.00 | 1944.00 | ▇▁▁▁▁ |
murdrate | 0 | 1.00 | 0.51 | 0.85 | 0.00 | 0.00 | 0.24 | 0.74 | 39.84 | ▇▁▁▁▁ |
arrestrate | 504 | 0.99 | 0.51 | 1.23 | 0.00 | 0.00 | 0.16 | 0.70 | 148.66 | ▇▁▁▁▁ |
statefips | 0 | 1.00 | 32.82 | 15.50 | 1.00 | 20.00 | 36.00 | 48.00 | 56.00 | ▃▅▅▆▇ |
countyfips | 0 | 1.00 | 100.35 | 107.94 | 1.00 | 33.00 | 75.00 | 127.00 | 840.00 | ▇▁▁▁▁ |
execs | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 7.00 | ▇▁▁▁▁ |
lpopul | 0 | 1.00 | 10.35 | 1.33 | 4.44 | 9.48 | 10.27 | 11.10 | 16.03 | ▁▂▇▂▁ |
execrate | 0 | 1.00 | 0.00 | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 | 2.39 | ▇▁▁▁▁ |
A handy command to get a quick overview of an unfamiliar dataset is glimpse()
:
glimpse(df)
## Observations: 37,349
## Variables: 20
## $ arrests <int> 2, 3, 2, 7, 3, 1, 1, 2, 0, 5, 0, 1, 5, 3, 4, 5, 8, 4, 9...
## $ countyid <int> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ density <dbl> 54.05000, 53.66000, 53.75000, 53.78000, 53.91000, 54.11...
## $ popul <int> 32216, 31984, 32036, 32056, 32128, 32248, 32888, 33264,...
## $ perc1019 <dbl> 20.63000, 20.19000, 19.66000, 19.10000, 18.54000, 18.06...
## $ perc2029 <dbl> 15.28000, 15.55000, 15.73000, 15.88000, 15.92000, 15.87...
## $ percblack <dbl> 22.33000, 22.07000, 21.80000, 21.53000, 21.26000, 20.96...
## $ percmale <dbl> 40.25000, 40.36000, 40.42000, 40.47000, 40.51000, 40.45...
## $ rpcincmaint <dbl> 167.670, 167.990, 166.630, 176.530, 166.250, 153.120, 1...
## $ rpcpersinc <dbl> 8780.80, 8232.80, 8327.61, 8545.55, 8965.16, 9254.02, 9...
## $ rpcunemins <dbl> 29.160, 43.920, 71.410, 72.220, 40.360, 44.540, 38.350,...
## $ year <int> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1...
## $ murders <int> 2, 1, 3, 7, 2, 2, 4, 1, 0, 3, 1, 1, 1, 1, 1, 5, 7, 4, 6...
## $ murdrate <dbl> 0.6208096, 0.3126563, 0.9364465, 2.1836790, 0.6225100, ...
## $ arrestrate <dbl> 0.6208095, 0.9379690, 0.6242977, 2.1836790, 0.9337650, ...
## $ statefips <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ countyfips <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3...
## $ execs <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ lpopul <dbl> 10.38022, 10.37299, 10.37462, 10.37524, 10.37748, 10.38...
## $ execrate <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
glimpse()
tells you the number of observations, number of variables, and the name and type of each variable (e.g. integer, double).1
To run a regression of \(y\) on \(x\) in R, use the following syntax:
est <- lm(y ~ x, data=data.name)
Here, est
is an object where the regression coefficients (and other information about the model) is stored. lm()
stands for “linear model” and is the function that you call to tell R to compute the OLS coefficients. y
and x
are variables names from whatever tibble
you’ve stored your data in. The name of the tibble
is data.name
.
Using the df
data set we created above, let’s run a regression where murders
is the dependent variable and execs
is the independent variable:
est <- lm(murders ~ execs, data=df)
To view the output of the regression in a friendly format, type
broom::tidy(est)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.84 0.242 28.3 3.97e-174
## 2 execs 65.5 2.15 30.5 7.44e-202
In the estimate
column, we can see the estimated coefficients for \(\beta_0\)—(Intercept)
in this case—and \(\beta_1\) (execs
). est
also contains other information that we will use later in the course.
You can also look at the \(R^2\) by typing
glance(est)
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.0243 0.0243 46.6 930. 7.44e-202 2 -1.97e5 3.93e5 3.93e5
## # ... with 2 more variables: deviance <dbl>, df.residual <int>
Again, there’s a lot of information here, but for now just focus on the \(R^2\) term reported in the first column.
“double” means “double precision floating point” and is a computer science-y way of expressing a real number (as opposed to an integer or a rational number).↩