The purpose of this lab is to practice using R to conduct hypothesis tests and run a basic OLS regression. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.

2.1 For starters

Open up a new R script (named ICL2_XYZ.R, where XYZ are your initials) and add the following to the top:

library(tidyverse)
library(skimr)
library(broom)
library(wooldridge)

Load the dataset audit from the wooldridge package, like so:

df <- as_tibble(audit)
df %>% glimpse()
## Observations: 241
## Variables: 3
## $ w <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ b <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

2.2 A one-tailed test

The audit data set contains three variables: w, b, and y. The variables b and w indicate, respectively, whether the black or the white member of a matched pair of resumes was offered a job by the same employer. y is simply the difference between the two, i.e. y = b - w.
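
As a quick sanity check, we can verify that definition in the data. A minimal sketch, using the df tibble created above:

all(df$y == df$b - df$w)   # should print TRUE if y is defined as b - w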

We want to test the following hypothesis:

\[ H_{0}: \mu=0 \\ H_{a}: \mu<0 \] where \(\mu = \theta_{B}-\theta_{W}\), i.e. the difference in respective job offer rates for blacks and whites.
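
Before formally testing, it helps to look at the sample analogue of \(\mu\). A minimal sketch:

# Sketch: sample job offer rates and their difference (sample analogue of mu)
df %>% summarize(offer_b = mean(b), offer_w = mean(w), diff = mean(y))

The diff value should match the “mean of x” that t.test() reports below.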

2.2.1 The t.test() function in R

To conduct a t-test in R, simply provide the appropriate information to the t.test function.

How do you know what the “appropriate information” is?

  • In the RStudio console, type ?t.test and hit enter.
  • A help page should open in the bottom-right of your RStudio screen. The page should say “Student’s t-Test”
  • Under Usage it says t.test(x, ...).
    • This means that at minimum we only have to provide it with some object x. The ... signals that we can provide it more than just x.
  • Under Arguments it explains what x is: “a (non-empty) numeric vector of data values”
    • This means that R is expecting us to pass a column of a data frame to t.test()
  • The other information in the help explains default settings of t.test(). For example:
    • alternative is "two.sided" by default
    • mu is 0 by default
    • … other options that we won’t worry about right now
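
For instance, here is a sketch that spells those defaults out explicitly; it is equivalent to simply calling t.test(df$y):

t.test(df$y, mu = 0, alternative = "two.sided")   # identical to t.test(df$y)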

Now let’s do the hypothesis test written above. Add the following code to your script:

t.test(df$y,alternative="less")
## 
##  One Sample t-test
## 
## data:  df$y
## t = -4.2768, df = 240, p-value = 1.369e-05
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##         -Inf -0.08151529
## sample estimates:
##  mean of x 
## -0.1327801

R automatically computes the t-statistic for us using the formula

\[ t = \frac{\bar{y} - \mu_{0}}{SE_{\bar{y}}} \]

where \(\mu_0\) is the value of \(\mu\) under the null hypothesis.

All we had to give R was the sample of data (y, in our case) and the null value (0, which is the t.test default)!
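
If you want to see what t.test() is doing under the hood, here is a minimal sketch that reproduces the t-statistic and the lower-tail p-value by hand:

ybar  <- mean(df$y)                        # sample mean of y
se    <- sd(df$y) / sqrt(length(df$y))     # standard error of the mean
tstat <- (ybar - 0) / se                   # null value is mu = 0
pval  <- pt(tstat, df = length(df$y) - 1)  # lower-tail p-value with n - 1 df
c(t = tstat, p = pval)                     # should match the t.test() output above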

2.2.2 Interpreting the output of t.test()

Now that we’ve conducted the t-test, how do we know the result of our hypothesis test? If you run your script, you should see something like

> t.test(df$y,alternative="less")

    One Sample t-test

data:  df$y
t = -4.2768, df = 240, p-value = 1.369e-05
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
        -Inf -0.08151529
sample estimates:
 mean of x 
-0.1327801

R reports the value of the t-statistic, the degrees of freedom, and the p-value associated with the test. R does not report the critical value, but the p-value conveys the same information.

In this case, our p-value is approximately 0.00001369, which is much lower than 0.05 (our significance level). Thus, we reject \(H_0\).
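
If you want the decision rule itself in code, one option (a sketch) is to store the test result and compare its p-value to the significance level:

res <- t.test(df$y, alternative = "less")
res$p.value < 0.05   # TRUE means we reject H0 at the 5% level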

2.3 A two-tailed test

Now suppose instead we want to test if job offer rates of blacks are different from those of whites. We want to test the following hypothesis:

\[ H_0: \theta_{B}=\theta_{W} \\ H_a: \theta_{B}\neq\theta_{W} \]

This hypothesis test considers the case where there might be reverse discrimination (e.g. through affirmative action policies).

The code to conduct this test is similar to the code we used previously. Add the following code to your script:

t.test(df$b,df$w,alternative="two.sided",paired=TRUE)
## 
##  Paired t-test
## 
## data:  df$b and df$w
## t = -4.2768, df = 240, p-value = 2.739e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1939385 -0.0716217
## sample estimates:
## mean of the differences 
##              -0.1327801

You’ll notice that the t-statistic is exactly the same (-4.2768) for both tests, but the p-value for the two-tailed test is twice as large (0.00002739). This is because the two-tailed test must allow for either direction of the \(\neq\) sign; in other words, the job offer rate for blacks could be higher or lower than the rate for whites.
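
You can confirm both facts directly. Because y = b - w, a two-sided one-sample test on y is equivalent to the paired test, and doubling the lower-tail probability reproduces the two-tailed p-value. A sketch:

t.test(df$y, alternative = "two.sided")   # same t, df, and p-value as the paired test
2 * pt(-4.2768, df = 240)                 # approximately 2.739e-05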

2.4 Your first regression (of this class)

Let’s load a new data set and run an OLS regression. This data set contains year-by-year statistics about counties in the US: counts of various crimes committed, as well as demographic characteristics of each county.

df <- as_tibble(countymurders)
df %>% skim()
Data summary
  Name                    Piped data
  Number of rows          37349
  Number of columns       20
  Column type frequency:  numeric (20)
  Group variables         None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
arrests 504 0.99 6.78 50.13 0.00 0.00 1.00 3.00 2391.00 ▇▁▁▁▁
countyid 0 1.00 32921.93 15528.35 1001.00 20105.00 36065.00 48049.00 56045.00 ▃▅▅▆▇
density 0 1.00 252.24 1663.77 0.05 17.68 44.24 106.60 54058.77 ▇▁▁▁▁
popul 0 1.00 89343.54 271854.50 85.00 13144.00 28792.00 66480.00 9127751.00 ▇▁▁▁▁
perc1019 0 1.00 15.58 1.97 7.08 14.32 15.42 16.73 30.48 ▁▇▃▁▁
perc2029 0 1.00 14.58 3.70 5.62 12.30 14.27 16.20 40.52 ▃▇▁▁▁
percblack 0 1.00 7.82 13.29 0.00 0.20 1.45 8.74 86.28 ▇▁▁▁▁
percmale 0 1.00 43.35 3.72 35.15 40.90 41.81 45.87 78.04 ▇▃▁▁▁
rpcincmaint 3 1.00 165.45 97.49 5.49 96.25 145.15 209.88 1306.50 ▇▁▁▁▁
rpcpersinc 3 1.00 11272.29 2680.75 3477.76 9598.40 10907.50 12425.63 41094.22 ▇▇▁▁▁
rpcunemins 3 1.00 70.56 52.91 0.00 35.20 57.10 89.96 642.73 ▇▁▁▁▁
year 0 1.00 1988.00 4.90 1980.00 1984.00 1988.00 1992.00 1996.00 ▇▆▆▆▇
murders 0 1.00 7.29 47.22 0.00 0.00 1.00 3.00 1944.00 ▇▁▁▁▁
murdrate 0 1.00 0.51 0.85 0.00 0.00 0.24 0.74 39.84 ▇▁▁▁▁
arrestrate 504 0.99 0.51 1.23 0.00 0.00 0.16 0.70 148.66 ▇▁▁▁▁
statefips 0 1.00 32.82 15.50 1.00 20.00 36.00 48.00 56.00 ▃▅▅▆▇
countyfips 0 1.00 100.35 107.94 1.00 33.00 75.00 127.00 840.00 ▇▁▁▁▁
execs 0 1.00 0.01 0.11 0.00 0.00 0.00 0.00 7.00 ▇▁▁▁▁
lpopul 0 1.00 10.35 1.33 4.44 9.48 10.27 11.10 16.03 ▁▂▇▂▁
execrate 0 1.00 0.00 0.03 0.00 0.00 0.00 0.00 2.39 ▇▁▁▁▁

A handy command to get a quick overview of an unfamiliar dataset is glimpse():

glimpse(df)
## Observations: 37,349
## Variables: 20
## $ arrests     <int> 2, 3, 2, 7, 3, 1, 1, 2, 0, 5, 0, 1, 5, 3, 4, 5, 8, 4, 9...
## $ countyid    <int> 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1001, 1...
## $ density     <dbl> 54.05000, 53.66000, 53.75000, 53.78000, 53.91000, 54.11...
## $ popul       <int> 32216, 31984, 32036, 32056, 32128, 32248, 32888, 33264,...
## $ perc1019    <dbl> 20.63000, 20.19000, 19.66000, 19.10000, 18.54000, 18.06...
## $ perc2029    <dbl> 15.28000, 15.55000, 15.73000, 15.88000, 15.92000, 15.87...
## $ percblack   <dbl> 22.33000, 22.07000, 21.80000, 21.53000, 21.26000, 20.96...
## $ percmale    <dbl> 40.25000, 40.36000, 40.42000, 40.47000, 40.51000, 40.45...
## $ rpcincmaint <dbl> 167.670, 167.990, 166.630, 176.530, 166.250, 153.120, 1...
## $ rpcpersinc  <dbl> 8780.80, 8232.80, 8327.61, 8545.55, 8965.16, 9254.02, 9...
## $ rpcunemins  <dbl> 29.160, 43.920, 71.410, 72.220, 40.360, 44.540, 38.350,...
## $ year        <int> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1...
## $ murders     <int> 2, 1, 3, 7, 2, 2, 4, 1, 0, 3, 1, 1, 1, 1, 1, 5, 7, 4, 6...
## $ murdrate    <dbl> 0.6208096, 0.3126563, 0.9364465, 2.1836790, 0.6225100, ...
## $ arrestrate  <dbl> 0.6208095, 0.9379690, 0.6242977, 2.1836790, 0.9337650, ...
## $ statefips   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ countyfips  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 3...
## $ execs       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ lpopul      <dbl> 10.38022, 10.37299, 10.37462, 10.37524, 10.37748, 10.38...
## $ execrate    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

glimpse() tells you the number of observations, number of variables, and the name and type of each variable (e.g. integer, double).1

2.4.1 Regression syntax

To run a regression of \(y\) on \(x\) in R, use the following syntax:

est <- lm(y ~ x, data=data.name)

Here, est is an object where the regression coefficients (and other information about the model) are stored. lm() stands for “linear model” and is the function you call to tell R to compute the OLS coefficients. y and x are variable names from whatever tibble you’ve stored your data in; the name of that tibble is data.name.

2.4.2 Regress murders on executions

Using the df data set we created above, let’s run a regression where murders is the dependent variable and execs is the independent variable:

est <- lm(murders ~ execs, data=df)

To view the output of the regression in a friendly format, type

broom::tidy(est)
## # A tibble: 2 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)     6.84     0.242      28.3 3.97e-174
## 2 execs          65.5      2.15       30.5 7.44e-202

In the estimate column, we can see the estimated coefficients for \(\beta_0\) ((Intercept), in this case) and \(\beta_1\) (execs). est also contains other information that we will use later in the course.
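
To peek at what else est holds, you can list its components or pull out the coefficients directly. A sketch:

names(est)   # components stored in the lm object (coefficients, residuals, ...)
coef(est)    # just the estimated coefficients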

You can also look at the \(R^2\) by typing

glance(est)
## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic   p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <int>   <dbl>  <dbl>  <dbl>
## 1    0.0243        0.0243  46.6      930. 7.44e-202     2 -1.97e5 3.93e5 3.93e5
## # ... with 2 more variables: deviance <dbl>, df.residual <int>

Again, there’s a lot of information here, but for now just focus on the \(R^2\) term reported in the first column.
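
As a check on your understanding, the \(R^2\) can be reproduced by hand as one minus the ratio of the residual sum of squares to the total sum of squares. A sketch:

1 - sum(resid(est)^2) / sum((df$murders - mean(df$murders))^2)   # should be about 0.0243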


  1. “double” means “double precision floating point” and is a computer science-y way of expressing a real number (as opposed to an integer).