The purpose of this in-class lab is to familiarize yourself with R and RStudio. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.
You will see a blank section of screen in the top-left of your RStudio window. This is where you will write your first R script.
The bottom-left of the screen has a tab called “Console”. This is basically a very fancy calculator.
Try the calculator by typing something like
2+2
## [1] 4
Or even something fancier like
sqrt(pi)
## [1] 1.772454
R makes extensive use of third-party packages. We won’t get into the details right now, but for this class you will need to install a few of them. Installing packages is easy. Type the following three lines of code at the very top of your script:
install.packages("tidyverse", repos='http://cran.us.r-project.org')
install.packages("skimr", repos='http://cran.us.r-project.org')
install.packages("wooldridge", repos='http://cran.us.r-project.org')
You’ve just installed three packages; that is, you’ve downloaded them onto your computer. Just like with other software on your computer, you only need to do the installation once. However, you still need to tell R that you will be using the packages each time you start a new R session. Add the following three lines of code to your script (below the three lines you just wrote). Notice that there are no quotation marks inside the parentheses this time.
library(tidyverse)
library(skimr)
library(wooldridge)
To execute the script, click on the word “Source” in the top-right corner of the top-left window pane. This takes what is in your script and sends it to the console, as if you had typed it there directly.
To save the script, click on the disk icon at the top of your script pane (but not the disk icon at the very top of RStudio). Name your script ICL1_XYZ.R, where XYZ are your initials.
Now, put a hashtag (#) in front of the three install.packages() lines in your script, like so:
#install.packages("tidyverse")
#install.packages("skimr")
#install.packages("wooldridge")
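The hashtag is how you tell R not to run the code in your script: R ignores anything that follows a hashtag on a line. This is known as “commenting” your code. Since the packages are now installed, commenting out these lines keeps R from reinstalling them every time you run the script.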
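Commenting out the install lines is all you need for this lab. As an aside, another common pattern is to check which packages are missing before installing anything. The sketch below is only illustrative: the helper name missing_packages is made up for this example, and it only reports missing packages rather than installing them.

```r
# Hypothetical helper (not part of the lab): report which packages
# from a character vector are not yet installed on this computer.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "skimr", "wooldridge")
to_install <- missing_packages(needed)
to_install  # character(0) means everything is already installed

# You could then install only what is missing:
# install.packages(to_install)
```

The install.packages() call is left commented out on purpose, so running the sketch never downloads anything.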
At the very top of your script, type the names of your group members with a hashtag in front.
From now on, add all of the code you see to your script.
Now that you’ve got some of the basics of R, let’s look at some data!
We’re going to load a data set from the wooldridge package. The data set is called wage1.
df <- as_tibble(wage1)
What we did there was convert it to a tibble, which is a nice format for data sets (see Ch. 10 of @r4ds). We called the converted tibble df, but you can call it whatever you want: mydata, data123, whatever.
If you re-run your script (by clicking Source or, better yet, by clicking the little arrow next to Source and choosing Source with echo from the menu that opens), you’ll see something new in the “Environment” window (top-right): df now appears under the “Data” heading.
Double-click on df in the Environment window. This will show you your data in a format similar to an Excel spreadsheet. You can use this to easily scan the data to make sure things look reasonable.
Let’s look at the summary statistics of one of our variables. Suppose we want to know: What is the average number of years of education in our sample?
skim(df$educ)
Name | df$educ |
Number of rows | 526 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 0 | 1 | 12.56 | 2.77 | 0 | 12 | 12 | 14 | 18 | ▁▁▂▇▃ |
The $ means that we are looking at the educ column in the tibble df.
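To see what $ does on its own, here is a minimal sketch using a small data frame invented for illustration (the name toy and its values are not from wage1). It uses only base R, so it runs without any packages.

```r
# Toy data frame invented for illustration (not the wage1 data).
toy <- data.frame(educ = c(12, 16, 10), wage = c(5.5, 9.0, 4.2))

toy$educ        # the educ column as a plain numeric vector
mean(toy$educ)  # functions like mean() can then be applied to it
toy[["educ"]]   # an equivalent way to extract the same column
```

The same syntax works on a tibble such as df.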
In your script, write the value of the mean of educ as a comment (that is, preceded by the hashtag symbol).
What fraction of the sample is composed of women?
mean(df$female)
## [1] 0.4790875
# or
skim(df) #then look at 'female' column
Name | df |
Number of rows | 526 |
Number of columns | 24 |
_______________________ | |
Column type frequency: | |
numeric | 24 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
wage | 0 | 1 | 5.90 | 3.69 | 0.53 | 3.33 | 4.65 | 6.88 | 24.98 | ▇▅▁▁▁ |
educ | 0 | 1 | 12.56 | 2.77 | 0.00 | 12.00 | 12.00 | 14.00 | 18.00 | ▁▁▂▇▃ |
exper | 0 | 1 | 17.02 | 13.57 | 1.00 | 5.00 | 13.50 | 26.00 | 51.00 | ▇▃▂▂▁ |
tenure | 0 | 1 | 5.10 | 7.22 | 0.00 | 0.00 | 2.00 | 7.00 | 44.00 | ▇▁▁▁▁ |
nonwhite | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
female | 0 | 1 | 0.48 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
married | 0 | 1 | 0.61 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▅▁▁▁▇ |
numdep | 0 | 1 | 1.04 | 1.26 | 0.00 | 0.00 | 1.00 | 2.00 | 6.00 | ▇▂▁▁▁ |
smsa | 0 | 1 | 0.72 | 0.45 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▃▁▁▁▇ |
northcen | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.75 | 1.00 | ▇▁▁▁▃ |
south | 0 | 1 | 0.36 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
west | 0 | 1 | 0.17 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
construc | 0 | 1 | 0.05 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
ndurman | 0 | 1 | 0.11 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
trcommpu | 0 | 1 | 0.04 | 0.20 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
trade | 0 | 1 | 0.29 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
services | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
profserv | 0 | 1 | 0.26 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▃ |
profocc | 0 | 1 | 0.37 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
clerocc | 0 | 1 | 0.17 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
servocc | 0 | 1 | 0.14 | 0.35 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
lwage | 0 | 1 | 1.62 | 0.53 | -0.63 | 1.20 | 1.54 | 1.93 | 3.22 | ▁▁▇▅▁ |
expersq | 0 | 1 | 473.44 | 616.04 | 1.00 | 25.00 | 182.50 | 676.00 | 2601.00 | ▇▂▁▁▁ |
tenursq | 0 | 1 | 78.15 | 199.43 | 0.00 | 0.00 | 4.00 | 49.00 | 1936.00 | ▇▁▁▁▁ |
# or
skim(df$female)
Name | df$female |
Number of rows | 526 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 0 | 1 | 0.48 | 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▇ |
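Why does the mean answer a question about a fraction? Because female is coded as 0 or 1, its mean is the number of ones divided by the number of observations. A minimal sketch with a made-up indicator vector (the values are invented for illustration, not taken from wage1):

```r
# Toy 0/1 indicator invented for illustration: 1 = female, 0 = male.
female <- c(1, 0, 0, 1, 1, 0, 0, 0)

mean(female)                       # share of ones
sum(female == 1) / length(female)  # same share, computed by counting
```

Both lines return the same number, which is why mean(df$female) gives the fraction of women in the sample.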
Suppose you want to visualize the entire distribution of education. You would use the following code:
ggplot(df, aes(educ)) + geom_histogram(binwidth=1)
In a comment, write the most common value of education (the mode of the distribution) below the code in your script.
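If you would like to double-check your reading of the histogram, one possible approach is to count how often each value occurs. The sketch below uses a made-up vector named educ_toy (invented for illustration, not the wage1 data) and base R’s table() function; this is an aside and not the method the lab asks for.

```r
# Toy education values invented for illustration (not the wage1 data).
educ_toy <- c(12, 12, 16, 12, 14, 16, 10)

counts <- table(educ_toy)         # how many times each value appears
counts
names(counts)[which.max(counts)]  # the most frequent value, returned as text
```

Replacing educ_toy with df$educ would apply the same count to the lab data.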
Repeat the previous two code snippets, but this time use the wage variable instead of the educ variable.
skim(df$wage)
Name | df$wage |
Number of rows | 526 |
Number of columns | 1 |
_______________________ | |
Column type frequency: | |
numeric | 1 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
data | 0 | 1 | 5.9 | 3.69 | 0.53 | 3.33 | 4.65 | 6.88 | 24.98 | ▇▅▁▁▁ |
df %>%
ggplot(aes(wage)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Suppose you want to add a new variable to df. For example, the wage variable is expressed in 1976 dollars, and you want to know what the wage would be in today’s dollars. (Note: the CPI implies that $1.00 in 1976 equals $4.53 today.)
df <- df %>%
mutate(realwage=wage*4.53)
summary(df$realwage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.401 15.085 21.064 26.709 31.166 113.159
You can verify that realwage got added to df by clicking on the preview of df and scrolling all the way over to the right.
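You can also check the new variable by hand. The smallest and largest values of wage reported by skim() earlier are 0.53 and 24.98, so multiplying them by 4.53 should reproduce the Min. and Max. of realwage. A minimal sketch of that arithmetic (the vector name wage_1976 is made up for this example):

```r
# Minimum and maximum of wage in 1976 dollars, copied from the skim() output.
wage_1976 <- c(min = 0.53, max = 24.98)

# Multiply by the CPI factor used in the lab.
wage_1976 * 4.53
```

The results, rounded to three decimals, should match the 2.401 and 113.159 shown by summary(df$realwage).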
For more information on mutate(), see section 5.5 of @r4ds. You can also drop a variable by typing df <- df %>% mutate(realwage=NULL).
Suppose we want to get rid of all men in our data. (For example, maybe we’re doing research on women’s labor force participation.) To do this, we use the filter() function.
To use filter(), you provide a condition that an observation must satisfy to be kept. Add the following to your script:
summary(df$female)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4791 1.0000 1.0000
df <- df %>%
dplyr::filter(female==1)
summary(df$female)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 1 1 1 1
We told R to keep observations where female equals 1. You can see that, prior to the filter() statement, women were 48% of the data. Now, they are 100%. We can thus verify that filter() did what we asked it to.
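It can help to see what the condition female==1 actually produces: a TRUE or FALSE for every row. The sketch below uses a small data frame invented for illustration (not wage1) and base R subsetting, which behaves roughly like filter() when there are no missing values.

```r
# Toy data frame invented for illustration (not the wage1 data).
toy <- data.frame(female = c(1, 0, 1, 0), wage = c(4.0, 6.5, 5.2, 7.1))

toy$female == 1                  # TRUE for rows to keep, FALSE for rows to drop
women <- toy[toy$female == 1, ]  # base R counterpart of filter(female == 1)
nrow(women)                      # number of rows that survived
mean(women$female)               # every remaining row has female equal to 1
```

Note the double equals sign: == tests for equality, whereas a single = would be read as an assignment or argument name.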
A common occurrence in cross-sectional observational data is missing values. For example, someone leaves a survey question blank, or the wage of someone who is unemployed is not defined. In R, missing values are stored as NA (meaning “not available”). To drop observations with NA values, use the is.na() function inside a filter() condition:
summary(df$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.530 3.000 3.750 4.588 5.510 21.630
df <- df %>%
dplyr::filter(!is.na(wage))
summary(df$wage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.530 3.000 3.750 4.588 5.510 21.630
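In this case, we want R to keep the non-missing wage observations, so we put ! in front of is.na(). ! is generally accepted notation for logical negation, i.e. !TRUE equals FALSE. The two summaries above are identical because wage has no missing values in this data set (n_missing is 0 in the skim() output), so the filter had nothing to drop.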
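To see is.na() and ! in action on data that really does have missing values, here is a minimal sketch with a made-up vector (wage_toy and its values are invented for illustration, not taken from wage1):

```r
# Toy wage vector invented for illustration; NA marks a missing value.
wage_toy <- c(5.5, NA, 4.2, NA, 9.0)

is.na(wage_toy)             # TRUE where the value is missing
!is.na(wage_toy)            # TRUE where the value is present
wage_toy[!is.na(wage_toy)]  # keep only the non-missing values
sum(is.na(wage_toy))        # number of missing values
```

Inside filter(), the condition !is.na(wage) plays the same role as !is.na(wage_toy) does here: rows where it is TRUE are kept.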
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. http://r4ds.had.co.nz.