1 Introduction to R and basic data cleaning

The purpose of this in-class lab is to familiarize yourself with R and RStudio. The lab should be completed in your group. To get credit, upload your .R script to the appropriate place on Canvas.

1.1 First steps

Open RStudio on your own or a group member’s laptop.
Click File > New File > R script.

You will see a blank section of screen in the top-left of your RStudio window. This is where you will write your first R script.

1.1.1 Console

The bottom-left of the screen has a tab called “Console”. This is basically a very fancy calculator.

Try the calculator by typing something like

2+2

## [1] 4

Or even something fancier like

sqrt(pi)

## [1] 1.772454

1.1.2 Packages

R makes extensive use of third-party packages. We won’t get into the details right now, but for this class, you will need to install a few of these. Installing packages is quite easy. Type the following three lines of code at the very top of your script:

install.packages("tidyverse", repos='http://cran.us.r-project.org')
install.packages("skimr", repos='http://cran.us.r-project.org')
install.packages("wooldridge", repos='http://cran.us.r-project.org')

You’ve just installed three packages. Basically, you’ve downloaded them onto your computer. Just like with other software on your computer, you only need to do the installation once. However, you still need to tell R that you will be using the packages each time you start a new session. Add the following three lines of code to your script (below the three lines you wrote above). Notice that there are no quotation marks inside the parentheses this time.

library(tidyverse)
library(skimr)
library(wooldridge)

1.1.3 Running a script

To execute the script, click on the word “Source” in the top-right corner of the top-left window pane. This will take what is in your script and automatically send it to the console (as if you had typed it directly into the console).

To save the script, click on the disk icon at the top of your script pane (but not the disk icon at the very top of RStudio). Name your script ICL1_XYZ.R where XYZ are your initials.

1.1.4 Commenting

Now, put a hashtag (#) in front of each of the three install.packages() lines in your script, like so:

#install.packages("tidyverse")
#install.packages("skimr")
#install.packages("wooldridge")

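The hashtag tells R to ignore everything after it on that line, so the code is not run. This is known as “commenting out” your code. Since the packages are already installed, there is no need to reinstall them every time you run the script.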
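As a small illustration of how comments behave, here is a sketch using made-up values (the variable name x is only for this example and is not part of the lab):

```r
# A whole line that starts with a hashtag is ignored by R.
x <- 2 + 2  # anything after a hashtag on the same line is ignored too

# The next line is commented out, so it never runs and x keeps its value.
# x <- 100

print(x)
```

Removing the hashtag from a line "uncomments" it, so the code runs again the next time you source the script.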

At the very top of your script, type the names of your group members with a hashtag in front.

From now on, add all of the code you see to your script.

2.1 Exploring data

Now that you’ve got some of the basics of R, let’s look at some data!

2.1.1 Loading data

We’re going to load a data set from the wooldridge package. The data set is called wage1.

df <- as_tibble(wage1)

What we did there was convert wage1 to a tibble, which is a nice format for data sets (see Ch. 10 of Wickham and Grolemund 2017). We called the converted tibble df, but you can call it whatever you want: mydata, data123, whatever.
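To see what the conversion does without relying on the wooldridge package, here is a minimal sketch that applies the same function to mtcars, a data set that ships with R. The name cars_tbl is arbitrary and chosen only for this example.

```r
library(tibble)  # installed as part of the tidyverse

# mtcars is an ordinary data frame that comes with R
cars_tbl <- as_tibble(mtcars)

# A tibble is still a data frame, with some extra classes on top
class(cars_tbl)

# Number of rows and columns
dim(cars_tbl)

# Printing a tibble shows only the first rows and the type of each column
print(cars_tbl)
```

The compact printing is one reason tibbles are convenient: typing the name of a large data set does not flood the console.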

2.1.2 Browsing

If you re-run your script (by clicking Source or, better yet, by clicking the little arrow next to Source and choosing “Source with Echo” from the menu), you’ll see something new in the “Environment” window (top-right). It says df under the “Data” heading.

Double-click on df in the Environment window. This will show you your data in a format similar to an Excel spreadsheet. You can use this to easily scan the data to make sure things look reasonable.

2.1.3 Summary statistics

Let’s look at the summary statistics of one of our variables. Suppose we want to know: What is the average years of education in our sample?

skim(df$educ)

Data summary
Name	df$educ
Number of rows	526
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	12.56	2.77	0	12	12	14	18	▁▁▂▇▃

The $ means that we are looking at the educ column in the tibble df.
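The following sketch shows the $ operator on a tiny made-up data frame (toy and its values are hypothetical, not taken from wage1). It also shows that double square brackets pull out the same column.

```r
# A small hypothetical data set with two columns
toy <- data.frame(
  educ   = c(12, 16, 12, 14),
  female = c(1, 0, 1, 0)
)

# Two equivalent ways to extract the educ column as a vector
educ_dollar   <- toy$educ
educ_brackets <- toy[["educ"]]

# TRUE: both approaches return the same vector
identical(educ_dollar, educ_brackets)

# Once extracted, the column can be passed to any function
mean(toy$educ)
```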

In your script, write the mean of educ in a comment (a line that starts with the hashtag symbol).

What fraction of the sample is composed of women?

mean(df$female)

## [1] 0.4790875

# or
skim(df) #then look at 'female' column

Data summary
Name	df
Number of rows	526
Number of columns	24
_______________________
Column type frequency:
numeric	24
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
wage	1	5.90	3.69	0.53	3.33	4.65	6.88	24.98	▇▅▁▁▁
educ	1	12.56	2.77	0.00	12.00	12.00	14.00	18.00	▁▁▂▇▃
exper	1	17.02	13.57	1.00	5.00	13.50	26.00	51.00	▇▃▂▂▁
tenure	1	5.10	7.22	0.00	0.00	2.00	7.00	44.00	▇▁▁▁▁
nonwhite	1	0.10	0.30	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
female	1	0.48	0.50	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▇
married	1	0.61	0.49	0.00	0.00	1.00	1.00	1.00	▅▁▁▁▇
numdep	1	1.04	1.26	0.00	0.00	1.00	2.00	6.00	▇▂▁▁▁
smsa	1	0.72	0.45	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
northcen	1	0.25	0.43	0.00	0.00	0.00	0.75	1.00	▇▁▁▁▃
south	1	0.36	0.48	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▅
west	1	0.17	0.38	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
construc	1	0.05	0.21	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
ndurman	1	0.11	0.32	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
trcommpu	1	0.04	0.20	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
trade	1	0.29	0.45	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▃
services	1	0.10	0.30	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
profserv	1	0.26	0.44	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▃
profocc	1	0.37	0.48	0.00	0.00	0.00	1.00	1.00	▇▁▁▁▅
clerocc	1	0.17	0.37	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▂
servocc	1	0.14	0.35	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
lwage	1	1.62	0.53	-0.63	1.20	1.54	1.93	3.22	▁▁▇▅▁
expersq	1	473.44	616.04	1.00	25.00	182.50	676.00	2601.00	▇▂▁▁▁
tenursq	1	78.15	199.43	0.00	0.00	4.00	49.00	1936.00	▇▁▁▁▁

# or
skim(df$female)

Data summary
Name	df$female
Number of rows	526
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	0.48	0.5	0	0	0	1	1	▇▁▁▁▇
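Why does the mean answer a question about a fraction? Because female is coded 0 or 1, its mean is the number of ones divided by the number of observations. A minimal sketch with made-up values (female_toy is hypothetical):

```r
# Hypothetical 0/1 indicator: 1 = woman, 0 = man
female_toy <- c(1, 0, 1, 0, 0, 1, 1, 0)

# Share of women computed by counting
share_by_count <- sum(female_toy == 1) / length(female_toy)

# Share of women computed as the mean of the indicator
share_by_mean <- mean(female_toy)

# Both give the same number
share_by_count
share_by_mean
```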

2.1.4 Visualization

Suppose you want to visualize the entire distribution of education. You would use the following code:

ggplot(df, aes(educ)) + geom_histogram(binwidth=1)

In a comment, write the most common value of education (the mode of the distribution) below the code in your script.
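If you would like to double-check a mode that you read off a histogram, one possible approach is to count how often each value occurs. This is only a sketch on made-up values (educ_toy is hypothetical); base R has no built-in function for the statistical mode, so the code uses table() and which.max().

```r
# Hypothetical years of education for six people
educ_toy <- c(12, 12, 16, 14, 12, 16)

# Count how many times each distinct value appears
counts <- table(educ_toy)
counts

# which.max() returns the position of the largest count
# (the first one if there is a tie); names() gives the value itself
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_value
```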

Repeat the previous two code snippets, but this time use the wage variable instead of the educ variable.

skim(df$wage)

Data summary
Name	df$wage
Number of rows	526
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	5.9	3.69	0.53	3.33	4.65	6.88	24.98	▇▅▁▁▁

df %>% 
  ggplot(aes(wage)) +
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
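The message above appears because no bin width was supplied, so ggplot2 fell back on 30 bins. A sketch of how to set the width yourself is below. The data are made up (wage_toy is hypothetical), and the choice of binwidth = 1, meaning one-dollar bins, is an assumption for illustration rather than a recommended value.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

# Hypothetical hourly wages
wage_toy <- data.frame(wage = c(3.10, 3.24, 4.65, 5.90, 6.88, 24.98))

# Supplying binwidth removes the message about bins = 30
p <- ggplot(wage_toy, aes(wage)) +
  geom_histogram(binwidth = 1)

# Printing the object draws the plot
print(p)
```

Trying a few different widths is a good habit, since the shape of a histogram can change quite a bit with the bin size.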

2.1.5 Creating a new variable

Suppose you want to add a new variable to df. For example, the wage variable is expressed in 1976 dollars, and you want to know what the wage would be in today’s dollars. (Note: the CPI implies that $1.00 in 1976 equals $4.53 today.)

df <- df %>% 
  mutate(realwage=wage*4.53)
summary(df$realwage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.401  15.085  21.064  26.709  31.166 113.159

You can verify that realwage got added to df by clicking on the preview of df and scrolling all the way over to the right.

For more information on mutate(), see section 5.5 of Wickham and Grolemund (2017). You can also drop a variable by typing df <- df %>% mutate(realwage=NULL)
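Here is a self-contained sketch of adding and then dropping a variable, using a tiny made-up data set (toy, with_real and without_real are hypothetical names). It drops the column with select(), which is an alternative to the mutate(realwage=NULL) approach mentioned above.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

# Hypothetical wages in 1976 dollars
toy <- data.frame(wage = c(3.10, 5.00, 10.00))

# Add a new column; the existing columns are left unchanged
with_real <- toy %>%
  mutate(realwage = wage * 4.53)
with_real

# Drop the column again; the minus sign means "everything except"
without_real <- with_real %>%
  select(-realwage)
names(without_real)
```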

2.1.6 Dropping observations

Suppose we want to get rid of all men in our data. (For example, maybe we’re doing research on women’s labor force participation.) To do this, we use the filter() function.

To use filter(), you provide conditions for keeping a specific observation. Add the following to your script:

summary(df$female)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4791  1.0000  1.0000

df <- df %>% 
  dplyr::filter(female==1)
summary(df$female)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1

We told R to keep observations where female equals 1. You can see that, prior to the filter() statement, women were 48% of the data. Now, they are 100%. We can thus verify that filter() did what we asked it to.
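The same idea on a small made-up data set is sketched below (toy, women and women_high are hypothetical names). It also shows that several conditions can be separated by commas, in which case an observation is kept only if all of them are true.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

# Hypothetical data: five people, three of them women
toy <- data.frame(
  female = c(1, 0, 1, 0, 1),
  wage   = c(4, 6, 3, 8, 5)
)

# Keep only the rows where female equals 1
women <- toy %>%
  filter(female == 1)

# Compare the number of rows before and after
nrow(toy)
nrow(women)

# Two conditions at once: women who earn more than 3.50
women_high <- toy %>%
  filter(female == 1, wage > 3.5)
women_high
```

Saving the result under a new name, as done here, keeps the original data available. Overwriting df, as in the lab code, is fine when you are sure you no longer need the dropped observations.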

2.1.7 Missing values

A common occurrence in cross-sectional observational data is missing values. For example, someone leaves a survey question blank, or the wage of someone who is unemployed is not defined. In R, missing values are stored as NA (meaning “not available”). To drop observations with NA values, use the is.na() function inside filter():

summary(df$wage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.530   3.000   3.750   4.588   5.510  21.630

df <- df %>% 
  dplyr::filter(!is.na(wage))
summary(df$wage)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.530   3.000   3.750   4.588   5.510  21.630

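In this case, we want R to keep non-missing wage observations, so we put ! in front of is.na(). ! is R’s operator for logical negation, i.e. !TRUE equals FALSE. The two summaries above are identical because wage has no missing values in this data set, so no observations were dropped.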
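To see is.na() and ! actually do something, here is a sketch on a made-up vector that does contain missing values (wage_toy and wage_clean are hypothetical names):

```r
# Hypothetical wages with two missing values
wage_toy <- c(5.5, NA, 3.2, NA, 7.1)

# TRUE where the value is missing, FALSE otherwise
is.na(wage_toy)

# Count the missing values (TRUE counts as 1)
n_missing <- sum(is.na(wage_toy))
n_missing

# The mean of a vector that contains NA is itself NA ...
mean(wage_toy)

# ... unless we ask R to remove missing values first
mean(wage_toy, na.rm = TRUE)

# Keep only the non-missing values, using ! to flip TRUE and FALSE
wage_clean <- wage_toy[!is.na(wage_toy)]
wage_clean
```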

2.2 References

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media. http://r4ds.had.co.nz.

LJJ

2020/3/28
