The purpose of this in-class lab is to familiarize yourself with R and RStudio. The lab may be completed as a group, but each student should turn in their own work. To get credit, upload your .R script to the appropriate place on Canvas.
You will see a blank section of screen in the top-left of your RStudio window. This is where you will write your first R script.
The bottom-left of the screen has a tab called “Console.” This is basically a very fancy calculator.
Try the calculator by typing something like
Or even something fancier like
R makes extensive use of third-party packages. We won’t get into the details right now, but for this class, you will need to install a few of these. Installing packages is quite easy. Type the following two lines of code at the very top of your script:
install.packages("tidyverse", repos='http://cran.us.r-project.org') install.packages("modelsummary", repos='http://cran.us.r-project.org') install.packages("wooldridge", repos='http://cran.us.r-project.org')
You’ve just installed three packages. Basically, you’ve downloaded them onto your computer. Just like with other software on your computer, you only need to do the installation once. However, you still need to tell R that you will be using the packages. Add the following two lines of code to your script (below the first two lines you wrote). Notice how there are no quotation marks inside the parentheses this time.
library(tidyverse) library(modelsummary) library(wooldridge)
To execute the script, click on the word “Source” in the top-right corner of the top-left window pane. This will take what is in your script and automatically send it to the console (as if you typed it directly into the console)
To save the script, click on the disk icon at the top of your script pane (but not the disk icon at the very top of RStudio). Name your script
XYZ are your initials.
Now that you’ve got some of the basics of R, let’s look at some data!
We’re going to load a data set from the
wooldridge package. The data set is called
df <- as_tibble(wage1)
What we did there was convert it to a
tibble, which is a nice format for data sets (see Ch. 10 of Wickham and Grolemund (2017)). We called the converted tibble
df, but you can call it whatever you want:
If you re-run your script (by clicking
Source … or better yet, click on the little arrow next to
Source which opens a menu that you can click
Source with echo), you’ll see something new in the “Environment” window (top-right). It says
df under the “Data” heading.
df in the Environment window. This will show you your data in a format similar to an Excel spreadsheet. You can use this to easily scan the data to make sure things look reasonable.
Let’s look at the summary statistics of one of our variables. Suppose we want to know: What is the average years of education in our sample?
datasummary_skim(df, histogram=FALSE) #then look at 'educ' column
In your script, write the value of the Mean of
educ, preceded by a comment (the hashtag symbol)
What fraction of the sample is composed of women?
mean(df$female) # or datasummary_skim(df, histogram=FALSE) #then look at 'female' column
Suppose you want to visualize the entire distribution of education. You would use the following code:
ggplot(df, aes(educ)) + geom_histogram(binwidth=1)+theme_classic()
In a comment, write the most common value of education (the mode of the distribution) below the code in your script.
Repeat the previous two code snippets, but this time use the
wage variable instead of the
Suppose you want to add a new variable to
df. For example, the wage variable is expressed in 1976 dollars, and you want to know what the wage would be in today’s dollars. (Note: the CPI implies that $1.00 in 1976 equals $4.53 today)
df <- df %>% mutate(realwage=wage*4.53) summary(df$realwage) datasummary_skim(df, histogram=FALSE) # then look at realwage column
You can verify that
realwage got added to
df by clicking on the preview of
df and scrolling all the way over to the right.
For more information on
mutate(), see section 5.5 of Wickham and Grolemund (2017). You can also drop a variable by typing
df <- df %>% mutate(realwage=NULL)
Suppose we want to get rid of all men in our data. (For example, maybe we’re doing research on women’s labor force participation.) To do this, we use the
filter(), you provide conditions for keeping a specific observation. Add the following to your script:
summary(df$female) df <- df %>% filter(female==1) summary(df$female)
We told R to keep observations where
female equals 1. You can see that, prior to the
filter() statement, women were 48% of the data. Now, they are 100%. We can thus verify that
filter() did what we asked it to.
A common occurrence in cross-sectional observational data is missing values. For example, someone leaves blank a question in a survey. Or the wage of someone who is unemployed is not defined. In R, missing values are stored as
NA (meaning “not applicable”). To drop
NA observations, use the
summary(df$wage) df <- df %>% drop_na(wage) summary(df$wage)