The purpose of this in-class lab is to use R to practice with difference-in-differences. To get credit, upload your .R script to the appropriate place on Canvas.
Open up a new R script (named ICL16_XYZ.R
, where XYZ
are your initials) and add the usual “preamble” to the top:
library(tidyverse)
library(wooldridge)
library(broom)
library(magrittr)
library(modelsummary)
library(estimatr)
Our data set will be a pooled cross section of workers in Kentucky and Michigan, as analyzed by Meyer, Viscusi, and Durbin (1995). Load the data from the wooldridge
package and restrict to those living in Kentucky:
df <- as_tibble(injury)
df %<>% filter(ky==1)
In 1980, Kentucky raised its cap on weekly earnings that were covered by worker’s compensation.1 The outcome variable is ldurat
, which is the log duration (in weeks) of worker’s compensation benefits. The policy was such that the cap increase did not affect low-earnings workers, but did affect high-earnings workers. Thus, low-earnings workers serve as the control group, while high-earnings workers serve as the treatment group.
We are interested to know if the policy caused workers to spend more time out of work. If benefits are not generous enough, then workers may sue the company for on-the-job injuries. On the other hand, benefits that are too generous may induce workers to be more reckless on the job, or to claim that off-the-job injuries were incurred while at work.
In difference-in-differences settings, it is often helpful to see if there even is a policy effect. Let’s compute the difference-in-differences (with no regression) based on the pre- and post-means for the treatment and control groups
df %>% group_by(afchnge,highearn) %>% summarize(mean.ldurat = mean(ldurat))
The basic regression model to analyze the policy’s impact is \[ \log(durat) = \beta_0 + \delta_0 afchnge + \beta_1 highearn + \delta_1 afchnge \cdot highearn + u \]
Estimate this model, calling it est.did
. I suppress the code here, since this should be old hat by now. (Hint: recall that, for two dummy variables x1
and x2
, putting x1*x2
in the lm()
formula will include in the formula each dummy and the interaction of the two.)
We might want to control for other aspects of our workers. For example, perhaps claims made by construction or manufacturing workers tend to have longer duration than claims made workers in other industries. Or maybe those claiming back injuries tend to have longer claims than those claiming head injuries. One may also want to control for worker demographics such as gender, marital status, and age.
Estimate an expanded version of the basic regression model, where the following additional variables are included:
male
married
age
hosp
(1 = hospitalized)indust
(1 = manuf, 2 = construc, 3 = other)injtype
(1-8; categories for different types of injury)lprewage
(log of wage prior to filing a claim)Be sure to format indust
and injtyp
as factors before proceeding. Call your model est.did.x
.
Calling modelsummary
can help view the results:
modelsummary(list(est.did,est.did.x),output="latex")
How different is your estimate of \(\hat{\delta}_1\) from that in (2)?
Do you see any interesting patterns among the \(x\)’s in predicting log duration of benefits?