```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, results = 'hide', fig.keep = 'none')
```

The purpose of this in-class lab is to use R to practice estimating time series regression models with standard errors corrected for heteroskedasticity and serial correlation (HAC). To get credit, upload your .R script to the appropriate place on Canvas.

## For starters
First, install the `pdfetch`, `tsibble`, `sandwich`, and `COVID19` packages. `pdfetch` stands for "Public Data Fetch" and is a slick way of downloading statistics on stock prices, GDP, inflation, unemployment, etc. `tsibble` is a package useful for working with time series data; it is the "tibble" for time series data. `sandwich` is helpful for obtaining HAC standard errors. `COVID19` pulls up-to-date data on COVID-19 cases, deaths, etc.

Open up a new R script (named `ICL10_XYZ.R`, where `XYZ` are your initials) and add the usual "preamble" to the top:

```{r message=FALSE, warning=FALSE, paged.print=FALSE}
# Add names of group members HERE
library(tidyverse)
library(wooldridge)
library(modelsummary)
library(broom)
library(car)
library(magrittr)
library(lmtest)
library(zoo)       # provides rollmean(), used for the rolling averages below
# new packages installed today:
library(pdfetch)
library(sandwich)
library(tsibble)
library(COVID19)
```

### Load the data
We're going to use data on US COVID-19 cases, deaths, and other information.

```{r}
df    <- covid19(c("US"))
df.ts <- as_tsibble(df, key=id, index=date)
```

Now it will be easy to include lags of various variables in our regression models.

## Plot time series data
Let's have a look at the data on **new** daily cases and deaths:

```{r}
df.ts %<>% mutate(new_deaths = difference(deaths),    # data comes in cumulative format;
                  new_tests  = difference(tests),     # difference() converts it to
                  new_cases  = difference(confirmed)) # "new cases," "new deaths," etc.

# plots
ggplot(df.ts, aes(date, new_cases))  + geom_line()
ggplot(df.ts, aes(date, new_deaths)) + geom_line()

# plots with 7-day rolling average
ggplot(df.ts, aes(date, new_cases))  + geom_line(aes(y=rollmean(new_cases,  7, na.pad=TRUE)))
ggplot(df.ts, aes(date, new_deaths)) + geom_line(aes(y=rollmean(new_deaths, 7, na.pad=TRUE)))
```

## Determinants of US COVID-19 cases
Now let's estimate the following regression model:

\[
\log(new\_cases_{t}) = \beta_0 + \beta_1 gath_t + \beta_2 gath_{t-7} + \beta_3 gath_{t-14} + \beta_4 \log(new\_cases_{t-7}) + u_t
\]

where $new\_cases$ is the number of new COVID cases and $gath$ is a variable taking on values 0--4 representing the severity of gathering restrictions.

```{r}
df.ts %<>% mutate(log.new.cases = log(new_cases),
                  log.new.cases = replace(log.new.cases, new_cases==0, NA_real_)) # since log(0) = -Inf

est <- lm(log.new.cases ~ gatherings_restrictions +
                          lag(gatherings_restrictions,7) +
                          lag(gatherings_restrictions,14) +
                          lag(log.new.cases,7),
          data=df.ts)
```

1. Are any of these variables significant determinants of new COVID cases? If so, which ones?

## Correcting for Serial Correlation
Now let's compute HAC (Heteroskedasticity and Autocorrelation Consistent) standard errors. To do so, we'll use the `vcov` option of the `modelsummary()` function along with the `NeweyWest` function from the `sandwich` package.[^1]

[^1]: You may need to install the `sandwich` package.

```{r}
modelsummary(est) # re-display baseline results
modelsummary(est, vcov=sandwich::NeweyWest(est))
```

or, putting them side-by-side:

```{r}
modelsummary(list(est,est), vcov=list("iid", sandwich::NeweyWest(est)))
```

2. How does your interpretation of the effect of gathering restrictions change after using the Newey-West standard errors?
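If you'd rather see the HAC-corrected t-statistics and p-values directly instead of in a regression table, `coeftest()` from the already-loaded `lmtest` package accepts the same Newey-West covariance matrix. This is a minimal sketch; the `lag = 14` below is our illustrative assumption (two weeks of daily data), not a prescribed setting, and if you omit it `NeweyWest()` chooses the truncation lag automatically.

```{r}
# A sketch of the same Newey-West correction, displayed as a coefficient test.
# lag = 14 is an illustrative assumption (two weeks of daily data);
# omitting it lets NeweyWest() select the truncation lag automatically.
coeftest(est, vcov. = sandwich::NeweyWest(est, lag = 14))
```

Comparing this output with the `modelsummary()` table above is a quick way to confirm which coefficients gain or lose significance under the correction.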
## Oklahoma COVID cases
Let's repeat the same exercise using state-level (level 2) data for Oklahoma:

```{r}
df.ok    <- covid19(c("US"), level=2) %>% filter(administrative_area_level_2=="Oklahoma")
df.ts.ok <- as_tsibble(df.ok, key=id, index=date)

df.ts.ok %<>% mutate(new_deaths = difference(deaths),
                     new_tests  = difference(tests),
                     new_cases  = difference(confirmed))

ggplot(df.ts.ok, aes(date, new_cases))  + geom_line(aes(y=rollmean(new_cases,  7, na.pad=TRUE)))
ggplot(df.ts.ok, aes(date, new_deaths)) + geom_line(aes(y=rollmean(new_deaths, 7, na.pad=TRUE)))

df.ts.ok %<>% mutate(log.new.cases = log(new_cases),
                     log.new.cases = replace(log.new.cases, new_cases==0, NA_real_)) # since log(0) = -Inf

est.ok <- lm(log.new.cases ~ gatherings_restrictions +
                             lag(gatherings_restrictions,7) +
                             lag(gatherings_restrictions,14) +
                             lag(log.new.cases,7),
             data=df.ts.ok)
```

With HAC standard errors:

```{r}
modelsummary(list(est.ok, est.ok), vcov=list("iid", sandwich::NeweyWest(est.ok)))
```

## Traditional macroeconomic model of interest rates and inflation

### Load the data
We can also use data on US macroeconomic indicators. The `wooldridge` data set is called `intdef`.

```{r}
df.ts <- as_tsibble(intdef, key=NULL, index=year)
```

Now it will be easy to include lags of various variables in our regression models.

## Plot time series data
Let's have a look at the inflation rate for the US over the period 1948--2003:

```{r}
ggplot(df.ts, aes(year, inf)) + geom_line()
```

## Determinants of the interest rate
Now let's estimate the following regression model:

\[
i3_{t} = \beta_0 + \beta_1 inf_t + \beta_2 inf_{t-1} + \beta_3 inf_{t-2} + \beta_4 def_{t} + u_t
\]

where $i3$ is the 3-month Treasury bill interest rate, $inf$ is the inflation rate (as measured by the CPI), and $def$ is the budget deficit as a percentage of GDP.

```{r}
est <- lm(i3 ~ inf + lag(inf,1) + lag(inf,2) + def, data=df.ts)
modelsummary(list(est,est), vcov=list("iid", sandwich::NeweyWest(est)))
```
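Before leaning on the Newey-West column, it's worth checking whether serial correlation is actually present in this model's errors. Here is a minimal sketch using the Breusch-Godfrey and Durbin-Watson tests from the already-loaded `lmtest` package; the `order = 2` choice is our assumption, mirroring the two inflation lags in the model:

```{r}
# A sketch of serial-correlation diagnostics for the interest-rate model.
# order = 2 is an illustrative assumption (matches the two inflation lags).
bgtest(est, order = 2)  # Breusch-Godfrey LM test for AR(2) errors
dwtest(est)             # Durbin-Watson test for AR(1) errors
```

A significant statistic from either test supports reporting the HAC column rather than the iid one.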