Regression models partition variance, separating the variation in the outcome into variation explained by the predictors in our model and the remaining variation not explained by these predictors.
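As a minimal sketch of this partition, the example below uses simulated data (the variables `x` and `y` are hypothetical and are not from the lecture's data) to check that the total variation in the outcome equals the explained variation plus the residual variation.

```r
# Simulated data (hypothetical; not the covid_lab data used in lecture)
set.seed(123)
n <- 200
x <- rnorm(n)
y <- 1 + 0.5 * x + rnorm(n)

m <- lm(y ~ x)

# Total variation in the outcome around its mean
ss_total <- sum((y - mean(y))^2)
# Variation explained by the predictors in the model
ss_explained <- sum((fitted(m) - mean(y))^2)
# Remaining variation not explained by the predictors
ss_residual <- sum(resid(m)^2)

# The two pieces add up to the total (up to floating-point error)
partition_holds <- isTRUE(all.equal(ss_total, ss_explained + ss_residual))
partition_holds

# R-squared is the explained share of the total variation
r2_by_hand <- ss_explained / ss_total
c(by_hand = r2_by_hand, from_lm = summary(m)$r.squared)
```

The partition holds exactly for least squares models that include an intercept, which is the default in `lm()`.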
Why do coefficients change when we control for variables?
Residualized Regression
Residualized regression is a way of understanding what it means to control for variables in a regression.
Residualized regression provides a way of illustrating what we mean when we say the coefficients describe the unique variance in Y explained by some predictor \(x\) (and only \(x\)).
What’s a residual?
Residuals represent the part of the outcome variable not explained by the predictors in a model.
A residual is the difference between the observed \(y\) and the predicted \(\hat{y}\): \(y - \hat{y}\).
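The definition can be checked directly in R. This is a small sketch on simulated data (the variables are hypothetical): it computes residuals by hand and confirms they match `resid()`, and that they average to zero and are uncorrelated with the predictor.

```r
# Simulated data (hypothetical)
set.seed(42)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

m <- lm(y ~ x)

# Residual = observed outcome minus predicted outcome
y_hat <- fitted(m)
e <- y - y_hat

# Same values as R's built-in extractor
same_as_resid <- isTRUE(all.equal(unname(e), unname(resid(m))))
same_as_resid

# By construction, residuals average to (numerically) zero
# and are uncorrelated with the predictors in the model
mean(e)
cor(e, x)
```

That last property is what makes residualized regression work: the residuals from a regression on a control variable contain only the variation the control cannot account for.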
For a model like m2 we can recover the coefficient on rep_voteshare_std by:
1. Regressing new_deaths_pc_14day on med_age_std to get the residual variation in Covid-19 deaths not explained by median age
2. Regressing rep_voteshare_std on med_age_std to get the residual variation in Republican Vote Share not explained by median age
3. Regressing the residuals from 1. (Deaths not explained by age) on the residuals from 2. (Vote share not explained by age) to obtain the same coefficient from m2 for rep_voteshare_std
The same principle holds for m3.
# 1. Regressing `new_deaths_pc_14day` on `med_age_std`
m2_death_by_age <- lm(new_deaths_pc_14day ~ med_age_std, covid_lab)
# Save residuals
covid_lab$res_death_no_age <- resid(m2_death_by_age)

# 2. Regressing `rep_voteshare_std` on `med_age_std`
m2_repvs_by_age <- lm(rep_voteshare_std ~ med_age_std, covid_lab)
# Save residuals
covid_lab$res_repvs_no_age <- resid(m2_repvs_by_age)

# 3. Residualized regression of deaths on Rep Vote Share
m2_res <- lm(res_death_no_age ~ res_repvs_no_age, covid_lab)

# Multiple regression
coef(m2)[2]
rep_voteshare_std
0.230745
# Residualized regression
coef(m2_res)[2]
res_repvs_no_age
0.230745
# 1. Regressing `new_deaths_pc_14day` on `med_age_std` and `med_income_std`
m3_death_by_age_income <- lm(new_deaths_pc_14day ~ med_age_std + med_income_std, covid_lab)
# Save residuals
covid_lab$res_death_no_age_income <- resid(m3_death_by_age_income)

# 2. Regressing `rep_voteshare_std` on `med_age_std` and `med_income_std`
m3_repvs_by_age_income <- lm(rep_voteshare_std ~ med_age_std + med_income_std, covid_lab)
# Save residuals
covid_lab$res_repvs_no_age_income <- resid(m3_repvs_by_age_income)

# 3. Residualized regression of deaths on Rep Vote Share
m3_res <- lm(res_death_no_age_income ~ res_repvs_no_age_income, covid_lab)

# Multiple regression coefficient
coef(m3)[2]
rep_voteshare_std
0.07140446
# Same as residualized regression coefficient
coef(m3_res)[2]
res_repvs_no_age_income
0.07140446
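The same equivalence can be checked without the lab data. The sketch below uses simulated variables (`x`, `z`, and `y` are hypothetical stand-ins for a predictor, a control, and an outcome) and also shows how the coefficient on `x` differs when the correlated control is left out.

```r
# Simulated data (hypothetical stand-ins, not covid_lab)
set.seed(2024)
n <- 500
z <- rnorm(n)                          # control variable
x <- 0.6 * z + rnorm(n)                # predictor, correlated with the control
y <- 1 + 0.3 * x + 0.5 * z + rnorm(n)  # outcome depends on both

# Multiple regression controlling for z
m_full <- lm(y ~ x + z)

# Residualized regression: strip z out of both y and x, then regress
y_res <- resid(lm(y ~ z))
x_res <- resid(lm(x ~ z))
m_res <- lm(y_res ~ x_res)

# The two coefficients on x agree
b_full <- unname(coef(m_full)["x"])
b_res  <- unname(coef(m_res)["x_res"])
c(multiple = b_full, residualized = b_res)

# Without the control, x also picks up variation it shares with z
b_bivariate <- unname(coef(lm(y ~ x))["x"])
b_bivariate
```

In this simulation the bivariate coefficient differs from the controlled one because `x` and `z` are correlated and `z` also predicts `y`; if either of those links were absent, adding `z` would leave the coefficient on `x` essentially unchanged.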
Why did the coefficient on Rep Vote Share change in m3 but not m2?
Suppose we thought the marginal effect of vaccines (here, the predicted change in deaths from a one-point increase in the percent of the population vaccinated) varied.
There might be large gains from going from low to average rates of vaccination, but after a certain threshold, the decreases in deaths would taper off.
We could test this by including a polynomial term, I(percent_vaccinated^2), in our model.
Including a polynomial term allows the marginal effect to vary based on the value of the predictor.
It’s hard to interpret the coefficients on polynomial terms (or interaction terms) just by looking at coefficients in a table.
Instead, we’ll produce a plot of predicted values to test these claims.
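A minimal sketch of this workflow follows. The data are simulated (the data frame `sim`, its `deaths` column, and the coefficients used to generate it are assumptions made for illustration, not estimates from the lab data); only the modeling steps carry over.

```r
# Simulated data with diminishing returns to vaccination (hypothetical)
set.seed(1)
n <- 300
percent_vaccinated <- runif(n, 20, 90)
deaths <- 10 - 0.2 * percent_vaccinated +
  0.001 * percent_vaccinated^2 + rnorm(n, sd = 0.5)
sim <- data.frame(percent_vaccinated, deaths)

# The squared term must be wrapped in I() so ^ is treated as arithmetic
m_poly <- lm(deaths ~ percent_vaccinated + I(percent_vaccinated^2), data = sim)

# Predicted values over a grid of the predictor
pred_df <- data.frame(percent_vaccinated = seq(20, 90, by = 1))
pred_df$fit <- predict(m_poly, newdata = pred_df)

# Plot the raw data and overlay the fitted curve
plot(sim$percent_vaccinated, sim$deaths,
     xlab = "Percent vaccinated", ylab = "Deaths (simulated)",
     col = "grey60", pch = 16)
lines(pred_df$percent_vaccinated, pred_df$fit, lwd = 2)

# With a squared term, the marginal effect at x is b1 + 2 * b2 * x
b <- coef(m_poly)
marginal_effect <- function(x) unname(b[2] + 2 * b[3] * x)
marginal_effect(c(30, 80))
```

If the gains taper off, the marginal effect should be more negative at low vaccination rates than at high ones, which is what the fitted curve flattening out shows visually.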
When models are nested (larger models contain all the predictors of smaller models), we can ask: does including the additional predictors in the larger model explain more variation in the outcome than we would expect if we just added additional, random variables?
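One common way to make this comparison in R is an F-test between nested models using `anova()`. The sketch below assumes that approach and uses simulated data (all variable names are hypothetical): one larger model adds a real predictor and another adds pure noise.

```r
# Simulated data (hypothetical)
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
noise <- rnorm(n)   # a random variable unrelated to the outcome
y <- 1 + 0.5 * x1 + 0.4 * x2 + rnorm(n)

m_small <- lm(y ~ x1)          # smaller model
m_large <- lm(y ~ x1 + x2)     # nested: adds a real predictor
m_noise <- lm(y ~ x1 + noise)  # nested: adds a random variable

# R-squared never goes down when a predictor is added, even a random one
r2 <- c(small = summary(m_small)$r.squared,
        large = summary(m_large)$r.squared,
        noise = summary(m_noise)$r.squared)
r2

# F-tests: is the improvement bigger than we would expect by chance?
test_large <- anova(m_small, m_large)
test_noise <- anova(m_small, m_noise)
test_large
test_noise
```

Comparing the two tables shows why the test is needed: both larger models have a higher R-squared than the small one, but only the F-test tells us whether that improvement is larger than what a random variable would typically deliver.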
In the 1800s, cholera was thought to be transmitted through the air.
John Snow (the physician, not the snack) set out to explore the origins of cholera, eventually concluding that it was transmitted through living organisms in water.
He leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.
Notation
Let’s adopt a little notation to help us think about the logic of Snow’s design:
\(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)
\(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).
\(Y_{di}(t)\): the potential outcome of unit \(i\) at time \(t\) under treatment status \(d\)
\(Y_{1i}(t)\): the potential outcome of unit \(i\) when treated between the two periods
\(Y_{0i}(t)\): the potential outcome of unit \(i\) when not treated (control) between the two periods
Causal Effects
The individual causal effect for unit \(i\) at time \(t\) is:
\[\tau_{it} = Y_{1i}(t) - Y_{0i}(t)\]
\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.
In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
That is, we compare what happened in Lambeth with what would have happened had its water company not moved its pipes.
What we can observe is:
| | Pre-Period (T=0) | Post-Period (T=1) |
|---|---|---|
| Treated \(D_i=1\) | \(E[Y_{0i}(0)\vert D_i = 1]\) | \(E[Y_{1i}(1)\vert D_i = 1]\) |
| Control \(D_i=0\) | \(E[Y_{0i}(0)\vert D_i = 0]\) | \(E[Y_{0i}(1)\vert D_i = 0]\) |
Data
Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
How can we estimate the effect of moving pumps upstream?
Recall, our goal is to estimate the effect of the treatment on the treated:
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
Let’s consider some strategies Snow could take to estimate this quantity:
Before vs after comparisons:
Snow could have compared Lambeth in 1854 \((E[Y_i(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_i(0)|D_i = 1]=85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.
This assumes Lambeth’s pre-treatment outcomes in 1849 are a good proxy for what its outcomes would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).
A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854.
Treatment vs control comparisons in the post period:
Snow could have compared outcomes between Lambeth and S&V in 1854 (\(E[Y_i(1)|D_i = 1] - E[Y_i(1)|D_i = 0]\)), concluding that the change in pump locations led to 128 fewer deaths.
Here the assumption is that the outcomes in S&V in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\).
Again, our skeptic could argue Lambeth \(\neq\) S&V.
Difference in Differences
To address these concerns, Snow employed what we now call a difference-in-differences design.
There are two, equivalent ways to view this design.
Difference 1: Average change over time among the treated (Lambeth)
Difference 2: Average change over time among the controls (S&V)
Difference in Differences
You’ll see the DiD design represented both ways, but they produce the same result:
\[
\tau_{ATT} = (19-147) - (85-135) = -78
\]
\[
\tau_{ATT} = (19-85) - (147-135) = -78
\]
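The comparisons discussed so far can be reproduced from the four numbers in the data table. The sketch below is only an illustration: the data frame `did`, the helper `y()`, and the regression with an interaction term are names and steps introduced here, not part of Snow's analysis.

```r
# The 2x2 table of cholera deaths from the slides
did <- data.frame(
  company = c("Lambeth", "Lambeth", "S&V", "S&V"),
  treated = c(1, 1, 0, 0),   # D: 1 = Lambeth, 0 = Southwark and Vauxhall
  post    = c(0, 1, 0, 1),   # T: 0 = 1849, 1 = 1854
  deaths  = c(85, 19, 135, 147)
)

# Hypothetical helper: look up the cell for group d in period t
y <- function(d, t) did$deaths[did$treated == d & did$post == t]

# Naive strategies
before_after <- y(1, 1) - y(1, 0)   # Lambeth 1854 vs Lambeth 1849
post_gap     <- y(1, 1) - y(0, 1)   # Lambeth vs S&V in 1854

# Difference in differences, computed both ways
did_groups <- (y(1, 1) - y(0, 1)) - (y(1, 0) - y(0, 0))
did_time   <- (y(1, 1) - y(1, 0)) - (y(0, 1) - y(0, 0))

c(before_after = before_after, post_gap = post_gap,
  did_groups = did_groups, did_time = did_time)

# The same estimate is the interaction coefficient in a regression
m_did <- lm(deaths ~ treated * post, data = did)
did_reg <- unname(coef(m_did)["treated:post"])
did_reg
```

With only four cells the regression fits the data perfectly, so it offers no standard errors here; the point is simply that the interaction coefficient is the difference in differences.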
Identifying Assumption of a Difference in Differences Design
The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) - Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) - Y_{0i}(0)|D_i = 0]\)
In words: If Lambeth hadn’t moved its pumps, it would have followed a similar path as S&V.
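To see what this assumption buys us, here is a worked sketch using the notation above and the four numbers from the data table. Rearranging parallel trends gives the counterfactual we never observe, Lambeth’s untreated outcome in 1854, as Lambeth’s 1849 outcome plus the change observed in S&V:

\[
E[Y_{0i}(1)|D_i = 1] = E[Y_{0i}(0)|D_i = 1] + \left(E[Y_{0i}(1)|D_i = 0] - E[Y_{0i}(0)|D_i = 0]\right) = 85 + (147 - 135) = 97
\]

Plugging this into the definition of the ATT recovers the difference-in-differences estimate:

\[
\tau_{ATT} = E[Y_{1i}(1)|D_i = 1] - E[Y_{0i}(1)|D_i = 1] = 19 - 97 = -78
\]

The estimate is only as credible as the assumption: if Lambeth’s deaths would have changed differently from S&V’s even without the move, 97 is the wrong counterfactual.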
In this week’s lab, we’ll be conducting a partial replication of Grumbach and Hill (2022) “Rock the Registration: Same Day Registration Increases Turnout of Young Voters.”
On Thursday, we’ll walk through:
the paper’s design and argument
setting up and exploring the data
reproducing some descriptive figures
Next Thursday, we’ll focus on replicating and understanding the main results
General Structure of Labs 7-8
Lab 7:
Summarize the study
Download and load the data
Recode the data
Merge the data
Recreate Figures 1 and 2
Lab 8:
Estimate some baseline models to understand Two-Way Fixed Effects
Estimate some of the models in Figure 3
Extend the study, perhaps considering SDR by race or gender
Reading Grumbach and Hill (2022)
Reading Grumbach and Hill (2022), focus on being able to answer the following:
What’s the research question?
General RQ: First sentence, second paragraph, p. 405
Specific RQs: pp. 405-406
What’s the theoretical framework?
Intro and Theory of Registration, pp. 407-409
What’s the empirical design?
Methods pp. 409-410
What are the main results?
Results pp. 410-413
Figure 3 in particular
Q1: Download the replication files
Rather than loading the data directly from the paper’s replication archive, in this lab we will download the replication files to your computers and then load the data into R from where they’re saved.
Please click here and let’s download the files together.
5. Save and unzip the downloaded files into your course folder where your labs are saved
Q3: Load the data into R
If you’ve saved the dataverse_files into the folder where your lab is saved, you should be able to run the following code after setting the working directory to source file location:
# Remember to set working directory:
# Session > Set working directory > Source file location

# Packages for read_csv() and %>% (assumes readr and dplyr are installed)
library(readr)
library(dplyr)

# Load fips_codes
fips_codes <- read_csv("dataverse_files/fips_codes_website.csv") %>%
  janitor::clean_names()

# Load policy data
data <- readRDS("dataverse_files/policy_data_updated.RDS") %>%
  janitor::clean_names()

# Load CPS data
cps <- read_csv("dataverse_files/cps_00021.csv") %>%
  janitor::clean_names()
Summary
References
Goodman-Bacon, Andrew. 2021. “Difference-in-differences with variation in treatment timing.” Journal of Econometrics 225 (2): 254–77.
Grumbach, Jacob M, and Charlotte Hill. 2022. “Rock the Registration: Same Day Registration Increases Turnout of Young Voters.” The Journal of Politics 84 (1): 405–17.