In words, this formula says: to calculate the average of \(x\), we sum up all the values of \(x_i\) from observation \(i=1\) to \(i=n\) and then divide by the total number of observations, \(n\)
Mean: Definitional
In this class, I don’t put a lot of weight on memorizing definitions (that’s what Google’s for).
But being comfortable with “the math” is important and useful
Definitional knowledge is a prerequisite for understanding more theoretical claims.
Mean: Theoretical
Suppose I asked you to show that the sum of deviations from the mean equals 0.
Showing the deviations sum to 0 is another way of saying the mean is a balancing point.
This turns out to be a useful property of means that will reappear throughout the course
If I asked you to make a prediction, \(\hat{x}\), of a random person’s height in this class, the mean would have the lowest mean squared error (MSE \(=\frac{1}{n}\sum_{i=1}^n (x_i - \hat{x})^2\))
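Two quick ways to convince yourself of these claims. The balancing-point identity follows in one line:

\[
\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0
\]

And the MSE claim can be checked numerically; here is a minimal sketch in R with simulated heights (the numbers are made up and nothing depends on the actual class):

```r
# Numeric check that the sample mean minimizes mean squared error
set.seed(123)
heights <- rnorm(30, mean = 170, sd = 10)         # hypothetical heights in cm
mse <- function(guess) mean((heights - guess)^2)  # MSE for a single prediction
guesses <- seq(160, 180, by = 0.01)               # grid of candidate predictions
guesses[which.min(sapply(guesses, mse))]          # best guess on the grid...
mean(heights)                                     # ...is (approximately) the sample mean
```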
Mean: Theoretical
Occasionally, you’ll read or hear me say things like:
The sample mean is an unbiased estimator of the population mean
In a statistics class, we would take time to prove this.
The sample mean is an unbiased estimator of the population mean
Claim:
Let \(x_1, x_2, \dots, x_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2\)
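We’ll skip the formal proof, but here is a short sketch of the argument, using only the linearity of expectation:

\[
E[\bar{x}] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \frac{1}{n}\cdot n\mu = \mu
\]

So on average, across repeated samples, the sample mean hits the population mean \(\mu\): that is what “unbiased” means.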
df %>%
  mutate(
    # Turn numeric values into factor labels
    Reincarnation = forcats::as_factor(reincarnation),
    # Order factor in decreasing frequency of levels
    Reincarnation = forcats::fct_infreq(Reincarnation),
    # Reverse order so levels are increasing in frequency
    Reincarnation = forcats::fct_rev(Reincarnation),
    # Rename explanations
    Why = reincarnation_why
  ) -> df
table(recode= df$Reincarnation, original = df$reincarnation)
df %>% # Data
  # Aesthetics
  ggplot(aes(x = Reincarnation, fill = Reincarnation)) +
  # Geometry
  geom_bar(stat = "count") + # Statistic
  ## Include levels of Reincarnation w/ no values
  scale_x_discrete(drop = FALSE) +
  # Don't include a legend
  scale_fill_discrete(drop = FALSE, guide = "none") +
  # Flip x and y
  coord_flip() +
  # Remove lines
  theme_classic() -> fig1
df %>%
  mutate(
    # Create numeric id
    id = 1:n(),
    # Create a label with 3 answers and NA elsewhere
    Label = case_when(
      id == 10 ~ str_wrap(reincarnation_why[10], 30),
      id == 20 ~ str_wrap(reincarnation_why[20], 30),
      id == 18 ~ str_wrap(reincarnation_why[18], 30),
      TRUE ~ NA_character_
    )
  ) -> df
# Calculate totals before calling ggplot
plot_df <- df %>%
  group_by(Reincarnation) %>%
  summarise(
    Count = n(),
    Why = unique(Label)
  ) %>%
  ungroup() %>%
  # Kludge to get rid of NA rows...
  slice(c(1, 2, 3, 5, 7))
plot_df %>%
  ggplot(aes(x = Reincarnation, y = Count, fill = Reincarnation, label = Why)) +
  geom_bar(stat = "identity") + #<<
  ## Include levels of Reincarnation w/ no values
  scale_x_discrete(drop = FALSE) +
  # Don't include a legend
  scale_fill_discrete(drop = FALSE, guide = "none") +
  coord_flip() +
  labs(
    x = "", y = "",
    title = "You're about to be reincarnated.\nWhat do you want to come back as?"
  ) +
  theme_classic() +
  ggrepel::geom_label_repel(
    fill = "white", nudge_y = 1, hjust = "left", size = 3,
    arrow = arrow(length = unit(0.015, "npc"))
  ) +
  scale_y_continuous(
    breaks = c(0, 2, 4, 6, 8, 10, 12),
    expand = expansion(add = c(0, 6))
  ) -> fig1
Data visualization is an iterative process
Data visualization is an iterative process
Good data viz requires lots of data transformations
Start with a minimum working example and build from there
Don’t let the perfect be the enemy of the good enough.
Setup
New packages
This week’s lab we’ll be using the dataverse package to download data on presidential elections
Next week’s lab, we’ll be using the tidycensus package to download census data.
We’ll also need to sign up for and install a Census API key to get the data (see the sketch below).
Here’s a detailed guide of what we’ll do in class right now.
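As a preview, the tidycensus key setup usually amounts to a couple of lines like the sketch below; treat the signup URL and the exact call as things we’ll confirm together in class:

```r
# Request a free key at https://api.census.gov/data/key_signup.html, then store it
# once so future R sessions can find it (replace the placeholder with your own key)
# install.packages("tidycensus")
tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)
```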
Conceptually, this lab is designed to help reinforce the relationship between linear models like \(y=\beta_0 + \beta_1x\) and the conditional expectation function \(E[Y|X]\).
Substantively, we will explore David Leonhardt’s claims about “Red Covid”: the political polarization of vaccination and its consequences
Lab: Questions 1-5: Review
Questions 1-5 are designed to reinforce your data wrangling skills. In particular, you will get practice:
Creating and recoding variables using mutate()
Calculating a moving average or rolling mean using the rollmean() function from the zoo package
Transforming the data on presidential elections so that it can be merged with the data on Covid-19 using the pivot_wider() function (see the sketch after this list).
Merging data together using the left_join() function.
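Here is a rough sketch of what the reshaping step might look like. The column names (party_simplified, candidatevotes, totalvotes) are what the MIT data typically uses, but double-check them with names(pres_df); your lab code may differ in the details:

```r
# Illustrative only: reshape 2020 two-party results so each state is one row
pres_df %>%
  filter(year == 2020, party_simplified %in% c("DEMOCRAT", "REPUBLICAN")) %>%
  group_by(state, party_simplified) %>%
  summarise(
    vote_share = sum(candidatevotes) / max(totalvotes), # share of all votes cast
    .groups = "drop"
  ) %>%
  tidyr::pivot_wider(
    names_from = party_simplified, # party names become columns
    values_from = vote_share
  ) -> pres2020_df
```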
Lab: Questions 6-10: Simple Linear Regression
In question 6, you will see how calculating conditional means provides a simple test of the “Red Covid” claim.
In question 7, you will see how a linear model returns the same information as these conditional means (in a slightly different format)
In question 8, you will get practice interpreting linear models with continuous predictors (i.e. predictors that take on a range of values)
In question 9, you will get practice visualizing these models and using the figures to help interpret your results substantively.
Question 10 asks you to play the role of a skeptic and consider what other factors might explain the relationships we found in Questions 6-9. We will explore these factors in next week’s lab.
Guidance
The following slides provide detailed explanations of all the code you’ll need for each question.
Q2.2. asks you to write code that will download data on presidential elections from 1976 to 2020 from the MIT Election Lab’s dataverse
Once you’ve installed the dataverse package you should be able to do this:
# Try this code first
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
pres_df <- dataverse::get_dataframe_by_name(
  "1976-2020-president.tab",
  "doi:10.7910/DVN/42MVDX"
)
# If the code above fails, comment it out and uncomment the code below:
# load(url("https://pols1600.paultesta.org/files/data/pres_df.rda"))
Q3 Describe the structure of each dataset
Question 3 asks you to describe the structure of each dataset.
Specifically, it asks you to get a high level overview of covid and pres_df and describe the unit of analysis in each dataset:
Describe substantively what specific observation each row in the dataset corresponds to
In the covid dataset, the unit of analysis is a state-date
Q3 Describe the structure of each dataset
Here’s some possible code you could use to get a quick HLO of each dataset:
# check names in `covid`
names(covid)
# take a quick look at the values of each variable
glimpse(covid)
# Look at the first few observations for:
# date, administrative_area_level_2
covid %>%
  select(date, administrative_area_level_2) %>%
  head()
# Summarize data to get a better sense of the unit of observation
covid %>%
  group_by(administrative_area_level_2) %>%
  summarise(
    n = n(), # Number of observations for each state
    start_date = min(date, na.rm = T),
    end_date = max(date, na.rm = T)
  ) -> hlo_covid_df
hlo_covid_df
# How many unique values of date and state are there:
n_dates <- length(unique(covid$date))
n_states <- length(unique(covid$administrative_area_level_2))
n_dates
n_states
# If we had observations for every state on every date, then the number of rows
# in the data
dim(covid)[1]
# should equal
dim(covid)[1] == n_dates * n_states
# This is what economists would call an unbalanced panel
# check names in `pres_df`
names(pres_df)
# take a quick look at the values of each variable
glimpse(pres_df)
# Unit of analysis is a year-state-candidate
pres_df %>%
  select(year, state_po, candidate) %>%
  head()
# How many states?
length(unique(pres_df$state_po))
# How many candidates and parties on the ballot in a given election year
pres_df %>%
  group_by(year) %>%
  summarise(
    n_candidates = length(unique(candidate)),
    # Look at both party_detailed and party_simplified
    n_parties_detailed = length(unique(party_detailed)),
    n_parties_simplified = length(unique(party_simplified))
  ) -> hlo_pres_df
hlo_pres_df
# Look at 2020
# pres_df$candidate[pres_df$year == "2020"]
Q4 Recode the data for analysis
Using our understanding of the structure of the data, Q4 asks you to:
Recode the Covid-19 data like we’ve done before plus
Calculate rolling means, 7 and 14 day averages
Reshape, recode, and filter the presidential election data
This is the same code we’ve used before to create covid_us from covid, with the addition of code to calculate a rolling mean (moving average) of the number of new cases
# Create a vector containing US territories
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
)
# Filter out territories and create state variable
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories) %>%
  mutate(
    state = administrative_area_level_2
  )
# Calculate new cases, new cases per capita, and 7-day average
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000,
    new_cases_pc_7da = zoo::rollmean(new_cases_pc,
      k = 7, align = "right", fill = NA
    )
  ) -> covid_us
# Recode facemask policy
covid_us %>%
  mutate(
    # Recode facial_coverings to create face_masks
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    ),
    # Turn face_masks into a factor with ordered policy levels
    face_masks = factor(face_masks,
      levels = c(
        "No policy", "Recommended", "Some requirements",
        "Required shared places", "Required all times"
      )
    )
  ) -> covid_us
# Create year-month and percent vaccinated variables
covid_us %>%
  mutate(
    year = year(date),
    month = month(date),
    year_month = paste(year, str_pad(month, width = 2, pad = 0), sep = "-"),
    percent_vaccinated = people_fully_vaccinated / population * 100
  ) -> covid_us
# Calculate new cases, new cases per capita, and 7-day average
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000,
    new_cases_pc_7day = zoo::rollmean(new_cases_pc,
      k = 7, align = "right", fill = NA
    )
  ) -> covid_us
covid_us %>%
  filter(date > "2020-03-05", state == "Minnesota") %>%
  select(date, new_cases_pc, new_cases_pc_7day) %>%
  ggplot(aes(date, new_cases_pc)) +
  geom_line(aes(col = "Daily")) +
  # set y aesthetic for second line of rolling average
  geom_line(aes(y = new_cases_pc_7day, col = "7-day average")) +
  theme(legend.position = "bottom") +
  labs(
    col = "Measure",
    y = "New Cases Per 100k", x = "",
    title = "Minnesota"
  ) -> fig_covid_mn
Q5 asks you to merge the 2020 election data from pres2020_df into covid_us by the common state variable in each dataset, using the left_join() function
Make sure the values of state are the same in each dataset
Check for differences in spelling, punctuation, etc.
Check the dimensions of output of your left_join()
If there is a 1-1 match, the number of rows should be the same before and after the merge
Tip
In general (although not in my sample code), you should save the merged data to a new object in R. Saving it back into an existing object can cause issues if you run the merge code multiple times by mistake.
# Should be 51 (50 states + DC) in each
sum(unique(pres_df$state) %in% covid_us$state)
[1] 0
# Look at each state variable
## With [] index
pres_df$state[1:5]
# Matching is case sensitive
# make pres_df$state title case
## Base R:
pres_df$state <- str_to_title(pres_df$state)
## Tidy R:
pres_df %>%
  mutate(
    state = str_to_title(state)
  ) -> pres_df
# Should be 51
sum(unique(pres_df$state) %in% covid_us$state)
[1] 50
# Find the mismatch:
unique(pres_df$state[!pres_df$state %in% covid_us$state])
[1] "District Of Columbia"
# Two equivalent ways to fix this mismatch
## Base R: Quick fix to change spelling of DC
pres_df$state[pres_df$state == "District Of Columbia"] <- "District of Columbia"
## Tidy R: Quick fix to change spelling of DC
pres_df %>%
  mutate(
    state = ifelse(
      test = state == "District Of Columbia",
      yes = "District of Columbia",
      no = state
    )
  ) -> pres_df
# Problem solved
sum(unique(pres_df$state) %in% covid_us$state)
[1] 51
Causal Inference
Causal inference is about counterfactual comparisons
Causal inference is about counterfactual comparisons
What would have happened if some aspect of the world either had or had not been present
Causal Identification
Causal identification refers to “the assumptions needed for statistical estimates to be given a causal interpretation” (Keele 2015)
What do we need to assume to make our claims about cause and effect credible?
Experimental Designs rely on randomization of treatment to justify their causal claims
Observational Designs require additional assumptions and knowledge to make causal claims
Experimental Designs
Experimental designs are studies in which a causal variable of interest, the treatment, is manipulated by the researcher to examine its causal effects on some outcome of interest
Random assignment is the key to causal identification in experiments because it creates statistical independence between treatment and potential outcomes (and any potential confounding factors)
Observational designs are studies in which a causal variable of interest is determined by someone/thing other than the researcher (nature, governments, people, etc.)
Since treatment has not been randomly assigned, observational studies typically require stronger assumptions to make causal claims.
Generally speaking, these assumptions amount to a claim about conditional independence
Where after conditioning on \(K_i\), some knowledge about the world and how the data were generated, our treatment is as good as (as-if) randomly assigned (hence conditionally independent)
Economists often call this the assumption of selection on observables
Causal Inference in Observational Studies
To understand how to make causal claims in observational studies we will:
Introduce the concept of Directed Acyclic Graphs to describe causal relationships
Discuss three approaches to covariate adjustment
Subclassification
Matching
Linear Regression
Three research designs for observational data
Differences-in-Differences
Regression Discontinuity Designs
Instrumental Variables
Directed Acyclic Graphs
Two Ways to Describe Causal Claims
In this course, we will use two forms of notation to describe our causal claims.
Potential Outcomes Notation (last lecture)
Illustrates the fundamental problem of causal inference
Directed Acyclic Graphs (DAGs)
Illustrates potential bias from confounders and colliders
Directed Acyclic Graphs
Directed Acyclic Graphs provide a way of encoding assumptions about causal relationships
Directed Arrows \(\to\) describe a direct causal effect
An arrow from \(D\to Y\) means \(Y_i(d) \neq Y_i(d^\prime)\): “The outcome ( \(Y\) ) for person \(i\) when \(D\) happens ( \(Y_i(d)\) ) is different than the outcome when \(D\) doesn’t happen ( \(Y_i(d^\prime)\) )”
No arrow = no effect ( \(Y_i(d) = Y_i(d^\prime)\) )
Covariate adjustment refers to a broad class of procedures that try to make a comparison more credible or meaningful by adjusting for some other potentially confounding factor.
Covariate Adjustment
When you hear people talk about
Controlling for age
Conditional on income
Holding age and income constant
Ceteris paribus (All else equal)
They are typically talking about some sort of covariate adjustment.
Three approaches to covariate adjustment
Subclassification
Matching
Regression
Causal Identification through Subclassification
Motivation: Treatment, \(D\) is not randomly assigned
The average treatment effect is identified by the observed difference of means between treatment and control conditional on the values of \(X\)
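One common way to write the subclassification estimand (a sketch, for a discrete \(X\)):

\[
\tau_{ATE} = \sum_{x}\Big(E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]\Big)\,Pr(X_i = x)
\]

That is, take the treatment-control difference in means within each stratum of \(X\), then average those differences, weighting by how common each stratum is.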
Causal Identification through Subclassification
Economists call \(Y_i(1),Y_i(0) \perp D_i |X_i\) an assumption of Selection on Observables
Controlling for what we can observe, \(X\), \(D\) is conditionally independent of Potential Outcomes
Violated if there were some other factor, \(U\) that influenced both \(D\) and \(Y\) (i.e. \(U\) is a confounder)
\(0 < Pr(D = 1|X) < 1\) is called an assumption of Common Support
There is a non-zero probability of receiving the treatment for all values of X
Violated if only one subgroup had access to the treatment (e.g. Vaccine by age group comparisons)
Example of Subclassification
We used subclassification when we compared the unconditional rates of new Covid-19 cases by face mask policy to the conditional rates of new cases by policy regime in each month of our data.
Overall rates are misleading.
Lots of things differ between January 2020 and January 2022
Subclassification by month provides a “fairer” comparison
But is it “causal”?
Limits of Subclassification
Even controlling for “month” there are other omitted variables:
Other policies in place?
Socio-economic differences between states
Others?
Trying to subclassify (stratify) comparisons on more than one or two variables gets hard
The Curse of Dimensionality
The Curse of Dimensionality
As we try to control for more factors, the number of observations per dimension declines rapidly
Men vs Women
Men, ages 20-30 vs Men ages 30-40
Men, ages 20-30 with college degrees and blue eyes vs Men ages 20-30 with college degrees and green eyes
Subclassification with more than a few variables will often produce a lack of common support:
Not enough observations to make credible counterfactual comparisons
Matching
Matching refers to a broad set of procedures that essentially try to generalize subclassification to
address the curse of dimensionality
achieve balance on a range of observable covariates between treated and control groups
Matching
Different types of matching procedures:
Exact matching: Find exact matches between treatment and control observations for all covariates \(X\)
Coarsened exact matching: Find approximate matches within ranges of values for \(X\)
Distance-metric matching: Calculate a distance metric between observations based on their values of \(X\), and match treated and control to minimize that distance
Propensity score matching: Calculate the propensity to receive treatment, using \(X\) to predict \(D\), and match treated and control observations based on their propensity scores
Matching procedures like propensity score matching allow us to match treated and control observations based on a propensity score: a predicted probability of receiving the treatment, \(D\), based on observed variables, \(X\).
\[
p(X_i) = Pr(D=1|X_i) = \pi_i
\]
Allowing us to estimate an ATE conditional on \(\pi_i\)
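As a rough illustration of the first step only, a propensity score is typically estimated with a logistic regression. The variable names below (treat, age, income) are hypothetical, and the matching itself would normally be done with a dedicated package (e.g., MatchIt) rather than by hand:

```r
# Minimal sketch: estimate each unit's propensity to receive treatment
# (treat, age, and income are hypothetical columns in df)
ps_model <- glm(treat ~ age + income, data = df, family = binomial)
df$pscore <- predict(ps_model, type = "response") # predicted Pr(D = 1 | X)
summary(df$pscore)
# Treated and control units with similar pscore values would then be matched
```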
The mechanics of matching are beyond the scope of this course
Just think of it as a generalization of subclassification when we want to condition on multiple variables
“Solves” the curse of dimensionality, creating Treatment-Control comparisons between groups that are similar on observed covariates
But no guarantee that matching produces balance on unobserved covariates.
Regression
We will spend the next two weeks talking in detail about regression, in general and linear regression in particular.
Today we’ll introduce some basic notation and simple examples
Conceptually, think of regression as
a tool to make predictions
by fitting lines to data
Theoretically, we will build towards an understanding of linear regression as a “linear estimate of the conditional expectation function” \((CEF = E[Y|X])\)
Three approaches to covariate adjustment
Subclassification
👍: Easy to implement and interpret
👎: Curse of dimensionality, Selection on observables
Matching
👍: Balance on multiple covariates, Mirrors logic of experimental design, Fewer functional form assumptions
👎: Selection on observables, Only provides balance on observed variables, Lots of technical details…
Regression
👍: Easy to implement, control for many factors (good and bad)
👎: Selection on observables, Assumes a linear functional form, easy to fit “bad” models
Simple Linear Regression
Understanding Linear Regression
Conceptual
Simple linear regression estimates “a line of best fit” that summarizes relationships between two variables
\[
y_i = \beta_0 + \beta_1x_i + \epsilon_i
\]
Practical
We estimate linear models in R using the lm() function
lm(y ~ x, data = df)
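For example, here’s a minimal, self-contained run on simulated data (df, x, and y here are made up, not from the lab):

```r
# Simulate data where the true intercept is 1 and the true slope is 2
set.seed(42)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)

m1 <- lm(y ~ x, data = df) # fit the linear model
coef(m1)                   # estimated beta_0 (intercept) and beta_1 (slope)
summary(m1)                # fuller output: standard errors, R-squared, etc.
```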
Understanding Linear Regression
Technical/Definitional
Linear regression chooses \(\beta_0\) and \(\beta_1\) to minimize the Sum of Squared Residuals (SSR): \(SSR = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\)
Linear regression provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\)
Conceptual: Linear Regression
Conceptual: Linear Regression
Regression is a tool for describing relationships.
How does some outcome we’re interested in tend to change as some predictor of that outcome changes?
How does economic development vary with democracy?
How does economic development vary with democracy, adjusting for natural resources like oil and gas?
Conceptual: Linear Regression
More formally:
\[
y_i = f(x_i) + \epsilon
\]
Y is a function of X plus some error, \(\epsilon\)
Linear regression assumes that the relationship between an outcome and a predictor can be described by a linear function
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
Linear Regression and the Line of Best Fit
The goal of linear regression is to choose coefficients \(\beta_0\) and \(\beta_1\) that summarize the relationship between \(y\) and \(x\)
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
To accomplish this we need some sort of criteria.
For linear regression, that criterion is minimizing the error between what our model predicts, \(\hat{y_i} = \beta_0 + \beta_1 x_i\), and what we actually observe, \(y_i\)
More on this to come. But first…
Regression Notation
\(y_i\) an outcome variable or thing we’re trying to explain
AKA: The dependent variable, The response Variable, The left hand side of the model
\(x_i\) a predictor variable, or the things we think explain variation in our outcome
AKA: The independent variable, covariates, the right hand side of the model.
Cap or No Cap: I’ll use \(X\) (should be \(\mathbf{X}\)) to denote a set (matrix) of predictor variables. \(y\) vs \(Y\) can also have technical distinctions (Sample vs Population, observed value vs Random Variable, …)
\(\beta\) a set of unknown parameters that describe the relationship between our outcome \(y_i\) and our predictors \(x_i\)
\(\epsilon\) the error term representing variation in \(y_i\) not explained by our model.
Technically \(\epsilon\) refers to theoretical error inherent to the data generating process, while \(\hat{\epsilon}_i\) or \(u_i\) is used to refer to residuals, an estimated error that reflects the difference between the observed (\(y_i\)) and predicted (\(\hat{y}_i\)) values.
Linear Regression
We call this a bivariate regression, because there are only two variables
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
We call this a linear regression, because \(y_i = \beta_0 + \beta_1 x_i\) is the equation for a line, where:
\(\beta_0\) corresponds to the \(y\) intercept, or the model’s prediction when \(x = 0\).
\(\beta_1\) corresponds to the slope, or how \(y\) is predicted to change as \(x\) changes.
Linear Regression
If you find this notation confusing, try plugging in substantive concepts for what \(y\) and \(x\) represent
Say we wanted to know how attitudes to transgender people varied with age in the baseline survey from Lab 03.
Q: How do we choose \(\beta_0\) and \(\beta_1\)?
P: By minimizing the sum of squared errors, in a procedure called Ordinary Least Squares (OLS) regression
Q: Ok, that’s not really that helpful…
What’s an error?
Why would we square and sum them?
How do we minimize them?
P: Good questions!
What’s an error?
An error, \(\epsilon_i\) is simply the difference between the observed value of \(y_i\) and what our model would predict, \(\hat{y_i}\) given some value of \(x_i\). So for a model:
\[y_i=\beta_0+\beta_1 x_{i} + \epsilon_i\]
We simply subtract our model’s prediction \(\beta_0+\beta_1 x_{i}\) from the observed value, \(y_i\)
In an intro stats course, we would walk through the process of finding
\[\textrm{Find } \hat{\beta_0},\,\hat{\beta_1} = \underset{\beta_0, \beta_1}{\text{arg min}} \sum_i (y_i-(\beta_0+\beta_1x_i))^2\] Which involves a little bit of calculus. The big payoff is that
\[\beta_0 = \bar{y} - \beta_1 \bar{x}\] And
\[ \beta_1 = \frac{Cov(x,y)}{Var(x)}\] Which is never quite the epiphany, I think we think it is…
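If you want to see that payoff concretely, here’s a small sketch (simulated x and y, nothing from the course data) checking the closed-form solution against lm():

```r
# Compute the OLS coefficients "by hand" and compare to lm()
set.seed(1)
x <- rnorm(200)
y <- 3 + 0.5 * x + rnorm(200)
b1_by_hand <- cov(x, y) / var(x)             # slope: Cov(x, y) / Var(x)
b0_by_hand <- mean(y) - b1_by_hand * mean(x) # intercept: ybar - b1 * xbar
c(b0_by_hand, b1_by_hand)
coef(lm(y ~ x))                              # should match
```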
The following slides walk you through the mechanics of this exercise. We’re gonna skip through them in class, but they’re there for your reference
How do we minimize \(\sum \epsilon^2\)
To understand what’s going on under the hood, you need a broad understanding of some basic calculus.
The next few slides provide a brief review of derivatives and differential calculus.
Derivatives
The derivative of \(f\) at \(x\) is its rate of change at \(x\)
For a line: the slope
For a curve: the slope of a line tangent to the curve
The chain rule: the derivative of the “outside” times the derivative of the “inside,” remembering that the derivative of the outside function is evaluated at the value of the inside function.
Finding a Local Minimum
Local minimum:
\[
f^{\prime}(x)=0 \text{ and } f^{\prime\prime}(x)>0
\]
We solve for \(\beta_0\) and \(\beta_1\), by taking the partial derivatives with respect to \(\beta_0\) and \(\beta_1\), and setting them equal to zero
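As a sketch, the two first-order conditions look like this (set each partial derivative of the sum of squared errors to zero):

\[
\begin{aligned}
\frac{\partial}{\partial \beta_0}\sum_i (y_i - \beta_0 - \beta_1 x_i)^2 &= -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \\
\frac{\partial}{\partial \beta_1}\sum_i (y_i - \beta_0 - \beta_1 x_i)^2 &= -2\sum_i x_i(y_i - \beta_0 - \beta_1 x_i) = 0
\end{aligned}
\]

Solving the first condition for \(\beta_0\) gives \(\beta_0 = \bar{y} - \beta_1\bar{x}\); substituting that into the second and rearranging gives \(\beta_1 = Cov(x,y)/Var(x)\).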
So the coefficient in a simple linear regression of \(Y\) on \(X\) is simply the ratio of the covariance between \(X\) and \(Y\) over the variance of \(X\). Neat!
Theoretical: OLS provides a linear estimate of CEF: E[Y|X]
Linear Regression is a many splendored thing
Timothy Lin provides a great overview of the various interpretations/motivations for linear regression.
Linear regression provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\)
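One way to see that connection: when the predictor is categorical, lm()’s fitted values are exactly the conditional (group) means. A minimal sketch with simulated data (the group labels A and B are arbitrary):

```r
# Conditional means vs. lm() with a binary predictor
set.seed(7)
d <- data.frame(group = rep(c("A", "B"), each = 50))
d$y <- ifelse(d$group == "A", 10, 15) + rnorm(100)
aggregate(y ~ group, data = d, FUN = mean) # E[Y | group]
coef(lm(y ~ group, data = d))              # intercept = mean in A; slope = mean(B) - mean(A)
```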
Difference-in-Differences
Motivating Example: What causes Cholera?
In the 1800s, cholera was thought to be transmitted through the air.
John Snow (the physician, not the snack) worked to explore the origins of the disease, eventually concluding that cholera was transmitted through living organisms in water.
Leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.
Notation
Let’s adopt a little notation to help us think about the logic of Snow’s design:
\(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)
\(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).
\(Y_{di}(t)\) the potential outcome of unit \(i\) at time \(t\) under treatment status \(d\)
\(Y_{1i}(t)\) the potential outcome of unit \(i\) when treated between the two periods
\(Y_{0i}(t)\) the potential outcome of unit \(i\) when control between the two periods
Causal Effects
The individual causal effect for unit \(i\) at time \(t\) is: \(\tau_{it} = Y_{1i}(t) - Y_{0i}(t)\)
\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.
In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
That is, what would have happened in Lambeth, had their water company not moved their pipes
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
What we can observe is:
| | Pre-Period (T=0) | Post-Period (T=1) |
|---|---|---|
| Treated \(D_i=1\) | \(E[Y_{0i}(0)\vert D_i = 1]\) | \(E[Y_{1i}(1)\vert D_i = 1]\) |
| Control \(D_i=0\) | \(E[Y_{0i}(0)\vert D_i = 0]\) | \(E[Y_{0i}(1)\vert D_i = 0]\) |
Data
Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
How can we estimate the effect of moving pumps upstream?
Recall, our goal is to estimate the effect of the treatment on the treated:
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
Let’s consider some strategies Snow could take to estimate this quantity:
Before vs after comparisons:
Snow could have compared Lambeth in 1854 \((E[Y_i(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_i(0)|D_i = 1]=85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.
Assumes Lambeth’s pre-treatment outcomes in 1849 are a good proxy for what its outcomes would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).
A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Treatment-Control comparisons in the Post Period.
Snow could have compared outcomes between Lambeth and S&V in 1854 (\(E[Y_i(1)|D_i = 1] - E[Y_i(1)|D_i = 0]\)), concluding that the change in pump locations led to 128 fewer deaths.
Here the assumption is that the outcomes in S&V in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\)
Again, our skeptic could argue Lambeth \(\neq\) S&V
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Difference in Differences
To address these concerns, Snow employed what we now call a difference-in-differences design,
There are two, equivalent ways to view this design.
Difference 1: the average change over time in the treated group
Difference 2: the average change over time in the control group
Difference in Differences
You’ll see the DiD design represented both ways, but they produce the same result:
\[
\tau_{ATT} = (19-147) - (85-135) = -78
\]
\[
\tau_{ATT} = (19-85) - (147-135) = -78
\]
Identifying Assumption of a Difference in Differences Design
The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) − Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) − Y_{0i}(0)|D_i = 0]\)
In words: If Lambeth hadn’t moved its pumps, it would have followed a similar path as S&V
Parallel Trends
Summary
A Difference in Differences (DiD, or diff-in-diff) design combines a pre-post comparison, with a treated and control comparison
Taking the pre-post difference removes any fixed differences between the units
Then taking the difference between treated and control differences removes any common differences over time
The key identifying assumption of a DiD design is the “assumption of parallel trends”
Absent treatment, treated and control groups would see the same changes over time.
Hard to prove, possible to test
Extensions and limitations
Diff-in-Diff is easy to estimate with linear regression (see the sketch after this list)
Generalizes to multiple periods and treatment interventions
More pre-treatment periods allow you to assess the “parallel trends” assumption
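For instance, using the four cell values from the Snow table above, the DiD estimate is just the interaction coefficient in a regression of the outcome on treated, post, and their product. This is a hand-built sketch on the aggregated counts (in practice you’d estimate it on unit-level data):

```r
# Reconstruct the 2x2 table as a tiny data frame and estimate the DiD with lm()
did_df <- data.frame(
  deaths  = c(85, 19, 135, 147),
  treated = c(1, 1, 0, 0), # Lambeth = 1; Southwark & Vauxhall = 0
  post    = c(0, 1, 0, 1)  # 1849 = 0; 1854 = 1
)
coef(lm(deaths ~ treated * post, data = did_df))
# The treated:post coefficient is the difference-in-differences estimate: -78
```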
Alternative methods
Synthetic control
Event Study Designs
What if you have multiple treatments or treatments that come and go?