POLS 1600

Multiple Regression

Updated May 31, 2024

Overview

Class Plan

  • Announcements (20 min)
  • Feedback
  • Review: Simple Linear Regression and Lab 5 (15-20 min)
  • Preview: Setup for Lab 6 (10-15 min)
  • Estimating and Interpreting Multiple Regression (25-30 min)
  • Difference-in-Differences (15-20 min)

Announcements

Assignment 1

  • Prompt posted on website here

  • 3 possible research questions uploaded to Canvas by Next Tuesday, March 12

  • The ideal experiment = “What would you have to randomly assign to create a meaningful counterfactual comparison?”

  • Observational study = “What could you say with data that likely exist?”

  • Feedback will help you narrow down a topic for final project

Assignment 2

  • Prompt posted on website here (https://pols1600.paultesta.org/assignments/a2)

  • Uploaded to Canvas by Friday, March 22

  • Revised Research Question

  • Implied Linear Model

  • Code to load possible data

Feedback

  • Only 9 people completed the survey last week…

  • So let’s try to get those numbers up, and I’ll summarize the results of the survey next week

Packages for today

## Packages for today
the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg",
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse", 
  # Analysis
  "DeclareDesign", "easystats", "zoo"
)

## Define a function to load (and if needed install) packages

#| label = "ipak"
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

## Install (if needed) and load libraries in the_packages
ipak(the_packages)
   kableExtra            DT        texreg     tidyverse     lubridate 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      forcats         haven      labelled         ggmap       ggrepel 
         TRUE          TRUE          TRUE          TRUE          TRUE 
     ggridges      ggthemes        ggpubr        GGally        scales 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      dagitty         ggdag       ggforce       COVID19          maps 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      mapdata           qss    tidycensus     dataverse DeclareDesign 
         TRUE          TRUE          TRUE          TRUE          TRUE 
    easystats           zoo 
         TRUE          TRUE 

Lab 5 & Simple Linear Regression

Review: Key Concepts from the Lab

Review: Simple Linear Regression

Let’s pick up where we left off on Thursday. First, we’ll need to run some code to recreate our data.

# ---- Load data ----
## Covid-19 Data
load(url("https://pols1600.paultesta.org/files/data/covid.rda"))
## Presidential Election Data
load(url("https://pols1600.paultesta.org/files/data/pres_df.rda"))

# ---- Recode Covid Data ----
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
  )

# Filter out Territories and create state variable
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)%>%
  mutate(
    state = administrative_area_level_2
  )

# Calculate new cases, new cases per capita, and 7-day average

covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population *100000,
    new_cases_pc_7day = zoo::rollmean(new_cases_pc, 
                                     k = 7, 
                                     align = "right",
                                     fill=NA )
    ) -> covid_us

# Recode facemask policy

covid_us %>%
mutate(
  # Recode facial_coverings to create face_masks
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times",
    ),
    # Turn face_masks into a factor with ordered policy levels
    face_masks = factor(face_masks,
      levels = c("No policy","Recommended",
                 "Some requirements",
                 "Required shared places",
                 "Required all times")
    ) 
    ) -> covid_us

# Create year-month and percent vaccinated variables

covid_us %>%
  mutate(
    year = year(date),
    month = month(date),
    year_month = paste(year, 
                       str_pad(month, width = 2, pad=0), 
                       sep = "-"),
    percent_vaccinated = people_fully_vaccinated/population*100  
    ) -> covid_us

# Recode Deaths
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_deaths = deaths - lag(deaths),
    new_deaths_pc = new_deaths / population *100000,
    new_deaths_pc_7day = zoo::rollmean(new_deaths_pc, 
                                     k = 7, 
                                     align = "right",
                                     fill=NA ),
    new_deaths_pc_14day = zoo::rollmean(new_deaths_pc, 
                                     k = 14, 
                                     align = "right",
                                     fill=NA )
    ) -> covid_us

# ---- Recode Presidential Election Data ----

# Transform Presidential Election data
pres_df %>%
  mutate(
    year_election = year,
    state = str_to_title(state),
    # Fix DC
    state = ifelse(state == "District Of Columbia", "District of Columbia", state)
  ) %>%
  filter(party_simplified %in% c("DEMOCRAT","REPUBLICAN"))%>%
  filter(year == 2020) %>%
  select(state, state_po, year_election, party_simplified, candidatevotes, totalvotes
         ) %>%
  pivot_wider(names_from = party_simplified,
              values_from = candidatevotes) %>%
  mutate(
    dem_voteshare = DEMOCRAT/totalvotes*100,
    rep_voteshare = REPUBLICAN/totalvotes*100,
    winner = forcats::fct_rev(factor(ifelse(rep_voteshare > dem_voteshare,"Trump","Biden")))
  ) -> pres2020_df

# ---- Merge Data ----

dim(covid_us)
[1] 53678    60
dim(pres2020_df)
[1] 51  9
covid_us <- covid_us %>% left_join(
  pres2020_df,
  by = c("state" = "state")
)
dim(covid_us) # Same number of rows as covid_us w/ 8 additional columns
[1] 53678    68

Calculating Conditional Means

Now let’s revisit question 6, which asked you to calculate some conditional means:

  • Overall
  • Before the vaccine was widely available
  • After the vaccine was widely available
# ---- Deaths: Overall ----
covid_us %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Overall"
  ) -> deaths_overall

# ---- Deaths: Pre Vaccine ----
covid_us %>%
  filter(date < "2021-04-19") %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Pre Vaccine"
  ) -> deaths_pre_vax

# ---- Deaths: Post Vaccine ----
covid_us %>%
  filter(date >= "2021-04-19") %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Post Vaccine"
  ) -> deaths_post_vax

# ---- Tidy outputs for display ----

deaths_tab <- deaths_overall %>% 
  bind_rows(
  deaths_pre_vax,
  deaths_post_vax
) %>% 
  mutate(
    comparison = factor(
      comparison,
      levels = c("Overall","Pre Vaccine", "Post Vaccine")
      )
  )
knitr::kable(deaths_tab) %>% 
  kableExtra::kable_styling()
winner   new_deaths   new_deaths_pc_7day   comparison
Trump    19.34867     0.3402479            Overall
Biden    22.04853     0.2871843            Overall
Trump    22.82048     0.4000294            Pre Vaccine
Biden    30.61051     0.3801906            Pre Vaccine
Trump    17.07402     0.3016571            Post Vaccine
Biden    16.30683     0.2257111            Post Vaccine
# 1. Data
deaths_tab %>% 
  # 2. Aesthetics
  ggplot(
    aes(winner, new_deaths_pc_7day,
        fill = winner)
  ) +
  # 3. Geometries
  geom_bar(stat = "identity") +
  # 4. Facets
  facet_grid(~ comparison) +
  # 5. Labels and Themes
  guides(fill = "none") +
  labs(
    x = "States won by",
    y = "Average # of Deaths per 100k\n(7-day rolling average)",
    title = "Red States have more Covid-19 deaths per capita after vaccine"
  )+
  theme_minimal()+
  theme(title = element_text(size = 10,face = "bold")) -> fig_q6

Using OLS to estimate conditional means

Question 7 asked you to estimate the following OLS models:

\[ \text{New Deaths} = \beta_0 + \beta_1 \text{Election Winner} + \epsilon \]

\[ \text{7-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Election Winner} + \epsilon \]

Note

Recall winner is a factor whose levels we set to be c("Trump","Biden").

lm() converts factors into binary indicators. Here 0 = "Trump" and 1 = "Biden".
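
We can inspect this coding directly; a quick sketch using model.matrix() to see the design matrix lm() builds:

## Sketch: inspect the dummy coding lm() uses for winner
levels(covid_us$winner)  # "Trump" is the first (reference) level
head(model.matrix(~ winner, data = covid_us))  # winnerBiden is a 0/1 indicator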

m1_lab <- lm(new_deaths ~ winner, covid_us)
m2_lab <- lm(new_deaths_pc_7day ~ winner, covid_us)
m1_lab

Call:
lm(formula = new_deaths ~ winner, data = covid_us)

Coefficients:
(Intercept)  winnerBiden  
      19.35         2.70  
# Just the coefficients
coef(m2_lab)
(Intercept) winnerBiden 
 0.34024787 -0.05306357 
# Coefficients with summary stats (for later)
summary(m2_lab)

Call:
lm(formula = new_deaths_pc_7day ~ winner, data = covid_us)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0016 -0.2205 -0.1340  0.0966  5.8550 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.340248   0.002487  136.83   <2e-16 ***
winnerBiden -0.053064   0.003475  -15.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3978 on 52447 degrees of freedom
  (1229 observations deleted due to missingness)
Multiple R-squared:  0.004427,  Adjusted R-squared:  0.004408 
F-statistic: 233.2 on 1 and 52447 DF,  p-value: < 2.2e-16
texreg::htmlreg(list(m1_lab,m2_lab),
                custom.header = list(
                  "DV:" = 1:2
                ),
                custom.model.names = c(
                  "new_deaths",
                  "new_deaths_pc_7day"
                ))
Statistical models

              DV:
              new_deaths   new_deaths_pc_7day
(Intercept)   19.35***     0.34***
              (0.39)       (0.00)
winnerBiden   2.70***      -0.05***
              (0.55)       (0.00)
R2            0.00         0.00
Adj. R2       0.00         0.00
Num. obs.     52755        52449
***p < 0.001; **p < 0.01; *p < 0.05

Q7: When the CEF is Linear, OLS = CEF

If the CEF is linear, then OLS = CEF

In a saturated linear model, every group in the data can be represented by a coefficient or combination of coefficients in the model.
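
Before plotting, here’s a quick numeric check (a minimal sketch reusing m2_lab from above): the conditional means of the outcome should match the fitted values implied by the coefficients.

## Conditional means of the outcome by winner
covid_us %>%
  group_by(winner) %>%
  summarise(mean_deaths = mean(new_deaths_pc_7day, na.rm = TRUE))

## The same means, recovered from the model
coef(m2_lab)[1]                    # Trump states: beta_0
coef(m2_lab)[1] + coef(m2_lab)[2]  # Biden states: beta_0 + beta_1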

covid_us %>%
  mutate(
    biden01 = ifelse(winner == "Biden",1,0)
  ) %>% 
  ggplot(aes(biden01, new_deaths_pc_7day))+
  geom_jitter(size=.25,alpha=.15)+
  stat_summary(col="red")+
  stat_summary(aes(label=round(..y..,2)),
               geom = "text", 
               position = position_nudge(y = 1),
               col = "red",
               fun = mean) +
  theme_minimal()-> fig_q7cef

fig_q7cef +
  stat_smooth(method = "lm",se = F)+
  stat_summary(aes(label=round(..y..,2)),
               geom = "text", 
               position = position_nudge(y = .05),
               col = "red",
               fun = mean)+
  coord_cartesian(ylim=c(0,.45))+
  geom_segment(aes(
    x = 0,
    xend = 0,
    y = coef(m2_lab)[1],
    yend = coef(m2_lab)[1] + coef(m2_lab)[2]
  ),
  col = "blue",
  linetype = "dashed")+
  geom_segment(aes(
    x = 0,
    xend = 1,
    y = coef(m2_lab)[1] + coef(m2_lab)[2],
    yend = coef(m2_lab)[1] + coef(m2_lab)[2]
  ),
  col = "blue",
  linetype = "dashed")+
  annotate("text",
           label = expression(paste(beta[0]," = " )),
           x = 0, 
           y = coef(m2_lab)[1] + 0.075,
           hjust = "center",
           col = "blue"
           )+
    annotate("text",
           label = expression(paste(beta[1]," = 0.29 - 0.34 = -0.05" )),
           x = 0, 
           y = coef(m2_lab)[1] +coef(m2_lab)[2] - 0.05,
           hjust = "left",
           col = "blue"
           )+
  annotate("text",
           label = expression(
             paste(beta[0]," + ", beta[1], " = " )),
           x = 1, 
           y = coef(m2_lab)[1] + 0.05,
           hjust = "center",
           col = "blue"
           ) -> fig_q7ols

Q8: Vote shares, vaccinations, and deaths

Q8 asked you to fit three models exploring the relationships between vote shares, vaccinations, and deaths:

\[ \text{m3: 14-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Percent Vaccinated} \]

\[ \text{m4: Percent Vaccinated} = \beta_0 + \beta_1 \text{Republican Vote Share} \]

\[ \text{m5: 14-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Republican Vote Share} \]

We’ll fit each model using data from September 23, 2021, when Leonhardt was writing:

# Deaths modeled by percent vaccinated on 2021-09-23
m3_lab <- lm(new_deaths_pc_14day ~ percent_vaccinated, 
         covid_us,
         subset = date == "2021-09-23")

#  Percent vaccinated modeled by Republican vote share on 2021-09-23
m4_lab <- lm(percent_vaccinated ~ rep_voteshare, 
         covid_us,
         subset = date == "2021-09-23")

# Deaths modeled by Republican vote share on 2021-09-23
m5_lab <- lm(new_deaths_pc_14day ~ rep_voteshare, 
         covid_us,
         subset = date == "2021-09-23")
coef(m3_lab)
       (Intercept) percent_vaccinated 
         2.4232235         -0.0330773 
coef(m4_lab)
  (Intercept) rep_voteshare 
   84.3326265    -0.5731175 
coef(m5_lab)
  (Intercept) rep_voteshare 
  -0.36296799    0.01888996 

Q9: Visualizing Vote Shares and Deaths

Q9 asks you to visualize the results of model m5.

Let’s take a basic figure, and make it fetch!

covid_us %>%
  # Only use observations from September 23, 2021
  filter(date == "2021-09-23") %>%
  # Exclude DC
  filter(state != "District of Columbia") %>%
  # Set aesthetics
  ggplot(aes(x = rep_voteshare,
             y = new_deaths_pc_14day))+
  # Set geometries
  geom_point(aes(col=rep_voteshare),size=2,alpha=.5)+
  # Include the linear regression of lm(new_deaths_pc_14day ~ rep_voteshare)
  geom_smooth(method = "lm", se=F,
              col = "grey", linetype =2) -> fig_m5_lab
fig_m5_lab

fig_m5_lab +
  # two way gradient, blue states -> blue, red states -> red, swing -> grey
  scale_color_gradient2(
    midpoint = 50,
    low = "blue", mid = "grey", high = "red",
    guide = "none")+
  # Vertical line at 50% threshold
  geom_vline(xintercept = 50, 
             col = "grey",linetype = 3)+
  # Add labels
  ggrepel::geom_text_repel(aes(label = state_po), size=2)+
  # theme with minimal lines
  theme_classic()+
  # Labels
  labs(
    x = "Republican Vote Share\n 2020 Presidential Election",
    y = "New Covid-19 Deaths per 100k residents\n (14-day average)",
    title = "Partisan Gaps in Covid-19 Deaths at the State Level",
    subtitle = "Data from Sept. 23, 2021"
  ) -> fig_m5_lab_fetch
fig_m5_lab_fetch

Q10: Alternative Explanations

  • Finally, Q10 asked you to consider some possible alternative explanations or omitted variables that might explain the positive relationship between Republican Vote Share and Covid-19 deaths.

  • In this week’s lab, we’ll see how we can use multiple regression to evaluate these claims.

Previewing Lab 6

Red Covid

The core thesis of Red Covid is something like the following:

Since Covid-19 vaccines became widely available to the general public in the spring of 2021, Republicans have been less likely to get the vaccine. Lower rates of vaccination among Republicans have in turn led to higher rates of death from Covid-19 in Red States compared to Blue States.

Testing Alternative Explanations of Red Covid

  • A skeptic might argue that this relationship is spurious.

  • There are lots of ways that Red States differ from Blue States – demographics, economics, geography, culture – that might explain the differences in Covid-19 deaths.

Testing Alternative Explanations of Red Covid

  • In Lab 6, we use multiple regression to try and control for the following alternative explanations:

    • Differences in median age
    • Differences in median income
  • To do this, we need to merge in some additional state level data from the census.

Loading data from the Census

If you worked through the instructions here and installed tidycensus and a Census API key on your machine, you should be able to run the following:

acs_df <- tidycensus::get_acs(geography = "state", 
              variables = c(med_income = "B19013_001",
                            med_age = "B01002_001"), 
              year = 2019)

If not, no worries, just uncomment the code below:

# Uncomment if get_acs() doesn't work:
# load(url("https://pols1600.paultesta.org/files/data/acs_df.rda"))

Tidy Census Data

get_acs() returns data whose unit of analysis is a state-variable pair (one row per state per variable)

We need to reshape acs_df using pivot_wider() so that the unit of analysis is a state and each variable is a column

acs_df
# A tibble: 104 × 5
   GEOID NAME       variable   estimate    moe
   <chr> <chr>      <chr>         <dbl>  <dbl>
 1 01    Alabama    med_age        39      0.2
 2 01    Alabama    med_income  50536    304  
 3 02    Alaska     med_age        34.3    0.1
 4 02    Alaska     med_income  77640   1015  
 5 04    Arizona    med_age        37.7    0.2
 6 04    Arizona    med_income  58945    266  
 7 05    Arkansas   med_age        38.1    0.1
 8 05    Arkansas   med_income  47597    328  
 9 06    California med_age        36.5    0.1
10 06    California med_income  75235    232  
# ℹ 94 more rows
acs_df %>%
  mutate(
    state = NAME,
  ) %>%
  select(state, variable, estimate) %>%
  pivot_wider(names_from = variable,
              values_from = estimate) -> acs_df
acs_df
# A tibble: 52 × 3
   state                med_age med_income
   <chr>                  <dbl>      <dbl>
 1 Alabama                 39        50536
 2 Alaska                  34.3      77640
 3 Arizona                 37.7      58945
 4 Arkansas                38.1      47597
 5 California              36.5      75235
 6 Colorado                36.7      72331
 7 Connecticut             41        78444
 8 Delaware                40.6      68287
 9 District of Columbia    34        86420
10 Florida                 42        55660
# ℹ 42 more rows

Merge Census data into Covid data

Now we can merge acs_df into covid_us using the left_join() function

Note

  • In the code, we’ll save the output of joining acs_df to covid_us to a new dataframe called covid_df to avoid potentially duplicating columns
dim(acs_df)
[1] 52  3
dim(covid_us)
[1] 53678    68
# Merge covid_us with acs_df and save as final covid_df file
covid_df <- covid_us %>% left_join(
  acs_df,
  by = c("state" = "state")
)
dim(covid_df)  # Same number of rows as covid_us w/ 2 additional columns
[1] 53678    70

Subset Data

Next, we’ll subset the data to include just the variables we’ll use in the lab, keeping observations from September 23, 2021, using:

  • filter()
  • select()

Note

  • Again, we’ll save this output to a new object called covid_lab
the_vars <- c(
  # Covid variables
  "state","state_po","date","new_deaths_pc_14day", "percent_vaccinated",
  # Election variables
  "winner","rep_voteshare",
  # Demographic variables
  "med_age","med_income","population")


covid_lab <- covid_df %>%
  filter( date == "2021-09-23") %>%
  select(all_of(the_vars)) %>%
  ungroup()

length(the_vars)
[1] 10
dim(covid_lab)
[1] 51 10
covid_lab
# A tibble: 51 × 10
   state       state_po date       new_deaths_pc_14day percent_vaccinated winner
   <chr>       <I<chr>> <date>                   <dbl>              <dbl> <fct> 
 1 Minnesota   MN       2021-09-23               0.222               61.0 Biden 
 2 California  CA       2021-09-23               0.295               60.8 Biden 
 3 Florida     FL       2021-09-23               1.61                58.4 Trump 
 4 Wyoming     WY       2021-09-23               0.938               43.9 Trump 
 5 South Dako… SD       2021-09-23               0.291               52.4 Trump 
 6 Kansas      KS       2021-09-23               0.559               53.9 Trump 
 7 Nevada      NV       2021-09-23               0.700               51.8 Biden 
 8 Virginia    VA       2021-09-23               0.379               62.5 Biden 
 9 Washington  WA       2021-09-23               0.532               63.2 Biden 
10 Oregon      OR       2021-09-23               0.439               62.3 Biden 
# ℹ 41 more rows
# ℹ 4 more variables: rep_voteshare <dbl>, med_age <dbl>, med_income <dbl>,
#   population <int>

Standardized Variables

When numeric variables are measured on different scales, it can be useful to construct standardized measures, sometimes called z-scores

\[z\text{-scores of x} = \frac{x_i - \mu_{x}}{\sigma_x}\]

  • The z-score of Age is

\[z\text{-scores of Age} = \frac{\text{Age}_i - \text{Average Age}}{\text{Standard Deviation of Age}}\]

Note

  • Standardized variables all have a mean of 0 and a standard deviation of 1
  • Standardizing variables helps us interpret coefficients in multiple regressions with predictors measured on different scales
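
Base R’s scale() computes the same z-scores; a minimal sketch, using rep_voteshare from covid_lab as an example:

## Sketch: manual z-scores match scale()
x <- covid_lab$rep_voteshare
z_manual <- (x - mean(x)) / sd(x)
z_scale <- as.numeric(scale(x))  # scale() returns a matrix; drop its attributes
all.equal(z_manual, z_scale)     # TRUE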

Standardizing variables for the lab

Create standardized versions of

  • rep_voteshare
  • med_age
  • med_income
  • percent_vaccinated

Note

  • We’ll use the suffix _std to distinguish the standardized variables from the original variables
covid_lab %>%
  mutate(
    rep_voteshare_std = (rep_voteshare - mean(rep_voteshare)) / sd(rep_voteshare),
    med_age_std = ( med_age - mean( med_age)) / sd( med_age),
    med_income_std = (med_income - mean(med_income)) / sd(med_income),
    percent_vaccinated_std = (percent_vaccinated - mean(percent_vaccinated)) / sd(percent_vaccinated)
  ) -> covid_lab
covid_lab %>%
  summarise(
    mn_rep_vote = mean(rep_voteshare),
    mn_rep_vote_std = round(mean(rep_voteshare_std),2),
    sd_rep_vote = sd(rep_voteshare),
    sd_rep_vote_std = sd(rep_voteshare_std)
  )
# A tibble: 1 × 4
  mn_rep_vote mn_rep_vote_std sd_rep_vote sd_rep_vote_std
        <dbl>           <dbl>       <dbl>           <dbl>
1        49.2               0        12.0               1
covid_lab %>% 
  ggplot(aes(rep_voteshare_std,rep_voteshare))+
  geom_point()

Save Data

Finally, I’ll save the data for Thursday’s lab

# Don't run this code
save(covid_lab, file = "../files/data/06_lab.rda")

And on Thursday, we’ll be able to load covid_lab just by running:

load(url("https://pols1600.paultesta.org/files/data/06_lab.rda"))

Multiple Regression

Conceptual: Multiple Regression

  • Multiple linear regression generalizes simple regression to models with multiple predictors

\[y = \beta_0 + \beta_1x_{1} + \beta_2x_{2} + \dots + \beta_kx_k + \epsilon\]

More compactly written in matrix notation:

\[y = X\beta + \epsilon\]

Where:

\[ \overbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}^{y} = \overbrace{ \begin{bmatrix} 1 & x_{1,1}&\dots & x_{1,k}\\ 1 & x_{2,1}&\dots & x_{2,k}\\ \vdots&\vdots& & \vdots \\ 1 & x_{n,1}&\dots & x_{n,k}\\ \end{bmatrix}}^{X} \overbrace{\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots \\ \beta_k\\ \end{bmatrix}}^{\beta} + \overbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}^{\epsilon} \]

Conceptual: Multiple Regression

  • Regression models partition variance

  • Separate variation in the outcome (\(y\)) into variation explained by the predictors in the model and the residual variation not explained by these predictors

  • Regression coefficients tell us how the outcome \(y\) is expected to change if \(x\) changes by one unit, holding constant or controlling for other predictors in the model.
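
We can verify this partition of variance for a fitted model; a minimal sketch, reusing m2_lab from earlier:

## Sketch: total variation = explained + residual (OLS with an intercept)
y_hat <- fitted(m2_lab)
e <- resid(m2_lab)
y <- y_hat + e                   # outcome for the non-missing observations

TSS <- sum((y - mean(y))^2)      # total sum of squares
ESS <- sum((y_hat - mean(y))^2)  # explained sum of squares
SSR <- sum(e^2)                  # residual sum of squares

all.equal(TSS, ESS + SSR)        # TRUE
ESS / TSS                        # the R-squared reported by summary(m2_lab)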

Practical: Multiple Regression

  • We estimate linear models in R using the lm() function using the + to add predictors

  • We use the * to include the main effects \((\beta_1 x, \beta_2 z)\) and interactions \((\beta_3 (x\cdot z))\) of two predictors

lm(y ~ x + z, data = df)
lm(y ~ x*z, data = df) # Is a shortcut for:
lm(y ~ x + z + x:z, data = df)

Technical: Multiple Regression

  • Simple linear regression chooses a \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to minimize the Sum of Squared Residuals (SSR):

\[\text{Find } \hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0,\, \beta_1}{\arg\min} \sum_i (y_i-(\beta_0+\beta_1x_i))^2\]

  • Multiple linear regression chooses a vector of coefficients \(\hat{\beta}\) to minimize the Sum of Squared Residuals (SSR):

\[\text{Find } \widehat{\beta} = \underset{\beta}{\arg\min} \sum_i \epsilon_i^2 = \epsilon^\prime\epsilon = (y-X\beta)^\prime(y-X\beta)\]
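
The minimizer has the closed form \(\hat{\beta} = (X^\prime X)^{-1}X^\prime y\). A minimal sketch with simulated data, checking the matrix formula against lm():

## Sketch: closed-form OLS vs lm() on simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)             # design matrix with a column of 1s
solve(t(X) %*% X) %*% t(X) %*% y  # beta_hat = (X'X)^{-1} X'y
coef(lm(y ~ x1 + x2))             # same estimates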

Theoretical: Multiple Regression

  • Multiple Linear regression still provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\) where \(Y\) is now a function of multiple predictors, \(X\)

Estimating and Interpreting Multiple Regression

Estimating and Interpreting Multiple Regressions

Let’s load, inspect, and recode some data from the 2016 NES and explore the relationship between political interest and evaluations of presidential candidates:

  • Political Interest: “How interested are you in politics?”
    • Very Interested
    • Somewhat interested
    • Not very Interested
    • Not at all Interested
  • Feeling Thermometer: “… On the feeling thermometer scale of 0 to 100, how would you rate”
    • Donald Trump
    • Hillary Clinton
# Load data from 2016 NES
load(url("https://pols1600.paultesta.org/files/data/nes.rda"))
# Take a quick look at the data
dim(nes)
[1] 1200   14
head(nes)
  caseid    state age gender educ faminc pid7 ideo5 pol_interest church_atd
1    745  Alabama  19      2    3     97    5     3            1          2
2   1115  Alabama  46      1    3      3    1     1            3          6
3    258  Alabama  59      2    2      6    2     3            2          1
4    126  Alabama  55      2    4      6    1     2            3          5
5    414  Alabama  66      1    3      8    7     4            3          3
6    523  Alabama  61      1    2     97    1     2            3          6
  bornagain01 ft_trump ft_hrc vote_choice
1           0        3     19       Other
2           0        0     36         HRC
3           1       22      2        <NA>
4           0        1     80         HRC
5           1      100      3        <NA>
6           1        0    100        <NA>
table(nes$pol_interest)

  0   1   2   3 
 78 178 348 568 
summary(nes$ft_trump)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    2.00   30.00   38.38   72.00  100.00       3 
summary(nes$ft_hrc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    3.00   44.00   42.99   76.00  100.00       1 
nes %>%
  mutate(
    income = ifelse(faminc > 16, NA, faminc),
    interested = ifelse(pol_interest==3,T,F),
    pol_interest_f = factor(case_when(
      pol_interest == 0 ~ "Not at all Interested",
      pol_interest == 1 ~ "Not very Interested",
      pol_interest == 2 ~ "Somewhat Interested",
      pol_interest == 3 ~ "Very Interested"
    )),
    tc_diff = abs(ft_trump - ft_hrc)
  ) -> nes

Estimating and Interpreting Models

Let’s estimate the following models:

\[\text{m1: tc_diff} = \beta_0 + \beta_1\text{interested}+\epsilon \]

\[\text{m2: tc_diff} = \beta_0 + \beta_1\text{pol_interest}+\epsilon\]

\[\text{m3: tc_diff} = \beta_0 + \beta_1\text{pol_interest_f} +\epsilon\]

\[\text{m4: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} +\epsilon\]

\[\text{m5: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} + \beta_3\text{interested} \times \text{age}+\epsilon\]

\[\text{m6: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} +\epsilon\]

\[\text{m7: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} + \beta_3\text{age}\times \text{income}+\epsilon \]

We’ll use:

  • lm() to estimate models
  • coef() and summary() to examine our results
  • htmlreg() to format our results
  • data.frame() to create prediction dataframes
  • predict() to produce predicted values and cbind() to combine these predictions back into the prediction dataframes for plotting
  • ggplot() to display and interpret our models’ predictions

Regression Tables

  • Academic articles are littered with regression tables.

  • Below we’ll see how to:

    • produce regression tables in R
    • use some heuristics to interpret regression tables
  • We’ll cover the why with greater depth and care later in the course. For now, let’s focus on the what and how.

Making Regression Tables in R

We can make a very simple regression table using the htmlreg function from the texreg package

Note

To properly display your results you need to add #| results: asis to the code chunk header, as in the following:

```{r}
#| echo: false
#| results: asis
texreg::htmlreg(
  list(m1_lab,m2_lab, m3_lab,m4_lab,m5_lab)
)
```
Statistical models

                     Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)          19.35***   0.34***    2.42***    84.33***   -0.36
                     (0.39)     (0.00)     (0.30)     (2.66)     (0.20)
winnerBiden          2.70***    -0.05***
                     (0.55)     (0.00)
percent_vaccinated                         -0.03***
                                           (0.01)
rep_voteshare                                         -0.57***   0.02***
                                                      (0.05)     (0.00)
R2                   0.00       0.00       0.44       0.71       0.31
Adj. R2              0.00       0.00       0.43       0.70       0.29
Num. obs.            52755      52449      51         51         51
***p < 0.001; **p < 0.01; *p < 0.05

Interpreting Regression Tables (Stargazing)

  • Each column is a model
  • Each row is a coefficient from that model with its standard error (more to come) in parentheses below
  • We interpret coefficients by looking at their sign, size, and significance
    • Coefficients with asterisks * are statistically significant (more to come)
    • It is unlikely that we would see a coefficient this big or bigger if the true coefficient were 0
  • Rule of thumb:

\[\text{If }\frac{\beta}{se} > 2 \to \text{Statistically Significant}\]
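
A minimal sketch applying this rule to m2_lab from above (summary() stores the ratio in the t value column):

## Sketch: coefficient divided by standard error for each term
tab <- summary(m2_lab)$coefficients
tab[, "Estimate"] / tab[, "Std. Error"]  # reproduces the "t value" column
abs(tab[, "t value"]) > 2                # both terms clear the threshold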

m1: A binary indicator

\[\text{m1: tc_diff} = \beta_0 + \beta_1\text{interested}+\epsilon \]

m1 <- lm(tc_diff ~ interested, nes)
coef(m1)
   (Intercept) interestedTRUE 
      47.84500       12.74655 
mean(nes$tc_diff[nes$interested == F], na.rm=T)
[1] 47.845
mean(nes$tc_diff[nes$interested == T], na.rm=T)
[1] 60.59155
mean(nes$tc_diff[nes$interested == T], na.rm=T) -
  mean(nes$tc_diff[nes$interested == F], na.rm=T)
[1] 12.74655
htmlreg(m1)
Statistical models

                 Model 1
(Intercept)      47.84***
                 (1.26)
interestedTRUE   12.75***
                 (1.81)
R2               0.04
Adj. R2          0.04
Num. obs.        1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df1 <- expand_grid(
  interested = c(F,T)
)

pred_df1  <- cbind(
  pred_df1 ,
  fit = predict(m1, pred_df1 )
)

# Produce figure

pred_df1 %>% 
  ggplot(aes(interested, fit))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)),
    aes(x= interested,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(col="red") -> fig_tc_m1

m2: A numerical predictor

\[\text{m2: tc_diff} = \beta_0 + \beta_1\text{pol_interest}+\epsilon\]

m2 <- lm(tc_diff ~ pol_interest, nes)
round(coef(m2),2)
 (Intercept) pol_interest 
       40.80         6.01 
htmlreg(list(m1,m2))
Statistical models

                 Model 1    Model 2
(Intercept)      47.84***   40.80***
                 (1.26)     (2.35)
interestedTRUE   12.75***
                 (1.81)
pol_interest                6.01***
                            (0.98)
R2               0.04       0.03
Adj. R2          0.04       0.03
Num. obs.        1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df2 <- expand_grid(
  pol_interest = na.omit(sort(unique(nes$pol_interest)))
)

pred_df2  <- cbind(
  pred_df2 ,
  fit = predict(m2, pred_df2 )
)

# Produce figure

pred_df2 %>% 
  ggplot(aes(pol_interest, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(pol_interest)),
    aes(x= pol_interest,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_line(col="red") +
  geom_point(col="red") -> fig_tc_m2

m3: A categorical indicator

\[\text{m3: tc_diff} = \beta_0 + \beta_1\text{pol_interest_f} +\epsilon\]

m3 <- lm(tc_diff ~ pol_interest_f, nes)
round(coef(m3),2)
                      (Intercept) pol_interest_fNot very Interested 
                            46.10                              1.73 
pol_interest_fSomewhat Interested     pol_interest_fVery Interested 
                             2.14                             14.49 
nes %>% 
  group_by(pol_interest_f) %>%
  filter(!is.na(pol_interest_f)) %>% 
  summarise(
    mean = mean(tc_diff,na.rm = T),
    beta = round(mean - coef(m3)[1],3)
  )
# A tibble: 4 × 3
  pol_interest_f         mean  beta
  <fct>                 <dbl> <dbl>
1 Not at all Interested  46.1  0   
2 Not very Interested    47.8  1.73
3 Somewhat Interested    48.2  2.14
4 Very Interested        60.6 14.5 

lm() converts the factor pol_interest_f into binary indicators for every value of pol_interest_f excluding “Not at all Interested”, the first level of the factor.

“Not at all Interested” is the reference category because all the other coefficients describe how the means for other levels of pol_interest_f differ from “Not at all Interested”

cbind(
m3$model[26:30,],
model.matrix(m3)[26:30,]
)
   tc_diff        pol_interest_f (Intercept) pol_interest_fNot very Interested
26      96 Not at all Interested           1                                 0
27      93   Somewhat Interested           1                                 0
28      95       Very Interested           1                                 0
29      75   Somewhat Interested           1                                 0
30       1   Not very Interested           1                                 1
   pol_interest_fSomewhat Interested pol_interest_fVery Interested
26                                 0                             0
27                                 1                             0
28                                 0                             1
29                                 1                             0
30                                 0                             0
htmlreg(list(m1,m2,m3))
Statistical models

                                    Model 1    Model 2    Model 3
(Intercept)                         47.84***   40.80***   46.10***
                                    (1.26)     (2.35)     (3.53)
interestedTRUE                      12.75***
                                    (1.81)
pol_interest                                   6.01***
                                               (0.98)
pol_interest_fNot very Interested                         1.73
                                                          (4.23)
pol_interest_fSomewhat Interested                         2.14
                                                          (3.90)
pol_interest_fVery Interested                             14.49***
                                                          (3.76)
R2                                  0.04       0.03       0.04
Adj. R2                             0.04       0.03       0.04
Num. obs.                           1168       1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df3 <- expand_grid(
  pol_interest_f = na.omit(sort(unique(nes$pol_interest_f)))
)

pred_df3  <- cbind(
  pred_df3 ,
  fit = predict(m3, pred_df3 )
)

# Produce figure

pred_df3 %>% 
  ggplot(aes(pol_interest_f, fit,
             col = pol_interest_f))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(pol_interest_f)),
    aes(x= pol_interest_f,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point() -> fig_tc_m3

m4: A binary and numerical predictor

\[\text{m4: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} +\epsilon\]

m4 <- lm(tc_diff ~ interested + age, nes)
coef(m4)
   (Intercept) interestedTRUE            age 
    32.1757944      9.7739375      0.3533876 
htmlreg(list(m1, m4))
Statistical models

                 Model 1    Model 2
(Intercept)      47.84***   32.18***
                 (1.26)     (2.71)
interestedTRUE   12.75***   9.77***
                 (1.81)     (1.84)
age                         0.35***
                            (0.05)
R2               0.04       0.07
Adj. R2          0.04       0.07
Num. obs.        1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df4 <- expand_grid(
  interested = c(F,T),
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10)
)

pred_df4  <- cbind(
  pred_df4 ,
  fit = predict(m4, pred_df4 )
)

# Produce figure

pred_df4 %>% 
  ggplot(aes(age, fit,
             col = interested,
             group = interested))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() -> fig_tc_m4

m5: An interaction between a binary and numerical predictor

\[\text{m5: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} + \beta_3\text{interested} \times \text{age}+\epsilon\]

m5 <- lm(tc_diff ~ interested*age, nes)
coef(m5)
       (Intercept)     interestedTRUE                age interestedTRUE:age 
        39.7626421         -6.7040163          0.1822814          0.3396523 
htmlreg(list(m1, m4, m5))
Statistical models

                     Model 1    Model 2    Model 3
(Intercept)          47.84***   32.18***   39.76***
                     (1.26)     (2.71)     (3.62)
interestedTRUE       12.75***   9.77***    -6.70
                     (1.81)     (1.84)     (5.55)
age                             0.35***    0.18*
                                (0.05)     (0.08)
interestedTRUE:age                         0.34**
                                           (0.11)
R2                   0.04       0.07       0.08
Adj. R2              0.04       0.07       0.08
Num. obs.            1168       1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df5 <- expand_grid(
  interested = c(F,T),
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10)
)

pred_df5  <- cbind(
  pred_df5 ,
  fit = predict(m5, pred_df5 )
)

# Produce figure

pred_df5 %>% 
  ggplot(aes(age, fit,
             col = interested,
             group = interested))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() -> fig_tc_m5

m6: Two numerical predictors

\[\text{m6: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} +\epsilon\]

m6 <- lm(tc_diff ~ age + income, nes)
coef(m6)
(Intercept)         age      income 
 33.1994121   0.4002682   0.1573334 
htmlreg(list( m6))
Statistical models

              Model 1
(Intercept)   33.20***
              (3.35)
age           0.40***
              (0.06)
income        0.16
              (0.29)
R2            0.04
Adj. R2       0.04
Num. obs.     1049
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df6_age <- expand_grid(
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10),
  # Hold income constant at mean value
  income = mean(nes$income,na.rm=T)
)

pred_df6_income <- expand_grid(
  income = seq(min(nes$income, na.rm = T),
            max(nes$income, na.rm=T),
            length.out = 16),
  # Hold income constant at mean value
  age = mean(nes$age,na.rm=T)
)

pred_df6_age  <- cbind(
  pred_df6_age ,
  fit = predict(m6, pred_df6_age )
)

pred_df6_income  <- cbind(
  pred_df6_income ,
  fit = predict(m6, pred_df6_income )
)
# Produce figure

pred_df6_age %>% 
  ggplot(aes(age, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() +
  labs(title ="Age holding income constant") -> fig_tc_m6_age

pred_df6_income %>% 
  ggplot(aes(income, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= income,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() +
  labs(title = "Income holding age constant") -> fig_tc_m6_income


fig_tc_m6 <- ggpubr::ggarrange(fig_tc_m6_age, fig_tc_m6_income)

m7: Interaction between two numerical predictors

\[ \text{m7: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} + \beta_3\text{age}\times \text{income}+\epsilon \]

m7 <- lm(tc_diff ~ age*income, nes)
coef(m7)
(Intercept)         age      income  age:income 
38.22449671  0.29406374 -0.78888643  0.01993978 
htmlreg(list( m6, m7))
Statistical models

              Model 1    Model 2
(Intercept)   33.20***   38.22***
              (3.35)     (5.80)
age           0.40***    0.29*
              (0.06)     (0.12)
income        0.16       -0.79
              (0.29)     (0.94)
age:income               0.02
                         (0.02)
R2            0.04       0.05
Adj. R2       0.04       0.04
Num. obs.     1049       1049
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df7 <- expand_grid(
  age = seq(19, 95,
            by = 4),
  # Hold income constant at mean value
  income = seq(min(nes$income, na.rm = T),
            max(nes$income, na.rm=T),
            length.out = 16)
)



pred_df7  <- cbind(
  pred_df7 ,
  fit = predict(m7, pred_df7 )
)

# Produce figure

# Marginal effect of age at min, median, and max values of income
pred_df7 %>%
  mutate(
    income_at = case_when(
      income == 1 ~ "01",
      income == median(nes$income,na.rm=T) ~ "05",
      income == 16 ~ "16",
      T ~ NA_character_
    )
  ) %>% 
  filter(!is.na(income_at)) %>% 
  ggplot(aes(age, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(aes(col =income_at,
                 group = income_at))+
  geom_line(aes(col =income_at,
                group = income_at)
            ) -> fig_tc_m7_age

# Marginal effect of income at min, median, and max values of age
pred_df7 %>%
  mutate(
    age_at = case_when(
      age == min(nes$age, na.rm=T) ~ "19",
      age == 47 ~ "47", # close enough ...
      age ==  max(nes$age, na.rm=T) ~ "95",
      T ~ NA_character_
    )
  ) %>% 
  filter(!is.na(age_at)) %>% 
  ggplot(aes(income, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= income,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(aes(col =age_at,
                 group = age_at))+
  geom_line(aes(col =age_at,
                group = age_at)
            ) -> fig_tc_m7_income


fig_tc_m7 <- ggpubr::ggarrange(fig_tc_m7_age, 
                               fig_tc_m7_income,
                               legend = "bottom")

Difference-in-Differences

Motivating Example: What causes Cholera?

  • In the 1800s, cholera was thought to be transmitted through the air.

  • John Snow (the physician, not the snack) explored the origins of the disease, eventually concluding that cholera was transmitted through living organisms in water.

  • He leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.

Notation

Let’s adopt a little notation to help us think about the logic of Snow’s design:

  • \(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)

  • \(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).

  • \(Y_{di}(t)\) the potential outcome of unit \(i\)

    • \(Y_{1i}(t)\) the potential outcome of unit \(i\) when treated between the two periods

    • \(Y_{0i}(t)\) the potential outcome of unit \(i\) when control between the two periods

Causal Effects

The individual causal effect for unit i at time t is:

\[\tau_{it} = Y_{1i}(t) − Y_{0i}(t)\]

What we observe is

\[Y_i(t) = Y_{0i}(t)\cdot(1 − D_i(t)) + Y_{1i}(t)\cdot D_i(t)\]

\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.

In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.

Average Treatment on Treated

Our goal is to estimate the average effect of treatment on treated (ATT):

\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]

That is, we want to compare what happened in Lambeth to what would have happened had its water company not moved its pipes.

Average Treatment on Treated

Our goal is to estimate the average effect of treatment on treated (ATT):

What we can observe is:

                     Pre-Period (T=0)                Post-Period (T=1)
Treated \(D_i=1\)    \(E[Y_{0i}(0)\vert D_i = 1]\)   \(E[Y_{1i}(1)\vert D_i = 1]\)
Control \(D_i=0\)    \(E[Y_{0i}(0)\vert D_i = 0]\)   \(E[Y_{0i}(1)\vert D_i = 0]\)

Data

Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

How can we estimate the effect of moving pumps upstream?

Recall, our goal is to estimate the effect of the treatment on the treated:

\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]

Let’s consider some strategies Snow could have taken to estimate this quantity:

Before vs after comparisons:

  • Snow could have compared Lambeth in 1854 \((E[Y_i(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_i(0)|D_i = 1]=85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.

  • Assumes Lambeth’s pre-treatment outcomes in 1849 are a good proxy for what its outcomes would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).

  • A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

Treatment-Control comparisons in the Post Period.

  • Snow could have compared outcomes between Lambeth and S&V in 1854 (\(E[Y_i(1)|D_i = 1] - E[Y_i(1)|D_i = 0]\)), concluding that the change in pump locations led to 128 fewer deaths.

  • Here the assumption is that the outcomes in S&V in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\)

  • Again, our skeptic could argue Lambeth \(\neq\) S&V

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

Difference in Differences

To address these concerns, Snow employed what we now call a difference-in-differences design.

There are two equivalent ways to view this design.

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(1)|D_{i} = 0]\}}_{\text{1. Treated-Control | Post}} - \overbrace{\{E[Y_{i}(0)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Treated-Control | Pre}}\]

  • Difference 1: Average difference between Treated and Control in the Post Period

  • Difference 2: Average difference between Treated and Control in the Pre Period

Difference in Differences

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(1)|D_{i} = 0]\}}_{\text{1. Treated-Control | Post}} - \overbrace{\{E[Y_{i}(0)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Treated-Control | Pre}}\]

is equivalent to:

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 1]\}}_{\text{1. Post-Pre | Treated}} - \overbrace{\{E[Y_{i}(1)|D_{i} = 0] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Post-Pre | Control}}\]

  • Difference 1: Average change over time for the Treated
  • Difference 2: Average change over time for the Control

Difference in Differences

You’ll see the DiD design represented both ways, but they produce the same result:

\[ \tau_{ATT} = (19-147) - (85-135) = -78 \]

\[ \tau_{ATT} = (19-85) - (147-135) = -78 \]
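
A quick check of this arithmetic in R:

## Treated-Control in the post period, minus Treated-Control in the pre period
(19 - 147) - (85 - 135)
## Post-Pre for the treated, minus Post-Pre for the control
(19 - 85) - (147 - 135)
## Both return -78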

Identifying Assumption of a Difference in Differences Design

The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) − Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) − Y_{0i}(0)|D_i = 0]\)

  • In words: if Lambeth hadn’t moved its pumps, it would have followed a similar path to S&V

Using lm to estimate Diff-in-Diff

cholera_df <- tibble(
  Location = c("S&V","Lambeth","S&V","Lambeth"),
  Treated = c(0,1,0, 1),
  Time = c(0,0,1,1),
  Deaths = c(135,85,147,19)
)
cholera_df
# A tibble: 4 × 4
  Location Treated  Time Deaths
  <chr>      <dbl> <dbl>  <dbl>
1 S&V            0     0    135
2 Lambeth        1     0     85
3 S&V            0     1    147
4 Lambeth        1     1     19

\[ \text{Deaths} = \beta_0 + \beta_1\text{Treated} + \beta_2\text{Time} + \beta_3 \text{Treated}\times\text{Time} \]

diff_in_diff <- lm(Deaths ~ Treated + Time + Treated:Time, cholera_df)
diff_in_diff

Call:
lm(formula = Deaths ~ Treated + Time + Treated:Time, data = cholera_df)

Coefficients:
 (Intercept)       Treated          Time  Treated:Time  
         135           -50            12           -78  

  • \(\beta_0=\) Outcome in control (S&V) before treatment
  • \(\beta_1=\) Fixed, time invariant differences between treated and control
  • \(\beta_2=\) Fixed, unit invariant differences between pre and post periods
  • \(\beta_3=\) Difference-in-Differences = \(E[Y_{1i}(1) - Y_{0i}(1)|D=1]\)
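
As a check on this interpretation, a short sketch recovering the four cell means of the 2x2 table from the coefficients of diff_in_diff:

## Sketch: the coefficients reproduce the table of deaths
b <- coef(diff_in_diff)
b["(Intercept)"]                 # S&V, 1849: 135
b["(Intercept)"] + b["Treated"]  # Lambeth, 1849: 135 - 50 = 85
b["(Intercept)"] + b["Time"]     # S&V, 1854: 135 + 12 = 147
sum(b)                           # Lambeth, 1854: 135 - 50 + 12 - 78 = 19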

Summary

  • A Difference-in-Differences (DiD, or diff-in-diff) design combines a pre-post comparison with a treated-control comparison

  • Differencing twice accounts for fixed differences across units and between periods

    • But not time varying differences across units…
  • The key identifying assumption of a DiD design is the assumption of parallel trends

    • Absent treatment, treated and control groups would see the same changes over time.
    • Hard to prove, possible to test

Extensions and limitations

  • Diff-in-Diff easy to estimate with linear regression
  • Generalizes to multiple periods and treatment interventions
    • More pre-treatment periods allow you to assess the “parallel trends” assumption
  • Alternative methods
    • Synthetic control
    • Event Study Designs
  • What if you have multiple treatments or treatments that come and go?
    • Panel Matching
    • Generalized Synthetic control

Summary

Summary - Linear Regression

  • Linear regression produces linear estimates of the Conditional Expectation Function

  • We estimate linear regression using lm()

m1 <- lm( y ~ x1 + x2 + x3, data = df)
  • We interpret linear regression by looking at the sign, size, and, eventually, significance of coefficients

    • the intercept \((\beta_0)\) corresponds to the model’s prediction when every other predictor is zero
    • the other \(\beta\)s describe how \(y\) changes with a one-unit change in \(x\), controlling for other predictors in the model
  • We present our results using regression tables and figures showing predicted values

Summary - Difference-in-Differences

  • Difference-in-Differences (DiD) is a powerful research design for observational data that combines a pre-post comparison with a treated-control comparison

  • Differencing twice accounts for fixed differences across units and between periods

  • DiD relies on an assumption of parallel trends

  • We can use linear regression to estimate and generalize DiD designs
