POLS 1600

Multiple Regression

Updated May 31, 2024

Overview

Class Plan

  • Announcements (20 min)
  • Feedback
  • Review: Simple Linear Regression and Lab 5 (15-20 min)
  • Preview: Setup for Lab 6 (10-15 min)
  • Estimating and Interpreting Multiple Regression (25-30 min)
  • Difference-in-Differences (15-20 min)

Announcements

Assignment 1

  • Prompt posted on website here

  • 3 possible research questions uploaded to Canvas by Next Tuesday, March 12

  • The ideal experiment = “What would you have to randomly assign to create a meaningful counterfactual comparison?”

  • Observational study = “What could you say with data that likely exist?”

  • Feedback will help you narrow down a topic for final project

Assignment 2

  • Prompt posted on website here (https://pols1600.paultesta.org/assignments/a2)

  • Uploaded to Canvas by Friday, March 22

  • Revised Research Question

  • Implied Linear Model

  • Code to load possible data

Feedback

  • Only 9 people completed the survey last week…

  • So let’s try to get those numbers up, and I’ll summarize the results of the survey next week

Packages for today

## Packages for today
the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg",
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse", 
  # Analysis
  "DeclareDesign", "easystats", "zoo"
)

## Define a function to load (and if needed install) packages

#| label = "ipak"
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

## Install (if needed) and load libraries in the_packages
ipak(the_packages)
   kableExtra            DT        texreg     tidyverse     lubridate 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      forcats         haven      labelled         ggmap       ggrepel 
         TRUE          TRUE          TRUE          TRUE          TRUE 
     ggridges      ggthemes        ggpubr        GGally        scales 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      dagitty         ggdag       ggforce       COVID19          maps 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      mapdata           qss    tidycensus     dataverse DeclareDesign 
         TRUE          TRUE          TRUE          TRUE          TRUE 
    easystats           zoo 
         TRUE          TRUE 

Lab 5 & Simple Linear Regression

Review: Key Concepts from the Lab

Review: Simple Linear Regression

Let’s pick up where we left off on Thursday. First, we’ll need to run some code to recreate our data.

# ---- Load data ----
## Covid-19 Data
load(url("https://pols1600.paultesta.org/files/data/covid.rda"))
## Presidential Election Data
load(url("https://pols1600.paultesta.org/files/data/pres_df.rda"))

# ---- Recode Covid Data ----
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
  )

# Filter out Territories and create state variable
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)%>%
  mutate(
    state = administrative_area_level_2
  )

# Calculate new cases, new cases per capita, and 7-day average

covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population *100000,
    new_cases_pc_7day = zoo::rollmean(new_cases_pc, 
                                     k = 7, 
                                     align = "right",
                                     fill=NA )
    ) -> covid_us

# Recode facemask policy

covid_us %>%
mutate(
  # Recode facial_coverings to create face_masks
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times",
    ),
    # Turn face_masks into a factor with ordered policy levels
    face_masks = factor(face_masks,
      levels = c("No policy","Recommended",
                 "Some requirements",
                 "Required shared places",
                 "Required all times")
    ) 
    ) -> covid_us

# Create year-month and percent vaccinated variables

covid_us %>%
  mutate(
    year = year(date),
    month = month(date),
    year_month = paste(year, 
                       str_pad(month, width = 2, pad=0), 
                       sep = "-"),
    percent_vaccinated = people_fully_vaccinated/population*100  
    ) -> covid_us

# Recode Deaths
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_deaths = deaths - lag(deaths),
    new_deaths_pc = new_deaths / population *100000,
    new_deaths_pc_7day = zoo::rollmean(new_deaths_pc, 
                                     k = 7, 
                                     align = "right",
                                     fill=NA ),
    new_deaths_pc_14day = zoo::rollmean(new_deaths_pc, 
                                     k = 14, 
                                     align = "right",
                                     fill=NA )
    ) -> covid_us

# ---- Recode Presidential Election Data ----

# Transform Presidential Election data
pres_df %>%
  mutate(
    year_election = year,
    state = str_to_title(state),
    # Fix DC
    state = ifelse(state == "District Of Columbia", "District of Columbia", state)
  ) %>%
  filter(party_simplified %in% c("DEMOCRAT","REPUBLICAN"))%>%
  filter(year == 2020) %>%
  select(state, state_po, year_election, party_simplified, candidatevotes, totalvotes
         ) %>%
  pivot_wider(names_from = party_simplified,
              values_from = candidatevotes) %>%
  mutate(
    dem_voteshare = DEMOCRAT/totalvotes*100,
    rep_voteshare = REPUBLICAN/totalvotes*100,
    winner = forcats::fct_rev(factor(ifelse(rep_voteshare > dem_voteshare,"Trump","Biden")))
  ) -> pres2020_df

# ---- Merge Data ----

dim(covid_us)
[1] 53678    60
dim(pres2020_df)
[1] 51  9
covid_us <- covid_us %>% left_join(
  pres2020_df,
  by = c("state" = "state")
)
dim(covid_us) # Same number of rows as covid_us w/ 8 additional columns
[1] 53678    68

Calculating Conditional Means

Now let’s revisit question 6, which asked you to calculate some conditional means:

  • Overall
  • Before the vaccine was widely available
  • After the vaccine was widely available
# ---- Deaths: Overall ----
covid_us %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Overall"
  ) -> deaths_overall

# ---- Deaths: Pre Vaccine ----
covid_us %>%
  filter(date < "2021-04-19") %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Pre Vaccine"
  ) -> deaths_pre_vax

# ---- Deaths: Post Vaccine ----
covid_us %>%
  filter(date >= "2021-04-19") %>%
  group_by(winner)%>%
  summarise(
    new_deaths = mean(new_deaths, na.rm=T),
    new_deaths_pc_7day = mean(new_deaths_pc_7day, na.rm=T),
  ) %>% 
  mutate(
    comparison = "Post Vaccine"
  ) -> deaths_post_vax

# ---- Tidy outputs for display ----

deaths_tab <- deaths_overall %>% 
  bind_rows(
  deaths_pre_vax,
  deaths_post_vax
) %>% 
  mutate(
    comparison = factor(
      comparison,
      levels = c("Overall","Pre Vaccine", "Post Vaccine")
      )
  )
knitr::kable(deaths_tab) %>% 
  kableExtra::kable_styling()
winner   new_deaths   new_deaths_pc_7day   comparison
Trump    19.34867     0.3402479            Overall
Biden    22.04853     0.2871843            Overall
Trump    22.82048     0.4000294            Pre Vaccine
Biden    30.61051     0.3801906            Pre Vaccine
Trump    17.07402     0.3016571            Post Vaccine
Biden    16.30683     0.2257111            Post Vaccine
# 1. Data
deaths_tab %>% 
  # 2. Aesthetics
  ggplot(
    aes(winner, new_deaths_pc_7day,
        fill = winner)
  ) +
  # 3. Geometries
  geom_bar(stat = "identity") +
  # 4. Facets
  facet_grid(~ comparison) +
  # 5. Labels and Themes
  guides(fill = "none") +
  labs(
    x = "States won by",
    y = "Average # of Deaths per 100k\n(7-day rolling average)",
    title = "Red States have more Covid-19 deaths per capita after vaccine"
  )+
  theme_minimal()+
  theme(title = element_text(size = 10,face = "bold")) -> fig_q6

Using OLS to estimate conditional means

Question 7 asked you to estimate the following OLS models:

\[ \text{New Deaths} = \beta_0 + \beta_1 \text{Election Winner} + \epsilon \]

\[ \text{7-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Election Winner} + \epsilon \]

Note

Recall winner is a factor whose levels we set to be c("Trump","Biden").

lm() converts factors into binary indicators. Here 0 = "Trump" and 1 = "Biden".
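
We can inspect this coding directly; a quick sketch using model.matrix() to see the design matrix lm() builds:

## Sketch: inspect the dummy coding lm() uses for winner
levels(covid_us$winner)  # "Trump" is the first (reference) level
head(model.matrix(~ winner, data = covid_us))  # winnerBiden is a 0/1 indicator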

m1_lab <- lm(new_deaths ~ winner, covid_us)
m2_lab <- lm(new_deaths_pc_7day ~ winner, covid_us)
m1_lab

Call:
lm(formula = new_deaths ~ winner, data = covid_us)

Coefficients:
(Intercept)  winnerBiden  
      19.35         2.70  
# Just the coefficients
coef(m2_lab)
(Intercept) winnerBiden 
 0.34024787 -0.05306357 
# Coefficients with summary stats (for later)
summary(m2_lab)

Call:
lm(formula = new_deaths_pc_7day ~ winner, data = covid_us)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0016 -0.2205 -0.1340  0.0966  5.8550 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.340248   0.002487  136.83   <2e-16 ***
winnerBiden -0.053064   0.003475  -15.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3978 on 52447 degrees of freedom
  (1229 observations deleted due to missingness)
Multiple R-squared:  0.004427,  Adjusted R-squared:  0.004408 
F-statistic: 233.2 on 1 and 52447 DF,  p-value: < 2.2e-16
texreg::htmlreg(list(m1_lab,m2_lab),
                custom.header = list(
                  "DV:" = 1:2
                ),
                custom.model.names = c(
                  "new_deaths",
                  "new_deaths_pc_7day"
                ))
Statistical models

              DV:
              new_deaths   new_deaths_pc_7day
(Intercept)   19.35***     0.34***
              (0.39)       (0.00)
winnerBiden   2.70***      -0.05***
              (0.55)       (0.00)
R2            0.00         0.00
Adj. R2       0.00         0.00
Num. obs.     52755        52449
***p < 0.001; **p < 0.01; *p < 0.05

Q7: When the CEF is Linear, OLS = CEF

If the CEF is linear, then OLS = CEF

In a saturated linear model, every group in the data can be represented by a coefficient or combination of coefficients in the model.
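
Before plotting, here’s a quick numeric check (a minimal sketch reusing m2_lab from above): the conditional means of the outcome should match the fitted values implied by the coefficients.

## Conditional means of the outcome by winner
covid_us %>%
  group_by(winner) %>%
  summarise(mean_deaths = mean(new_deaths_pc_7day, na.rm = TRUE))

## The same means, recovered from the model
coef(m2_lab)[1]                    # Trump states: beta_0
coef(m2_lab)[1] + coef(m2_lab)[2]  # Biden states: beta_0 + beta_1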

covid_us %>%
  mutate(
    biden01 = ifelse(winner == "Biden",1,0)
  ) %>% 
  ggplot(aes(biden01, new_deaths_pc_7day))+
  geom_jitter(size=.25,alpha=.15)+
  stat_summary(col="red")+
  stat_summary(aes(label=round(..y..,2)),
               geom = "text", 
               position = position_nudge(y = 1),
               col = "red",
               fun = mean) +
  theme_minimal()-> fig_q7cef

fig_q7cef +
  stat_smooth(method = "lm",se = F)+
  stat_summary(aes(label=round(..y..,2)),
               geom = "text", 
               position = position_nudge(y = .05),
               col = "red",
               fun = mean)+
  coord_cartesian(ylim=c(0,.45))+
  geom_segment(aes(
    x = 0,
    xend = 0,
    y = coef(m2_lab)[1],
    yend = coef(m2_lab)[1] + coef(m2_lab)[2]
  ),
  col = "blue",
  linetype = "dashed")+
  geom_segment(aes(
    x = 0,
    xend = 1,
    y = coef(m2_lab)[1] + coef(m2_lab)[2],
    yend = coef(m2_lab)[1] + coef(m2_lab)[2]
  ),
  col = "blue",
  linetype = "dashed")+
  annotate("text",
           label = expression(paste(beta[0]," = " )),
           x = 0, 
           y = coef(m2_lab)[1] + 0.075,
           hjust = "center",
           col = "blue"
           )+
    annotate("text",
           label = expression(paste(beta[1]," = 0.29 - 0.34 = -0.05" )),
           x = 0, 
           y = coef(m2_lab)[1] +coef(m2_lab)[2] - 0.05,
           hjust = "left",
           col = "blue"
           )+
  annotate("text",
           label = expression(
             paste(beta[0]," + ", beta[1], " = " )),
           x = 1, 
           y = coef(m2_lab)[1] + 0.05,
           hjust = "center",
           col = "blue"
           ) -> fig_q7ols

Q8: Vote shares, vaccinations, and deaths

Q8 asked you to fit three models exploring the relationships between vote shares, vaccinations, and deaths:

\[ \text{m3: 14-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Percent Vaccinated} \]

\[ \text{m4: Percent Vaccinated} = \beta_0 + \beta_1 \text{Republican Vote Share} \]

\[ \text{m5: 14-day average of New Deaths (per 100k)} = \beta_0 + \beta_1 \text{Republican Vote Share} \]

We’ll fit each model using data from September 23, 2021, when Leonhardt was writing:

# Deaths modeled by percent vaccinated on 2021-09-23
m3_lab <- lm(new_deaths_pc_14day ~ percent_vaccinated, 
         covid_us,
         subset = date == "2021-09-23")

#  Percent vaccinated modeled by Republican vote share on 2021-09-23
m4_lab <- lm(percent_vaccinated ~ rep_voteshare, 
         covid_us,
         subset = date == "2021-09-23")

# Deaths modeled by Republican vote share on 2021-09-23
m5_lab <- lm(new_deaths_pc_14day ~ rep_voteshare, 
         covid_us,
         subset = date == "2021-09-23")
coef(m3_lab)
       (Intercept) percent_vaccinated 
         2.4232235         -0.0330773 
coef(m4_lab)
  (Intercept) rep_voteshare 
   84.3326265    -0.5731175 
coef(m5_lab)
  (Intercept) rep_voteshare 
  -0.36296799    0.01888996 

Q9: Visualizing Vote Shares and Deaths

Q9 asks you to visualize the results of model m5.

Let’s take a basic figure, and make it fetch!

covid_us %>%
  # Only use observations from September 23, 2021
  filter(date == "2021-09-23") %>%
  # Exclude DC
  filter(state != "District of Columbia") %>%
  # Set aesthetics
  ggplot(aes(x = rep_voteshare,
             y = new_deaths_pc_14day))+
  # Set geometries
  geom_point(aes(col=rep_voteshare),size=2,alpha=.5)+
  # Include the linear regression of lm(new_deaths_pc_14day ~ rep_voteshare)
  geom_smooth(method = "lm", se=F,
              col = "grey", linetype =2) -> fig_m5_lab
fig_m5_lab

fig_m5_lab +
  # two way gradient, blue states -> blue, red states -> red, swing -> grey
  scale_color_gradient2(
    midpoint = 50,
    low = "blue", mid = "grey", high = "red",
    guide = "none")+
  # Vertical line at 50% threshold
  geom_vline(xintercept = 50, 
             col = "grey",linetype = 3)+
  # Add labels
  ggrepel::geom_text_repel(aes(label = state_po), size=2)+
  # theme with minimal lines
  theme_classic()+
  # Labels
  labs(
    x = "Republican Vote Share\n 2020 Presidential Election",
    y = "New Covid-19 Deaths per 100k residents\n (14-day average)",
    title = "Partisan Gaps in Covid-19 Deaths at the State Level",
    subtitle = "Data from Sept. 23, 2021"
  ) -> fig_m5_lab_fetch
fig_m5_lab_fetch

Q10: Alternative Explanations

  • Finally, Q10 asked you to consider some possible alternative explanations or omitted variables that might explain the positive relationship between Republican Vote Share and Covid-19 deaths.

  • In this week’s lab, we’ll see how we can use multiple regression to evaluate these claims.

Previewing Lab 6

Red Covid

The core thesis of Red Covid is something like the following:

Since Covid-19 vaccines became widely available to the general public in the spring of 2021, Republicans have been less likely to get the vaccine. Lower rates of vaccination among Republicans have in turn led to higher rates of death from Covid-19 in Red States compared to Blue States.

Testing Alternative Explanations of Red Covid

  • A skeptic might argue that this relationship is spurious.

  • There are lots of ways that Red States differ from Blue States – demographics, economics, geography, culture – that might explain the differences in Covid-19 deaths.

Testing Alternative Explanations of Red Covid

  • In Lab 6, we use multiple regression to try and control for the following alternative explanations:

    • Differences in median age
    • Differences in median income
  • To do this, we need to merge in some additional state level data from the census.

Loading data from the Census

If you worked through the instructions here and installed tidycensus and a Census API key on your machine, you should be able to run the following:

acs_df <- tidycensus::get_acs(geography = "state", 
              variables = c(med_income = "B19013_001",
                            med_age = "B01002_001"), 
              year = 2019)

If not, no worries, just uncomment the code below:

# Uncomment if get_acs() doesn't work:
# load(url("https://pols1600.paultesta.org/files/data/acs_df.rda"))

Tidy Census Data

get_acs() returns data whose unit of analysis is a state-variable pair (one row per state per variable)

We need to reshape acs_df using pivot_wider() so that the unit of analysis is a state and each variable is a column

acs_df
# A tibble: 104 × 5
   GEOID NAME       variable   estimate    moe
   <chr> <chr>      <chr>         <dbl>  <dbl>
 1 01    Alabama    med_age        39      0.2
 2 01    Alabama    med_income  50536    304  
 3 02    Alaska     med_age        34.3    0.1
 4 02    Alaska     med_income  77640   1015  
 5 04    Arizona    med_age        37.7    0.2
 6 04    Arizona    med_income  58945    266  
 7 05    Arkansas   med_age        38.1    0.1
 8 05    Arkansas   med_income  47597    328  
 9 06    California med_age        36.5    0.1
10 06    California med_income  75235    232  
# ℹ 94 more rows
acs_df %>%
  mutate(
    state = NAME,
  ) %>%
  select(state, variable, estimate) %>%
  pivot_wider(names_from = variable,
              values_from = estimate) -> acs_df
acs_df
# A tibble: 52 × 3
   state                med_age med_income
   <chr>                  <dbl>      <dbl>
 1 Alabama                 39        50536
 2 Alaska                  34.3      77640
 3 Arizona                 37.7      58945
 4 Arkansas                38.1      47597
 5 California              36.5      75235
 6 Colorado                36.7      72331
 7 Connecticut             41        78444
 8 Delaware                40.6      68287
 9 District of Columbia    34        86420
10 Florida                 42        55660
# ℹ 42 more rows

Merge Census data into Covid data

Now we can merge acs_df into covid_us using the left_join() function

Note

  • In the code, we’ll save the output of joining acs_df to covid_us to a new dataframe called covid_df to avoid potentially duplicating columns
dim(acs_df)
[1] 52  3
dim(covid_us)
[1] 53678    68
# Merge covid_us with acs_df and save as final covid_df file
covid_df <- covid_us %>% left_join(
  acs_df,
  by = c("state" = "state")
)
dim(covid_df)  # Same number of rows as covid_us w/ 2 additional columns
[1] 53678    70

Subset Data

Next, we’ll subset the data to include just the variables we’ll use in the lab, keeping observations from September 23, 2021, using:

  • filter()
  • select()

Note

  • Again, we’ll save this output to a new object called covid_lab
the_vars <- c(
  # Covid variables
  "state","state_po","date","new_deaths_pc_14day", "percent_vaccinated",
  # Election variables
  "winner","rep_voteshare",
  # Demographic variables
  "med_age","med_income","population")


covid_lab <- covid_df %>%
  filter( date == "2021-09-23") %>%
  select(all_of(the_vars)) %>%
  ungroup()

length(the_vars)
[1] 10
dim(covid_lab)
[1] 51 10
covid_lab
# A tibble: 51 × 10
   state       state_po date       new_deaths_pc_14day percent_vaccinated winner
   <chr>       <I<chr>> <date>                   <dbl>              <dbl> <fct> 
 1 Minnesota   MN       2021-09-23               0.222               61.0 Biden 
 2 California  CA       2021-09-23               0.295               60.8 Biden 
 3 Florida     FL       2021-09-23               1.61                58.4 Trump 
 4 Wyoming     WY       2021-09-23               0.938               43.9 Trump 
 5 South Dako… SD       2021-09-23               0.291               52.4 Trump 
 6 Kansas      KS       2021-09-23               0.559               53.9 Trump 
 7 Nevada      NV       2021-09-23               0.700               51.8 Biden 
 8 Virginia    VA       2021-09-23               0.379               62.5 Biden 
 9 Washington  WA       2021-09-23               0.532               63.2 Biden 
10 Oregon      OR       2021-09-23               0.439               62.3 Biden 
# ℹ 41 more rows
# ℹ 4 more variables: rep_voteshare <dbl>, med_age <dbl>, med_income <dbl>,
#   population <int>

Standardized Variables

When numeric variables are measured on different scales, it can be useful to construct standardized measures, sometimes called z-scores

\[z\text{-scores of x} = \frac{x_i - \mu_{x}}{\sigma_x}\]

  • The z-score of Age is

\[z\text{-scores of Age} = \frac{\text{Age}_i - \text{Average Age}}{\text{Standard Deviation of Age}}\]

Note

  • Standardized variables all have a mean of 0 and a standard deviation of 1
  • Standardizing variables helps us interpret coefficients in multiple regressions with predictors measured on different scales
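
Base R’s scale() computes the same z-scores; a minimal sketch, using rep_voteshare from covid_lab as an example:

## Sketch: manual z-scores match scale()
x <- covid_lab$rep_voteshare
z_manual <- (x - mean(x)) / sd(x)
z_scale <- as.numeric(scale(x))  # scale() returns a matrix; drop its attributes
all.equal(z_manual, z_scale)     # TRUE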

Standardizing variables for the lab

Create standardized versions of

  • rep_voteshare
  • med_age
  • med_income
  • percent_vaccinated

Note

  • We’ll use the suffix _std to distinguish the standardized variables from the original variables
covid_lab %>%
  mutate(
    rep_voteshare_std = (rep_voteshare - mean(rep_voteshare)) / sd(rep_voteshare),
    med_age_std = ( med_age - mean( med_age)) / sd( med_age),
    med_income_std = (med_income - mean(med_income)) / sd(med_income),
    percent_vaccinated_std = (percent_vaccinated - mean(percent_vaccinated)) / sd(percent_vaccinated)
  ) -> covid_lab
covid_lab %>%
  summarise(
    mn_rep_vote = mean(rep_voteshare),
    mn_rep_vote_std = round(mean(rep_voteshare_std),2),
    sd_rep_vote = sd(rep_voteshare),
    sd_rep_vote_std = sd(rep_voteshare_std)
  )
# A tibble: 1 × 4
  mn_rep_vote mn_rep_vote_std sd_rep_vote sd_rep_vote_std
        <dbl>           <dbl>       <dbl>           <dbl>
1        49.2               0        12.0               1
covid_lab %>% 
  ggplot(aes(rep_voteshare_std,rep_voteshare))+
  geom_point()

Save Data

Finally, I’ll save the data for Thursday’s lab

# Don't run this code
save(covid_lab, file = "../files/data/06_lab.rda")

And on Thursday, we’ll be able to load covid_lab just by running:

load(url("https://pols1600.paultesta.org/files/data/06_lab.rda"))

Multiple Regression

Conceptual: Multiple Regression

  • Multiple linear regression generalizes simple regression to models with multiple predictors

\[y = \beta_0 + \beta_1x_{1} + \beta_2x_{2} + \dots + \beta_kx_k + \epsilon\]

More compactly written in matrix notation:

\[y = X\beta + \epsilon\]

Where:

\[ \overbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}^{y} = \overbrace{ \begin{bmatrix} 1 & x_{1,1}&\dots & x_{1,k}\\ 1 & x_{2,1}&\dots & x_{2,k}\\ \vdots&\vdots& & \vdots \\ 1 & x_{n,1}&\dots & x_{n,k}\\ \end{bmatrix}}^{X} \overbrace{\begin{bmatrix} \beta_0\\ \beta_1\\ \vdots \\ \beta_k\\ \end{bmatrix}}^{\beta} + \overbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}^{\epsilon} \]

Conceptual: Multiple Regression

  • Regression models partition variance

  • Separate variation in the outcome (\(y\)) into variation explained by the predictors in the model and the residual variation not explained by these predictors

  • Regression coefficients tell us how the outcome \(y\) is expected to change if \(x\) changes by one unit, holding constant or controlling for other predictors in the model.
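
We can verify this partition of variance for a fitted model; a minimal sketch, reusing m2_lab from earlier:

## Sketch: total variation = explained + residual (OLS with an intercept)
y_hat <- fitted(m2_lab)
e <- resid(m2_lab)
y <- y_hat + e                   # outcome for the non-missing observations

TSS <- sum((y - mean(y))^2)      # total sum of squares
ESS <- sum((y_hat - mean(y))^2)  # explained sum of squares
SSR <- sum(e^2)                  # residual sum of squares

all.equal(TSS, ESS + SSR)        # TRUE
ESS / TSS                        # the R-squared reported by summary(m2_lab)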

Practical: Multiple Regression

  • We estimate linear models in R using the lm() function using the + to add predictors

  • We use the * to include the main effects \((\beta_1 x, \beta_2 z)\) and interactions \((\beta_3 (x\cdot z))\) of two predictors

lm(y ~ x + z, data = df)
lm(y ~ x*z, data = df) # Is a shortcut for:
lm(y ~ x + z + x:z, data = df)

Technical: Multiple Regression

  • Simple linear regression chooses a \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to minimize the Sum of Squared Residuals (SSR):

\[\text{Find } \hat{\beta}_0, \hat{\beta}_1 = \underset{\beta_0,\, \beta_1}{\arg\min} \sum_i (y_i-(\beta_0+\beta_1x_i))^2\]

  • Multiple linear regression chooses a vector of coefficients \(\hat{\beta}\) to minimize the Sum of Squared Residuals (SSR):

\[\text{Find } \widehat{\beta} = \underset{\beta}{\arg\min} \sum_i \epsilon_i^2 = \epsilon^\prime\epsilon = (y-X\beta)^\prime(y-X\beta)\]
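
The minimizer has the closed form \(\hat{\beta} = (X^\prime X)^{-1}X^\prime y\). A minimal sketch with simulated data, checking the matrix formula against lm():

## Sketch: closed-form OLS vs lm() on simulated data
set.seed(123)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)

X <- cbind(1, x1, x2)             # design matrix with a column of 1s
solve(t(X) %*% X) %*% t(X) %*% y  # beta_hat = (X'X)^{-1} X'y
coef(lm(y ~ x1 + x2))             # same estimates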

Theoretical: Multiple Regression

  • Multiple Linear regression still provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\) where \(Y\) is now a function of multiple predictors, \(X\)

Estimating and Interpreting Multiple Regression

Estimating and Interpreting Multiple Regressions

Let’s load, inspect, and recode some data from the 2016 NES and explore the relationship between political interest and evaluations of presidential candidates:

  • Political Interest: “How interested are you in politics?”
    • Very Interested
    • Somewhat interested
    • Not very Interested
    • Not at all Interested
  • Feeling Thermometer: “… On the feeling thermometer scale of 0 to 100, how would you rate”
    • Donald Trump
    • Hillary Clinton
# Load data from 2016 NES
load(url("https://pols1600.paultesta.org/files/data/nes.rda"))
# Take a quick look at the data
dim(nes)
[1] 1200   14
head(nes)
  caseid    state age gender educ faminc pid7 ideo5 pol_interest church_atd
1    745  Alabama  19      2    3     97    5     3            1          2
2   1115  Alabama  46      1    3      3    1     1            3          6
3    258  Alabama  59      2    2      6    2     3            2          1
4    126  Alabama  55      2    4      6    1     2            3          5
5    414  Alabama  66      1    3      8    7     4            3          3
6    523  Alabama  61      1    2     97    1     2            3          6
  bornagain01 ft_trump ft_hrc vote_choice
1           0        3     19       Other
2           0        0     36         HRC
3           1       22      2        <NA>
4           0        1     80         HRC
5           1      100      3        <NA>
6           1        0    100        <NA>
table(nes$pol_interest)

  0   1   2   3 
 78 178 348 568 
summary(nes$ft_trump)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    2.00   30.00   38.38   72.00  100.00       3 
summary(nes$ft_hrc)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   0.00    3.00   44.00   42.99   76.00  100.00       1 
nes %>%
  mutate(
    income = ifelse(faminc > 16, NA, faminc),
    interested = ifelse(pol_interest==3,T,F),
    pol_interest_f = factor(case_when(
      pol_interest == 0 ~ "Not at all Interested",
      pol_interest == 1 ~ "Not very Interested",
      pol_interest == 2 ~ "Somewhat Interested",
      pol_interest == 3 ~ "Very Interested"
    )),
    tc_diff = abs(ft_trump - ft_hrc)
  ) -> nes

Estimating and Interpreting Models

Let’s estimate the following models:

\[\text{m1: tc_diff} = \beta_0 + \beta_1\text{interested}+\epsilon \]

\[\text{m2: tc_diff} = \beta_0 + \beta_1\text{pol_interest}+\epsilon\]

\[\text{m3: tc_diff} = \beta_0 + \beta_1\text{pol_interest_f} +\epsilon\]

\[\text{m4: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} +\epsilon\]

\[\text{m5: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} + \beta_3\text{interested} \times \text{age}+\epsilon\]

\[\text{m6: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} +\epsilon\]

\[\text{m7: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} + \beta_3\text{age}\times \text{income}+\epsilon \]

We’ll use:

  • lm() to estimate models
  • coef() and summary() to examine our results
  • htmlreg() to format our results
  • data.frame() to create prediction dataframes
  • predict() to produce predicted values and cbind() to combine these predictions back into the prediction dataframes for plotting
  • ggplot() to display and interpret our models’ predictions

Regression Tables

  • Academic articles are littered with regression tables.

  • Below we’ll see how to:

    • produce regression tables in R
    • use some heuristics to interpret regression tables
  • We’ll cover the why with greater depth and care later in the course. For now, let’s focus on the what and how.

Making Regression Tables in R

We can make a very simple regression table using the htmlreg function from the texreg package

Note

To properly display your results you need to add #| results: asis to the code chunk header, as in the following:

```{r}
#| echo: false
#| results: asis
texreg::htmlreg(
  list(m1_lab,m2_lab, m3_lab,m4_lab,m5_lab)
)
```
Statistical models

                     Model 1    Model 2    Model 3    Model 4    Model 5
(Intercept)          19.35***   0.34***    2.42***    84.33***   -0.36
                     (0.39)     (0.00)     (0.30)     (2.66)     (0.20)
winnerBiden          2.70***    -0.05***
                     (0.55)     (0.00)
percent_vaccinated                         -0.03***
                                           (0.01)
rep_voteshare                                         -0.57***   0.02***
                                                      (0.05)     (0.00)
R2                   0.00       0.00       0.44       0.71       0.31
Adj. R2              0.00       0.00       0.43       0.70       0.29
Num. obs.            52755      52449      51         51         51
***p < 0.001; **p < 0.01; *p < 0.05

Interpreting Regression Tables (Stargazing)

  • Each column is a model
  • Each row is a coefficient from that model with its standard error (more to come) in parentheses below
  • We interpret coefficients by looking at their sign, size, and significance
    • Coefficients with asterisks * are statistically significant (more to come)
    • It is unlikely that we would see a coefficient this big or bigger if the true coefficient were 0
  • Rule of thumb:

\[\text{If }\frac{\beta}{se} > 2 \to \text{Statistically Significant}\]
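
A minimal sketch applying this rule to m2_lab from above (summary() stores the ratio in the t value column):

## Sketch: coefficient divided by standard error for each term
tab <- summary(m2_lab)$coefficients
tab[, "Estimate"] / tab[, "Std. Error"]  # reproduces the "t value" column
abs(tab[, "t value"]) > 2                # both terms clear the threshold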

m1: A binary indicator

\[\text{m1: tc_diff} = \beta_0 + \beta_1\text{interested}+\epsilon \]

m1 <- lm(tc_diff ~ interested, nes)
coef(m1)
   (Intercept) interestedTRUE 
      47.84500       12.74655 
mean(nes$tc_diff[nes$interested == F], na.rm=T)
[1] 47.845
mean(nes$tc_diff[nes$interested == T], na.rm=T)
[1] 60.59155
mean(nes$tc_diff[nes$interested == T], na.rm=T) -
  mean(nes$tc_diff[nes$interested == F], na.rm=T)
[1] 12.74655
htmlreg(m1)
Statistical models

                 Model 1
(Intercept)      47.84***
                 (1.26)
interestedTRUE   12.75***
                 (1.81)
R2               0.04
Adj. R2          0.04
Num. obs.        1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df1 <- expand_grid(
  interested = c(F,T)
)

pred_df1  <- cbind(
  pred_df1 ,
  fit = predict(m1, pred_df1 )
)

# Produce figure

pred_df1 %>% 
  ggplot(aes(interested, fit))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)),
    aes(x= interested,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(col="red") -> fig_tc_m1

m2: A numerical predictor

\[\text{m2: tc_diff} = \beta_0 + \beta_1\text{pol_interest}+\epsilon\]

m2 <- lm(tc_diff ~ pol_interest, nes)
round(coef(m2),2)
 (Intercept) pol_interest 
       40.80         6.01 
htmlreg(list(m1,m2))
Statistical models

                 Model 1    Model 2
(Intercept)      47.84***   40.80***
                 (1.26)     (2.35)
interestedTRUE   12.75***
                 (1.81)
pol_interest                6.01***
                            (0.98)
R2               0.04       0.03
Adj. R2          0.04       0.03
Num. obs.        1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df2 <- expand_grid(
  pol_interest = na.omit(sort(unique(nes$pol_interest)))
)

pred_df2  <- cbind(
  pred_df2 ,
  fit = predict(m2, pred_df2 )
)

# Produce figure

pred_df2 %>% 
  ggplot(aes(pol_interest, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(pol_interest)),
    aes(x= pol_interest,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_line(col="red") +
  geom_point(col="red") -> fig_tc_m2

m3: A categorical indicator

\[\text{m3: tc_diff} = \beta_0 + \beta_1\text{pol_interest_f} +\epsilon\]

m3 <- lm(tc_diff ~ pol_interest_f, nes)
round(coef(m3),2)
                      (Intercept) pol_interest_fNot very Interested 
                            46.10                              1.73 
pol_interest_fSomewhat Interested     pol_interest_fVery Interested 
                             2.14                             14.49 
nes %>% 
  group_by(pol_interest_f) %>%
  filter(!is.na(pol_interest_f)) %>% 
  summarise(
    mean = mean(tc_diff,na.rm = T),
    beta = round(mean - coef(m3)[1],3)
  )
# A tibble: 4 × 3
  pol_interest_f         mean  beta
  <fct>                 <dbl> <dbl>
1 Not at all Interested  46.1  0   
2 Not very Interested    47.8  1.73
3 Somewhat Interested    48.2  2.14
4 Very Interested        60.6 14.5 

lm() converts the factor pol_interest_f into binary indicators for every value of pol_interest_f excluding “Not at all Interested”, the first level of the factor.

“Not at all Interested” is the reference category because all the other coefficients describe how the means for other levels of pol_interest_f differ from “Not at all Interested”

cbind(
m3$model[26:30,],
model.matrix(m3)[26:30,]
)
   tc_diff        pol_interest_f (Intercept) pol_interest_fNot very Interested
26      96 Not at all Interested           1                                 0
27      93   Somewhat Interested           1                                 0
28      95       Very Interested           1                                 0
29      75   Somewhat Interested           1                                 0
30       1   Not very Interested           1                                 1
   pol_interest_fSomewhat Interested pol_interest_fVery Interested
26                                 0                             0
27                                 1                             0
28                                 0                             1
29                                 1                             0
30                                 0                             0
htmlreg(list(m1,m2,m3))
Statistical models

                                    Model 1    Model 2    Model 3
(Intercept)                         47.84***   40.80***   46.10***
                                    (1.26)     (2.35)     (3.53)
interestedTRUE                      12.75***
                                    (1.81)
pol_interest                                   6.01***
                                               (0.98)
pol_interest_fNot very Interested                         1.73
                                                          (4.23)
pol_interest_fSomewhat Interested                         2.14
                                                          (3.90)
pol_interest_fVery Interested                             14.49***
                                                          (3.76)
R2                                  0.04       0.03       0.04
Adj. R2                             0.04       0.03       0.04
Num. obs.                           1168       1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df3 <- expand_grid(
  pol_interest_f = na.omit(sort(unique(nes$pol_interest_f)))
)

pred_df3  <- cbind(
  pred_df3 ,
  fit = predict(m3, pred_df3 )
)

# Produce figure

pred_df3 %>% 
  ggplot(aes(pol_interest_f, fit,
             col = pol_interest_f))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(pol_interest_f)),
    aes(x= pol_interest_f,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point() -> fig_tc_m3

m4: A binary and numerical predictor

\[\text{m4: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} +\epsilon\]

m4 <- lm(tc_diff ~ interested + age, nes)
coef(m4)
   (Intercept) interestedTRUE            age 
    32.1757944      9.7739375      0.3533876 
htmlreg(list(m1, m4))
Statistical models

                 Model 1    Model 2
(Intercept)      47.84***   32.18***
                 (1.26)     (2.71)
interestedTRUE   12.75***   9.77***
                 (1.81)     (1.84)
age                         0.35***
                            (0.05)
R2               0.04       0.07
Adj. R2          0.04       0.07
Num. obs.        1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df4 <- expand_grid(
  interested = c(F,T),
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10)
)

pred_df4  <- cbind(
  pred_df4 ,
  fit = predict(m4, pred_df4 )
)

# Produce figure

pred_df4 %>% 
  ggplot(aes(age, fit,
             col = interested,
             group = interested))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() -> fig_tc_m4

m5: An interaction between a binary and numerical predictor

\[\text{m5: tc_diff} = \beta_0 + \beta_1\text{interested} + \beta_2\text{age} + \beta_3\text{interested} \times \text{age}+\epsilon\]

m5 <- lm(tc_diff ~ interested*age, nes)
coef(m5)
       (Intercept)     interestedTRUE                age interestedTRUE:age 
        39.7626421         -6.7040163          0.1822814          0.3396523 
htmlreg(list(m1, m4, m5))
Statistical models

                     Model 1    Model 2    Model 3
(Intercept)          47.84***   32.18***   39.76***
                     (1.26)     (2.71)     (3.62)
interestedTRUE       12.75***   9.77***    -6.70
                     (1.81)     (1.84)     (5.55)
age                             0.35***    0.18*
                                (0.05)     (0.08)
interestedTRUE:age                         0.34**
                                           (0.11)
R2                   0.04       0.07       0.08
Adj. R2              0.04       0.07       0.08
Num. obs.            1168       1168       1168
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df5 <- expand_grid(
  interested = c(F,T),
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10)
)

pred_df5  <- cbind(
  pred_df5 ,
  fit = predict(m5, pred_df5 )
)

# Produce figure

pred_df5 %>% 
  ggplot(aes(age, fit,
             col = interested,
             group = interested))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(interested)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() -> fig_tc_m5

m6: Two numerical predictors

\[\text{m6: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} +\epsilon\]

m6 <- lm(tc_diff ~ age + income, nes)
coef(m6)
(Intercept)         age      income 
 33.1994121   0.4002682   0.1573334 
htmlreg(list( m6))
Statistical models

              Model 1
(Intercept)   33.20***
              (3.35)
age           0.40***
              (0.06)
income        0.16
              (0.29)
R2            0.04
Adj. R2       0.04
Num. obs.     1049
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df6_age <- expand_grid(
  age = seq(min(nes$age, na.rm = T),
            max(nes$age, na.rm=T),
            length.out = 10),
  # Hold income constant at mean value
  income = mean(nes$income,na.rm=T)
)

pred_df6_income <- expand_grid(
  income = seq(min(nes$income, na.rm = T),
            max(nes$income, na.rm=T),
            length.out = 16),
  # Hold income constant at mean value
  age = mean(nes$age,na.rm=T)
)

pred_df6_age  <- cbind(
  pred_df6_age ,
  fit = predict(m6, pred_df6_age )
)

pred_df6_income  <- cbind(
  pred_df6_income ,
  fit = predict(m6, pred_df6_income )
)
# Produce figure

pred_df6_age %>% 
  ggplot(aes(age, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() +
  labs(title ="Age holding income constant") -> fig_tc_m6_age

pred_df6_income %>% 
  ggplot(aes(income, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= income,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point()+
  geom_line() +
  labs(title = "Income holding age constant") -> fig_tc_m6_income


fig_tc_m6 <- ggpubr::ggarrange(fig_tc_m6_age, fig_tc_m6_income)

m7: Interaction between two numerical predictors

\[ \text{m7: tc_diff} = \beta_0 + \beta_1\text{age} + \beta_2\text{income} + \beta_3\text{age}\times \text{income}+\epsilon \]

m7 <- lm(tc_diff ~ age*income, nes)
coef(m7)
(Intercept)         age      income  age:income 
38.22449671  0.29406374 -0.78888643  0.01993978 
htmlreg(list( m6, m7))
Statistical models

              Model 1    Model 2
(Intercept)   33.20***   38.22***
              (3.35)     (5.80)
age           0.40***    0.29*
              (0.06)     (0.12)
income        0.16       -0.79
              (0.29)     (0.94)
age:income               0.02
                         (0.02)
R2            0.04       0.05
Adj. R2       0.04       0.04
Num. obs.     1049       1049
***p < 0.001; **p < 0.01; *p < 0.05
# Create Prediction Data Frame

pred_df7 <- expand_grid(
  age = seq(19, 95,
            by = 4),
  # Hold income constant at mean value
  income = seq(min(nes$income, na.rm = T),
            max(nes$income, na.rm=T),
            length.out = 16)
)



pred_df7  <- cbind(
  pred_df7 ,
  fit = predict(m7, pred_df7 )
)

# Produce figure

# Marginal effect of age at min, median, and max values of income
pred_df7 %>%
  mutate(
    income_at = case_when(
      income == 1 ~ "01",
      income == median(nes$income,na.rm=T) ~ "05",
      income == 16 ~ "16",
      T ~ NA_character_
    )
  ) %>% 
  filter(!is.na(income_at)) %>% 
  ggplot(aes(age, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= age,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(aes(col =income_at,
                 group = income_at))+
  geom_line(aes(col =income_at,
                group = income_at)
            ) -> fig_tc_m7_age

# Marginal effect of income at min, median, and max values of age
pred_df7 %>%
  mutate(
    age_at = case_when(
      age == min(nes$age, na.rm=T) ~ "19",
      age == 47 ~ "47", # close enough ...
      age ==  max(nes$age, na.rm=T) ~ "95",
      T ~ NA_character_
    )
  ) %>% 
  filter(!is.na(age_at)) %>% 
  ggplot(aes(income, fit,
             ))+
  # Add raw data
  geom_point(
    data = nes %>% 
      filter(!is.na(income)) %>%
      filter(!is.na(age))
      ,
    aes(x= income,
        y= tc_diff),
    alpha = .1,
    size = .2
  ) +
  geom_point(aes(col =age_at,
                 group = age_at))+
  geom_line(aes(col =age_at,
                group = age_at)
            ) -> fig_tc_m7_income


fig_tc_m7 <- ggpubr::ggarrange(fig_tc_m7_age, 
                               fig_tc_m7_income,
                               legend = "bottom")

Difference-in-Differences

Motivating Example: What causes Cholera?

  • In the 1800s, cholera was thought to be transmitted through the air.

  • John Snow (the physician, not the snack) explored the origins of the disease, eventually concluding that cholera was transmitted through living organisms in water.

  • He leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.

Notation

Let’s adopt a little notation to help us think about the logic of Snow’s design:

  • \(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)

  • \(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).

  • \(Y_{di}(t)\) the potential outcome of unit \(i\)

    • \(Y_{1i}(t)\) the potential outcome of unit \(i\) when treated between the two periods

    • \(Y_{0i}(t)\) the potential outcome of unit \(i\) when control between the two periods

Causal Effects

The individual causal effect for unit i at time t is:

\[\tau_{it} = Y_{1i}(t) − Y_{0i}(t)\]

What we observe is

\[Y_i(t) = Y_{0i}(t)\cdot(1 − D_i(t)) + Y_{1i}(t)\cdot D_i(t)\]

\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.

In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.

Average Treatment on Treated

Our goal is to estimate the average effect of treatment on treated (ATT):

\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]

That is, we want to compare what happened in Lambeth to what would have happened had its water company not moved its pipes.

Average Treatment on Treated

Our goal is to estimate the average effect of treatment on treated (ATT):

What we can observe is:

                     Pre-Period (T=0)                Post-Period (T=1)
Treated \(D_i=1\)    \(E[Y_{0i}(0)\vert D_i = 1]\)   \(E[Y_{1i}(1)\vert D_i = 1]\)
Control \(D_i=0\)    \(E[Y_{0i}(0)\vert D_i = 0]\)   \(E[Y_{0i}(1)\vert D_i = 0]\)

Data

Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

How can we estimate the effect of moving pumps upstream?

Recall, our goal is to estimate the effect of the treatment on the treated:

\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]

Let’s consider some strategies Snow could have taken to estimate this quantity:

Before vs after comparisons:

  • Snow could have compared Lambeth in 1854 \((E[Y_i(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_i(0)|D_i = 1]=85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.

  • Assumes Lambeth’s pre-treatment outcomes in 1849 are a good proxy for what its outcomes would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).

  • A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

Treatment-Control comparisons in the Post Period.

  • Snow could have compared outcomes between Lambeth and S&V in 1854 (\(E[Y_i(1)|D_i = 1] - E[Y_i(1)|D_i = 0]\)), concluding that the change in pump locations led to 128 fewer deaths.

  • Here the assumption is that the outcomes in S&V in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\)

  • Again, our skeptic could argue Lambeth \(\neq\) S&V

Company                        1849 (T=0)   1854 (T=1)
Lambeth (D=1)                  85           19
Southwark and Vauxhall (D=0)   135          147

Difference in Differences

To address these concerns, Snow employed what we now call a difference-in-differences design.

There are two equivalent ways to view this design.

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(1)|D_{i} = 0]\}}_{\text{1. Treated-Control | Post}} - \overbrace{\{E[Y_{i}(0)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Treated-Control | Pre}}\]

  • Difference 1: Average difference between Treated and Control in the Post Period

  • Difference 2: Average difference between Treated and Control in the Pre Period

Difference in Differences

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(1)|D_{i} = 0]\}}_{\text{1. Treated-Control | Post}} - \overbrace{\{E[Y_{i}(0)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Treated-Control | Pre}}\]

is equivalent to:

\[\underbrace{\{E[Y_{i}(1)|D_{i} = 1] - E[Y_{i}(0)|D_{i} = 1]\}}_{\text{1. Post-Pre | Treated}} - \overbrace{\{E[Y_{i}(1)|D_{i} = 0] - E[Y_{i}(0)|D_{i} = 0]\}}^{\text{2. Post-Pre | Control}}\]

  • Difference 1: Average change over time for the Treated
  • Difference 2: Average change over time for the Control

Difference in Differences

You’ll see the DiD design represented both ways, but they produce the same result:

\[ \tau_{ATT} = (19-147) - (85-135) = -78 \]

\[ \tau_{ATT} = (19-85) - (147-135) = -78 \]
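
A quick check of this arithmetic in R:

## Treated-Control in the post period, minus Treated-Control in the pre period
(19 - 147) - (85 - 135)
## Post-Pre for the treated, minus Post-Pre for the control
(19 - 85) - (147 - 135)
## Both return -78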

Identifying Assumption of a Difference in Differences Design

The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) − Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) − Y_{0i}(0)|D_i = 0]\)

  • In words: if Lambeth hadn’t moved its pumps, it would have followed a similar path to S&V

Using lm to estimate Diff-in-Diff

cholera_df <- tibble(
  Location = c("S&V","Lambeth","S&V","Lambeth"),
  Treated = c(0,1,0, 1),
  Time = c(0,0,1,1),
  Deaths = c(135,85,147,19)
)
cholera_df
# A tibble: 4 × 4
  Location Treated  Time Deaths
  <chr>      <dbl> <dbl>  <dbl>
1 S&V            0     0    135
2 Lambeth        1     0     85
3 S&V            0     1    147
4 Lambeth        1     1     19

\[ \text{Deaths} = \beta_0 + \beta_1\text{Treated} + \beta_2\text{Time} + \beta_3 \text{Treated}\times\text{Time} \]

diff_in_diff <- lm(Deaths ~ Treated + Time + Treated:Time, cholera_df)
diff_in_diff

Call:
lm(formula = Deaths ~ Treated + Time + Treated:Time, data = cholera_df)

Coefficients:
 (Intercept)       Treated          Time  Treated:Time  
         135           -50            12           -78  

  • \(\beta_0=\) Outcome in control (S&V) before treatment
  • \(\beta_1=\) Fixed, time invariant differences between treated and control
  • \(\beta_2=\) Fixed, unit invariant differences between pre and post periods
  • \(\beta_3=\) Difference-in-Differences = \(E[Y_{1i}(1) - Y_{0i}(1)|D=1]\)
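
As a check on this interpretation, a short sketch recovering the four cell means of the 2x2 table from the coefficients of diff_in_diff:

## Sketch: the coefficients reproduce the table of deaths
b <- coef(diff_in_diff)
b["(Intercept)"]                 # S&V, 1849: 135
b["(Intercept)"] + b["Treated"]  # Lambeth, 1849: 135 - 50 = 85
b["(Intercept)"] + b["Time"]     # S&V, 1854: 135 + 12 = 147
sum(b)                           # Lambeth, 1854: 135 - 50 + 12 - 78 = 19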

Summary

  • A Difference-in-Differences (DiD, or diff-in-diff) design combines a pre-post comparison with a treated-control comparison

  • Differencing twice accounts for fixed differences across units and between periods

    • But not time varying differences across units…
  • The key identifying assumption of a DiD design is the assumption of parallel trends

    • Absent treatment, treated and control groups would see the same changes over time.
    • Hard to prove, possible to test

Extensions and limitations

  • Diff-in-Diff easy to estimate with linear regression
  • Generalizes to multiple periods and treatment interventions
    • More pre-treatment periods allow you to assess the “parallel trends” assumption
  • Alternative methods
    • Synthetic control
    • Event Study Designs
  • What if you have multiple treatments or treatments that come and go?
    • Panel Matching
    • Generalized Synthetic control

Summary

Summary - Linear Regression

  • Linear regression produces linear estimates of the Conditional Expectation Function

  • We estimate linear regression using lm()

m1 <- lm( y ~ x1 + x2 + x3, data = df)
  • We interpret linear regression by looking at the sign, size, and, eventually, significance of coefficients

    • the intercept \((\beta_0)\) corresponds to the model’s prediction when every other predictor is zero
    • the other \(\beta\)s describe how \(y\) changes with a one-unit change in \(x\), controlling for other predictors in the model
  • We present our results using regression tables and figures showing predicted values

Summary - Difference-in-Differences

  • Difference-in-Differences (DiD) is a powerful research design for observational data that combines a pre-post comparison with a treated-control comparison

  • Differencing twice accounts for fixed differences across units and between periods

  • DiD relies on an assumption of parallel trends

  • We can use linear regression to estimate and generalize DiD designs
