Q9 asks you to visualize the results of the model m5
Let’s take a basic figure, and make it fetch!
covid_us %>%
  # Only use observations from September 23, 2021
  filter(date == "2021-09-23") %>%
  # Exclude DC
  filter(state != "District of Columbia") %>%
  # Set aesthetics
  ggplot(aes(x = rep_voteshare, y = new_deaths_pc_14day)) +
  # Set geometries
  geom_point(aes(col = rep_voteshare), size = 2, alpha = .5) +
  # Include the linear regression of lm(new_deaths_pc_14day ~ rep_voteshare)
  geom_smooth(method = "lm", se = F, col = "grey", linetype = 2) -> fig_m5_lab
fig_m5_lab
fig_m5_lab +
  # Two-way gradient: blue states -> blue, red states -> red, swing -> grey
  scale_color_gradient2(
    midpoint = 50,
    low = "blue", mid = "grey", high = "red",
    guide = "none"
  ) +
  # Vertical line at 50% threshold
  geom_vline(xintercept = 50, col = "grey", linetype = 3) +
  # Add state labels
  ggrepel::geom_text_repel(aes(label = state_po), size = 2) +
  # Theme with minimal lines
  theme_classic() +
  # Labels
  labs(
    x = "Republican Vote Share\n 2020 Presidential Election",
    y = "New Covid-19 Deaths per 100k residents\n (14-day average)",
    title = "Partisan Gaps in Covid-19 Deaths at the State Level",
    subtitle = "Data from Sept. 23, 2021"
  ) -> fig_m5_lab_fetch
fig_m5_lab_fetch
Q10: Alternative Explanations
Finally, Q10 asked you to consider possible alternative explanations or omitted variables that might explain the positive relationship between Republican vote share and Covid-19 deaths.
In this week’s lab, we’ll see how we can use multiple regression to evaluate these claims.
Previewing Lab 6
Red Covid
The core thesis of Red Covid is something like the following:
Since Covid-19 vaccines became widely available to the general public in the spring of 2021, Republicans have been less likely to get the vaccine. Lower rates of vaccination among Republicans have in turn led to higher rates of death from Covid-19 in Red States compared to Blue States.
Testing Alternative Explanations of Red Covid
A skeptic might argue that this relationship is spurious.
There are lots of ways that Red States differ from Blue States – demographics, economics, geography, culture – that might explain the differences in Covid-19 deaths.
Testing Alternative Explanations of Red Covid
In Lab 6, we use multiple regression to try and control for the following alternative explanations:
Differences in median age
Differences in median income
To do this, we need to merge in some additional state level data from the census.
Loading data from the Census
If you worked through the instructions here and installed tidycensus and a Census API key on your machine, you should be able to run the following:
acs_df <- tidycensus::get_acs(
  geography = "state",
  variables = c(
    med_income = "B19013_001",
    med_age = "B01002_001"
  ),
  year = 2019
)
If not, no worries, just uncomment the code below:
# Uncomment if get_acs() doesn't work:
# load(url("https://pols1600.paultesta.org/files/data/acs_df.rda"))
Now we can merge acs_df into covid_us using the left_join() function.
Note
In the code, we’ll save the output of joining acs_df to covid_us to a new data frame called covid_df, to avoid duplicating columns if the join is run more than once.
dim(acs_df)
[1] 52 3
dim(covid_us)
[1] 53678 68
# Merge covid_us with acs_df and save the result as covid_df
covid_df <- covid_us %>%
  left_join(
    acs_df,
    by = c("state" = "state")
  )
dim(covid_df) # Same number of rows as covid_us, w/ 2 additional columns
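With the census measures merged in, the model Lab 6 builds toward might look like the sketch below. The merged column names med_income and med_age (and the model name m_lab6) are assumptions based on the get_acs() call above, not code from the lab itself:

# A sketch, not the lab's actual code: regress Covid-19 deaths on
# Republican vote share plus the merged census controls
m_lab6 <- covid_df %>%
  filter(date == "2021-09-23") %>% # same cross-section as Q9
  lm(new_deaths_pc_14day ~ rep_voteshare + med_income + med_age, data = .)
summary(m_lab6)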
Multiple regression separates variation in the outcome (\(y\)) into variation explained by the predictors in the model and the residual variation not explained by these predictors.
Regression coefficients tell us how the outcome \(y\) is expected to change if \(x\) changes by one unit, holding constant or controlling for other predictors in the model.
Practical: Multiple Regression
We estimate linear models in R with the lm() function, using + to add predictors.
We use * to include both the main effects \((\beta_1 x, \beta_2 z)\) and the interaction \((\beta_3 (x \cdot z))\) of two predictors.
lm(y ~ x + z, data = df)
lm(y ~ x*z, data = df) # Is a shortcut for:
lm(y ~ x + z + x:z, data = df)
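In equation form, the interacted model is:

\[
y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 (x \cdot z) + \epsilon
\]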
Technical: Multiple Regression
Simple linear regression chooses a \(\hat{\beta_0}\) and \(\hat{\beta_1}\) to minimize the Sum of Squared Residuals (SSR):

\[
SSR = \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2
\]
Multiple linear regression still provides a linear estimate of the conditional expectation function (CEF), \(E[Y|X]\), where \(Y\) is now a function of multiple predictors, \(X\).
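With \(k\) predictors, the \(\hat{\beta}\)s are still chosen to minimize the SSR:

\[
SSR = \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1}x_{1i} - \dots - \hat{\beta_k}x_{ki})^2
\]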
Let’s load, inspect, and recode some data from the 2016 NES and explore the relationship between political interest and evaluations of presidential candidates:
Political Interest: “How interested are you in politics?”
Very Interested
Somewhat Interested
Not very Interested
Not at all Interested
Feeling Thermometer: “… On the feeling thermometer scale of 0 to 100, how would you rate”
Donald Trump
Hillary Clinton
# Load data from 2016 NES
load(url("https://pols1600.paultesta.org/files/data/nes.rda"))
nes %>%
  group_by(pol_interest_f) %>%
  filter(!is.na(pol_interest_f)) %>%
  summarise(
    mean = mean(tc_diff, na.rm = T),
    beta = round(mean - coef(m3)[1], 3)
  )
# A tibble: 4 × 3
pol_interest_f mean beta
<fct> <dbl> <dbl>
1 Not at all Interested 46.1 0
2 Not very Interested 47.8 1.73
3 Somewhat Interested 48.2 2.14
4 Very Interested 60.6 14.5
lm() converts the factor pol_interest_f into binary indicators for every value of pol_interest_f excluding “Not at all Interested”, the first level of the factor.
“Not at all Interested” is the reference category because all the other coefficients describe how the means for other levels of pol_interest_f differ from “Not at all Interested”
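As a quick check that these coefficients are just differences in group means, we can compare them to coef(m3). The sketch below assumes m3 was fit as lm(tc_diff ~ pol_interest_f, data = nes), which these slides imply but do not show:

# Assumed specification for m3 (not shown above):
m3 <- lm(tc_diff ~ pol_interest_f, data = nes)
coef(m3) # intercept = mean for "Not at all Interested";
         # other coefficients = each group's mean minus that intercept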
cbind(m3$model[26:30, ], model.matrix(m3)[26:30, ])

| | tc_diff | pol_interest_f | (Intercept) | Not very Interested | Somewhat Interested | Very Interested |
|---|---|---|---|---|---|---|
| 26 | 96 | Not at all Interested | 1 | 0 | 0 | 0 |
| 27 | 93 | Somewhat Interested | 1 | 0 | 1 | 0 |
| 28 | 95 | Very Interested | 1 | 0 | 0 | 1 |
| 29 | 75 | Somewhat Interested | 1 | 0 | 1 | 0 |
| 30 | 1 | Not very Interested | 1 | 1 | 0 | 0 |

(The indicator column names are abbreviated from pol_interest_fNot very Interested, etc.)
htmlreg(list(m1, m2, m3))

Statistical models

| | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| (Intercept) | 47.84*** (1.26) | 40.80*** (2.35) | 46.10*** (3.53) |
| interestedTRUE | 12.75*** (1.81) | | |
| pol_interest | | 6.01*** (0.98) | |
| pol_interest_fNot very Interested | | | 1.73 (4.23) |
| pol_interest_fSomewhat Interested | | | 2.14 (3.90) |
| pol_interest_fVery Interested | | | 14.49*** (3.76) |
| R2 | 0.04 | 0.03 | 0.04 |
| Adj. R2 | 0.04 | 0.03 | 0.04 |
| Num. obs. | 1168 | 1168 | 1168 |

***p < 0.001; **p < 0.01; *p < 0.05 (standard errors in parentheses)
# Create Prediction Data Frame
pred_df3 <- expand_grid(
  pol_interest_f = na.omit(sort(unique(nes$pol_interest_f)))
)
pred_df3 <- cbind(
  pred_df3,
  fit = predict(m3, pred_df3)
)

# Produce figure
pred_df3 %>%
  ggplot(aes(pol_interest_f, fit, col = pol_interest_f)) +
  # Add raw data
  geom_point(
    data = nes %>% filter(!is.na(pol_interest_f)),
    aes(x = pol_interest_f, y = tc_diff),
    alpha = .1, size = .2
  ) +
  geom_point() -> fig_tc_m3
(Intercept)          age       income   age:income 
38.22449671   0.29406374  -0.78888643   0.01993978 
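Using the coefficients above, the marginal effect (slope) of age depends on income:

\[
\frac{\partial \hat{y}}{\partial \text{age}} = 0.294 + 0.020 \times \text{income}
\]

so at income = 1 the slope on age is roughly \(0.294 + 0.020 \times 1 \approx 0.31\), while at income = 16 it is roughly \(0.294 + 0.020 \times 16 \approx 0.61\).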
htmlreg(list(m6, m7))

Statistical models

| | Model 1 | Model 2 |
|---|---|---|
| (Intercept) | 33.20*** (3.35) | 38.22*** (5.80) |
| age | 0.40*** (0.06) | 0.29* (0.12) |
| income | 0.16 (0.29) | -0.79 (0.94) |
| age:income | | 0.02 (0.02) |
| R2 | 0.04 | 0.05 |
| Adj. R2 | 0.04 | 0.04 |
| Num. obs. | 1049 | 1049 |

***p < 0.001; **p < 0.01; *p < 0.05 (standard errors in parentheses)
# Create Prediction Data Frame
pred_df7 <- expand_grid(
  age = seq(19, 95, by = 4),
  # Vary income over its observed range
  income = seq(min(nes$income, na.rm = T),
               max(nes$income, na.rm = T),
               length.out = 16)
)
pred_df7 <- cbind(
  pred_df7,
  fit = predict(m7, pred_df7)
)

# Produce figure
# Marginal effect of age at min, median, and max values of income
pred_df7 %>%
  mutate(
    income_at = case_when(
      income == 1 ~ "01",
      income == median(nes$income, na.rm = T) ~ "05",
      income == 16 ~ "16",
      T ~ NA_character_
    )
  ) %>%
  filter(!is.na(income_at)) %>%
  ggplot(aes(age, fit)) +
  # Add raw data
  geom_point(
    data = nes %>% filter(!is.na(income)) %>% filter(!is.na(age)),
    aes(x = age, y = tc_diff),
    alpha = .1, size = .2
  ) +
  geom_point(aes(col = income_at, group = income_at)) +
  geom_line(aes(col = income_at, group = income_at)) -> fig_tc_m7_age

# Marginal effect of income at min, median, and max values of age
pred_df7 %>%
  mutate(
    age_at = case_when(
      age == min(nes$age, na.rm = T) ~ "19",
      age == 47 ~ "47", # close enough ...
      age == max(nes$age, na.rm = T) ~ "95",
      T ~ NA_character_
    )
  ) %>%
  filter(!is.na(age_at)) %>%
  ggplot(aes(income, fit)) +
  # Add raw data
  geom_point(
    data = nes %>% filter(!is.na(income)) %>% filter(!is.na(age)),
    aes(x = income, y = tc_diff),
    alpha = .1, size = .2
  ) +
  geom_point(aes(col = age_at, group = age_at)) +
  geom_line(aes(col = age_at, group = age_at)) -> fig_tc_m7_income

# Combine the two panels
fig_tc_m7 <- ggpubr::ggarrange(fig_tc_m7_age, fig_tc_m7_income,
                               legend = "bottom")
Difference-in-Differences
Motivating Example: What causes Cholera?
In the 1800s, cholera was thought to be transmitted through the air.
John Snow (the physician, not the snack) set out to explore the origins of the disease, eventually concluding that cholera was transmitted through living organisms in water.
Leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.
Notation
Let’s adopt a little notation to help us think about the logic of Snow’s design:
\(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)
\(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).
\(Y_{di}(t)\): the potential outcome of unit \(i\) at time \(t\) under treatment status \(d\)
\(Y_{1i}(t)\): the potential outcome of unit \(i\) at time \(t\) if treated between the two periods
\(Y_{0i}(t)\): the potential outcome of unit \(i\) at time \(t\) if untreated between the two periods
Causal Effects
The individual causal effect for unit \(i\) at time \(t\) is:

\[
\tau_{it} = Y_{1i}(t) - Y_{0i}(t)
\]
\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.
In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
That is, what would have happened in Lambeth, had their water company not moved their pipes
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
What we can observe is:
| | Pre-Period (T=0) | Post-Period (T=1) |
|---|---|---|
| Treated \(D_i=1\) | \(E[Y_{0i}(0)\vert D_i = 1]\) | \(E[Y_{1i}(1)\vert D_i = 1]\) |
| Control \(D_i=0\) | \(E[Y_{0i}(0)\vert D_i = 0]\) | \(E[Y_{0i}(1)\vert D_i = 0]\) |
Data
Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
How can we estimate the effect of moving pumps upstream?
Recall, our goal is to estimate the effect of the treatment on the treated:
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
Let’s consider some strategies Snow could have taken to estimate this quantity:
Before vs after comparisons:
Snow could have compared Lambeth in 1854 \((E[Y_{1i}(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_{0i}(0)|D_i = 1] = 85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.
This assumes Lambeth’s pre-treatment outcome in 1849 is a good proxy for what its outcome would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).
A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Treatment-Control comparisons in the Post Period.
Snow could have compared outcomes between Lambeth and S&V in 1854 \((E[Y_{1i}(1)|D_i = 1] - E[Y_{0i}(1)|D_i = 0])\), concluding that the change in pump locations led to 128 fewer deaths.
Here the assumption is that S&V’s outcomes in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\).
Again, our skeptic could argue Lambeth \(\neq\) S&V
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Difference in Differences
To address these concerns, Snow employed what we now call a difference-in-differences design.
There are two equivalent ways to view this design:
Difference 1: the average change among the Treated over time
Difference 2: the average change among the Control over time
Difference in Differences
You’ll see the DiD design represented both ways, but they produce the same result.

Differencing across groups, then over time:

\[
\tau_{ATT} = (19-147) - (85-135) = -128 - (-50) = -78
\]

Differencing over time, then across groups:

\[
\tau_{ATT} = (19-85) - (147-135) = -66 - 12 = -78
\]
Identifying Assumption of a Difference in Differences Design
The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) − Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) − Y_{0i}(0)|D_i = 0]\)
In words: If Lambeth hadn’t moved its pumps, it would have followed a similar path as S&V
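Spelling this out, parallel trends lets us replace the unobserved counterfactual with observed quantities, so the double difference identifies the ATT:

\[
\tau_{ATT} = \underbrace{E[Y_{1i}(1)|D_i=1] - E[Y_{0i}(0)|D_i=1]}_{\text{change among treated}} - \underbrace{E[Y_{0i}(1)|D_i=0] - E[Y_{0i}(0)|D_i=0]}_{\text{change among control}}
\]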
Linear regression produces linear estimates of the Conditional Expectation Function
We estimate linear regression using lm()
m1 <- lm(y ~ x1 + x2 + x3, data = df)
We interpret linear regressions by looking at the sign, size, and (eventually) statistical significance of the coefficients:
the intercept \((\beta_0)\) corresponds to the model’s prediction when every other predictor is zero
the other \(\beta\)s describe how \(y\) is expected to change with a one-unit change in \(x\), controlling for the other predictors in the model
We present our results using regression tables and figures showing predicted values
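The prediction figures above all follow the same general recipe; here is a minimal sketch for the generic model m1 above (df, x1, x2, and x3 are the placeholder names from that call):

# A sketch of the predicted-values workflow: vary one predictor,
# hold the others at their means, then predict
library(tidyr) # for expand_grid()
pred_df <- expand_grid(
  x1 = seq(min(df$x1, na.rm = TRUE), max(df$x1, na.rm = TRUE), length.out = 20),
  x2 = mean(df$x2, na.rm = TRUE),
  x3 = mean(df$x3, na.rm = TRUE)
)
pred_df$fit <- predict(m1, newdata = pred_df)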
Summary - Difference-in-Differences
Difference-in-Differences (DiD) is a powerful research design for observational data that combines a pre-post comparison with a treated-control comparison.
Differencing twice accounts for fixed differences across units and between periods
DiD relies on an assumption of parallel trends
We can use linear regression to estimate and generalize DiD designs
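As a closing sketch, here is one way to recover Snow's DiD estimate with lm(), using a small data frame built from the cholera table above (did_df and its column names are illustrative, not from the slides):

# Four company-period cells from the cholera death table
did_df <- data.frame(
  deaths  = c(85, 19, 135, 147),
  treated = c(1, 1, 0, 0), # 1 = Lambeth
  post    = c(0, 1, 0, 1)  # 1 = 1854
)
# The coefficient on treated:post is the DiD estimate (-78)
coef(lm(deaths ~ treated * post, data = did_df))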