In words, this formula says: to calculate the average of \(x\), we sum up all the values of \(x_i\) from observation \(i=1\) to \(i=n\) and then divide by the total number of observations, \(n\)
Mean: Definitional
In this class, I don’t put a lot of weight on memorizing definitions (that’s what Google’s for).
But being comfortable with “the math” is important and useful
Definitional knowledge is a prerequisite for understanding more theoretical claims.
Mean: Theoretical
Suppose I asked you to show that the sum of deviations from the mean equals 0.
Showing the deviations sum to 0 is another way of saying the mean is a balancing point.
This turns out to be a useful property of means that will reappear throughout the course
If I asked you to make a prediction, \(\hat{x}\), of a random person’s height in this class, the mean would have the lowest mean squared error (MSE \(=\frac{1}{n}\sum_{i=1}^n (x_i - \hat{x})^2\))
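Two quick ways to convince yourself of these claims. The balancing-point identity follows in one line:

\[
\sum_{i=1}^n (x_i - \bar{x}) = \sum_{i=1}^n x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0
\]

And the MSE claim can be checked numerically; here is a minimal sketch in R with simulated heights (the numbers are made up and nothing depends on the actual class):

```r
# Numeric check that the sample mean minimizes mean squared error
set.seed(123)
heights <- rnorm(30, mean = 170, sd = 10)         # hypothetical heights in cm
mse <- function(guess) mean((heights - guess)^2)  # MSE for a single prediction
guesses <- seq(160, 180, by = 0.01)               # grid of candidate predictions
guesses[which.min(sapply(guesses, mse))]          # best guess on the grid...
mean(heights)                                     # ...is (approximately) the sample mean
```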
Mean: Theoretical
Occasionally, you’ll read or hear me say things like:
The sample mean is an unbiased estimator of the population mean
In a statistics class, we would take time to prove this.
The sample mean is an unbiased estimator of the population mean
Claim:
Let \(x_1, x_2, \dots, x_n\) be a random sample from a population with mean \(\mu\) and variance \(\sigma^2\)
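We’ll skip the formal proof, but here is a short sketch of the argument, using only the linearity of expectation:

\[
E[\bar{x}] = E\left[\frac{1}{n}\sum_{i=1}^n x_i\right] = \frac{1}{n}\sum_{i=1}^n E[x_i] = \frac{1}{n}\cdot n\mu = \mu
\]

So on average, across repeated samples, the sample mean hits the population mean \(\mu\): that is what “unbiased” means.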
df %>%
  mutate(
    # Turn numeric values into factor labels
    Reincarnation = forcats::as_factor(reincarnation),
    # Order factor in decreasing frequency of levels
    Reincarnation = forcats::fct_infreq(Reincarnation),
    # Reverse order so levels are increasing in frequency
    Reincarnation = forcats::fct_rev(Reincarnation),
    # Rename explanations
    Why = reincarnation_why
  ) -> df
table(recode= df$Reincarnation, original = df$reincarnation)
df %>% # Data
  # Aesthetics
  ggplot(aes(x = Reincarnation, fill = Reincarnation)) +
  # Geometry
  geom_bar(stat = "count") + # Statistic
  ## Include levels of Reincarnation w/ no values
  scale_x_discrete(drop = FALSE) +
  # Don't include a legend
  scale_fill_discrete(drop = FALSE, guide = "none") +
  # Flip x and y
  coord_flip() +
  # Remove lines
  theme_classic() -> fig1
df %>%
  mutate(
    # Create numeric id
    id = 1:n(),
    # Create a label with 3 answers and NA elsewhere
    Label = case_when(
      id == 10 ~ str_wrap(reincarnation_why[10], 30),
      id == 20 ~ str_wrap(reincarnation_why[20], 30),
      id == 18 ~ str_wrap(reincarnation_why[18], 30),
      TRUE ~ NA_character_
    )
  ) -> df
# Calculate totals before calling ggplot
plot_df <- df %>%
  group_by(Reincarnation) %>%
  summarise(
    Count = n(),
    Why = unique(Label)
  ) %>%
  ungroup() %>%
  # Kludge to get rid of NA rows...
  slice(c(1, 2, 3, 5, 7))
plot_df %>%
  ggplot(aes(x = Reincarnation, y = Count, fill = Reincarnation, label = Why)) +
  geom_bar(stat = "identity") + #<<
  ## Include levels of Reincarnation w/ no values
  scale_x_discrete(drop = FALSE) +
  # Don't include a legend
  scale_fill_discrete(drop = FALSE, guide = "none") +
  coord_flip() +
  labs(
    x = "", y = "",
    title = "You're about to be reincarnated.\nWhat do you want to come back as?"
  ) +
  theme_classic() +
  ggrepel::geom_label_repel(
    fill = "white", nudge_y = 1, hjust = "left", size = 3,
    arrow = arrow(length = unit(0.015, "npc"))
  ) +
  scale_y_continuous(
    breaks = c(0, 2, 4, 6, 8, 10, 12),
    expand = expansion(add = c(0, 6))
  ) -> fig1
Data visualization is an iterative process
Data visualization is an iterative process
Good data viz requires lots of data transformations
Start with a minimum working example and build from there
Don’t let the perfect be the enemy of the good enough.
Setup
New packages
This week’s lab we’ll be using the dataverse package to download data on presidential elections
Next week’s lab, we’ll be using the tidycensus package to download census data.
We’ll also need to sign up for and install a Census API key to get the data (see the sketch below).
Here’s a detailed guide of what we’ll do in class right now.
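As a preview, the tidycensus key setup usually amounts to a couple of lines like the sketch below; treat the signup URL and the exact call as things we’ll confirm together in class:

```r
# Request a free key at https://api.census.gov/data/key_signup.html, then store it
# once so future R sessions can find it (replace the placeholder with your own key)
# install.packages("tidycensus")
tidycensus::census_api_key("YOUR_KEY_HERE", install = TRUE)
```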
Conceptually, this lab is designed to help reinforce the relationship between linear models like \(y=\beta_0 + \beta_1x\) and the conditional expectation function \(E[Y|X]\).
Substantively, we will explore David Leonhardt’s claims about “Red Covid”: the political polarization of vaccination and its consequences
Lab: Questions 1-5: Review
Questions 1-5 are designed to reinforce your data wrangling skills. In particular, you will get practice:
Creating and recoding variables using mutate()
Calculating a moving average or rolling mean using the rollmean() function from the zoo package
Transforming the data on presidential elections so that it can be merged with the data on Covid-19 using the pivot_wider() function (see the sketch after this list).
Merging data together using the left_join() function.
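Here is a rough sketch of what the reshaping step might look like. The column names (party_simplified, candidatevotes, totalvotes) are what the MIT data typically uses, but double-check them with names(pres_df); your lab code may differ in the details:

```r
# Illustrative only: reshape 2020 two-party results so each state is one row
pres_df %>%
  filter(year == 2020, party_simplified %in% c("DEMOCRAT", "REPUBLICAN")) %>%
  group_by(state, party_simplified) %>%
  summarise(
    vote_share = sum(candidatevotes) / max(totalvotes), # share of all votes cast
    .groups = "drop"
  ) %>%
  tidyr::pivot_wider(
    names_from = party_simplified, # party names become columns
    values_from = vote_share
  ) -> pres2020_df
```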
Lab: Questions 6-10: Simple Linear Regression
In question 6, you will see how calculating conditional means provides a simple test of the “Red Covid” claim.
In question 7, you will see how a linear model returns the same information as these conditional means (in a slightly different format)
In question 8, you will get practice interpreting linear models with continuous predictors (i.e. predictors that take on a range of values)
In question 9, you will get practice visualizing these models and using the figures to help interpret your results substantively.
Question 10 asks you to play the role of a skeptic and consider what other factors might explain the relationships we found in Questions 6-9. We will explore these factors in next week’s lab.
Guidance
The following slides provide detailed explanations of all the code you’ll need for each question.
Q2.2. asks you to write code that will download data on presidential elections from 1976 to 2020 from the MIT Election Lab’s dataverse
Once you’ve installed the dataverse package you should be able to do this:
# Try this code first
Sys.setenv("DATAVERSE_SERVER" = "dataverse.harvard.edu")
pres_df <- dataverse::get_dataframe_by_name(
  "1976-2020-president.tab",
  "doi:10.7910/DVN/42MVDX"
)
# If the code above fails, comment it out and uncomment the code below:
# load(url("https://pols1600.paultesta.org/files/data/pres_df.rda"))
Q3 Describe the structure of each dataset
Question 3 asks you to describe the structure of each dataset.
Specifically, it asks you to get a high level overview of covid and pres_df and describe the unit of analysis in each dataset:
Describe substantively what specific observation each row in the dataset corresponds to
In the covid dataset, the unit of analysis is a state-date
Q3 Describe the structure of each dataset
Here’s some possible code you could use to get a quick HLO of each dataset:
# check names in `covid`
names(covid)
# take a quick look at the values of each variable
glimpse(covid)
# Look at the first few observations for:
# date, administrative_area_level_2
covid %>%
  select(date, administrative_area_level_2) %>%
  head()
# Summarize data to get a better sense of the unit of observation
covid %>%
  group_by(administrative_area_level_2) %>%
  summarise(
    n = n(), # Number of observations for each state
    start_date = min(date, na.rm = T),
    end_date = max(date, na.rm = T)
  ) -> hlo_covid_df
hlo_covid_df
# How many unique values of date and state are there:
n_dates <- length(unique(covid$date))
n_states <- length(unique(covid$administrative_area_level_2))
n_dates
n_states
# If we had observations for every state on every date, then the number of rows
# in the data
dim(covid)[1]
# should equal
dim(covid)[1] == n_dates * n_states
# This is what economists would call an unbalanced panel
# check names in `pres_df`
names(pres_df)
# take a quick look at the values of each variable
glimpse(pres_df)
# Unit of analysis is a year-state-candidate
pres_df %>%
  select(year, state_po, candidate) %>%
  head()
# How many states?
length(unique(pres_df$state_po))
# How many candidates and parties on the ballot in a given election year
pres_df %>%
  group_by(year) %>%
  summarise(
    n_candidates = length(unique(candidate)),
    # Look at both party_detailed and party_simplified
    n_parties_detailed = length(unique(party_detailed)),
    n_parties_simplified = length(unique(party_simplified))
  ) -> hlo_pres_df
hlo_pres_df
# Look at 2020
# pres_df$candidate[pres_df$year == "2020"]
Q4 Recode the data for analysis
Using our understanding of the structure of the data, Q4 asks you to:
Recode the Covid-19 data like we’ve done before plus
Calculate rolling means, 7 and 14 day averages
Reshape, recode, and filter the presidential election data
This is the same code we’ve used before to create covid_us from covid, with the addition of code to calculate a rolling mean (moving average) of the number of new cases
# Create a vector containing US territories
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
)
# Filter out territories and create state variable
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories) %>%
  mutate(
    state = administrative_area_level_2
  )
# Calculate new cases, new cases per capita, and 7-day average
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000,
    new_cases_pc_7da = zoo::rollmean(new_cases_pc,
      k = 7, align = "right", fill = NA
    )
  ) -> covid_us
# Recode facemask policy
covid_us %>%
  mutate(
    # Recode facial_coverings to create face_masks
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    ),
    # Turn face_masks into a factor with ordered policy levels
    face_masks = factor(face_masks,
      levels = c(
        "No policy", "Recommended", "Some requirements",
        "Required shared places", "Required all times"
      )
    )
  ) -> covid_us
# Create year-month and percent vaccinated variables
covid_us %>%
  mutate(
    year = year(date),
    month = month(date),
    year_month = paste(year, str_pad(month, width = 2, pad = 0), sep = "-"),
    percent_vaccinated = people_fully_vaccinated / population * 100
  ) -> covid_us
# Calculate new cases, new cases per capita, and 7-day average
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000,
    new_cases_pc_7day = zoo::rollmean(new_cases_pc,
      k = 7, align = "right", fill = NA
    )
  ) -> covid_us
covid_us %>%
  filter(date > "2020-03-05", state == "Minnesota") %>%
  select(date, new_cases_pc, new_cases_pc_7day) %>%
  ggplot(aes(date, new_cases_pc)) +
  geom_line(aes(col = "Daily")) +
  # set y aesthetic for second line of rolling average
  geom_line(aes(y = new_cases_pc_7day, col = "7-day average")) +
  theme(legend.position = "bottom") +
  labs(
    col = "Measure",
    y = "New Cases Per 100k", x = "",
    title = "Minnesota"
  ) -> fig_covid_mn
Q5 asks you to merge the 2020 election data from pres2020_df into covid_us by the common state variable in each dataset, using the left_join() function
Make sure the values of state are the same in each dataset
Check for differences in spelling, punctuation, etc.
Check the dimensions of output of your left_join()
If there is a 1-1 match, the number of rows should be the same before and after the merge
Tip
In general (although not in my sample code), you should save the merged data to a new object in R. Saving it back into an existing object can cause issues if you run the merge code multiple times by mistake.
# Should be 51 (50 states + DC) in each
sum(unique(pres_df$state) %in% covid_us$state)
[1] 0
# Look at each state variable
## With [] index
pres_df$state[1:5]
# Matching is case sensitive
# make pres_df$state title case
## Base R:
pres_df$state <- str_to_title(pres_df$state)
## Tidy R:
pres_df %>%
  mutate(
    state = str_to_title(state)
  ) -> pres_df
# Should be 51
sum(unique(pres_df$state) %in% covid_us$state)
[1] 50
# Find the mismatch:
unique(pres_df$state[!pres_df$state %in% covid_us$state])
[1] "District Of Columbia"
# Two equivalent ways to fix this mismatch
## Base R: Quick fix to change spelling of DC
pres_df$state[pres_df$state == "District Of Columbia"] <- "District of Columbia"
## Tidy R: Quick fix to change spelling of DC
pres_df %>%
  mutate(
    state = ifelse(
      test = state == "District Of Columbia",
      yes = "District of Columbia",
      no = state
    )
  ) -> pres_df
# Problem solved
sum(unique(pres_df$state) %in% covid_us$state)
[1] 51
Causal Inference
Causal inference is about counterfactual comparisons
Causal inference is about counterfactual comparisons
What would have happened if some aspect of the world either had or had not been present
Causal Identification
Causal identification refers to “the assumptions needed for statistical estimates to be given a causal interpretation” (Keele 2015)
What do we need to assume to make our claims about cause and effect credible?
Experimental Designs rely on randomization of treatment to justify their causal claims
Observational Designs require additional assumptions and knowledge to make causal claims
Experimental Designs
Experimental designs are studies in which a causal variable of interest, the treatment, is manipulated by the researcher to examine its causal effects on some outcome of interest
Random assignment is the key to causal identification in experiments because it creates statistical independence between treatment and potential outcomes (and any potential confounding factors)
Observational designs are studies in which a causal variable of interest is determined by someone/thing other than the researcher (nature, governments, people, etc.)
Since treatment has not been randomly assigned, observational studies typically require stronger assumptions to make causal claims.
Generally speaking, these assumptions amount to a claim about conditional independence
Where after conditioning on \(K_i\), some knowledge about the world and how the data were generated, our treatment is as good as (as-if) randomly assigned (hence conditionally independent)
Economists often call this the assumption of selection on observables
Causal Inference in Observational Studies
To understand how to make causal claims in observational studies we will:
Introduce the concept of Directed Acyclic Graphs to describe causal relationships
Discuss three approaches to covariate adjustment
Subclassification
Matching
Linear Regression
Three research designs for observational data
Differences-in-Differences
Regression Discontinuity Designs
Instrumental Variables
Directed Acyclic Graphs
Two Ways to Describe Causal Claims
In this course, we will use two forms of notation to describe our causal claims.
Potential Outcomes Notation (last lecture)
Illustrates the fundamental problem of causal inference
Directed Acyclic Graphs (DAGs)
Illustrates potential bias from confounders and colliders
Directed Acyclic Graphs
Directed Acyclic Graphs provide a way of encoding assumptions about causal relationships
Directed Arrows \(\to\) describe a direct causal effect
An arrow from \(D\to Y\) means \(Y_i(d) \neq Y_i(d^\prime)\): “The outcome ( \(Y\) ) for person \(i\) when \(D\) happens ( \(Y_i(d)\) ) is different than the outcome when \(D\) doesn’t happen ( \(Y_i(d^\prime)\) )”
No arrow = no effect ( \(Y_i(d) = Y_i(d^\prime)\) )
Covariate adjustment refers to a broad class of procedures that try to make a comparison more credible or meaningful by adjusting for some other potentially confounding factor.
Covariate Adjustment
When you hear people talk about
Controlling for age
Conditional on income
Holding age and income constant
Ceteris paribus (All else equal)
They are typically talking about some sort of covariate adjustment.
Three approaches to covariate adjustment
Subclassification
Matching
Regression
Causal Identification through Subclassification
Motivation: Treatment, \(D\) is not randomly assigned
The average treatment effect is identified by the observed difference of means between treatment and control conditional on the values of \(X\)
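One common way to write the subclassification estimand (a sketch, for a discrete \(X\)):

\[
\tau_{ATE} = \sum_{x}\Big(E[Y_i \mid D_i = 1, X_i = x] - E[Y_i \mid D_i = 0, X_i = x]\Big)\,Pr(X_i = x)
\]

That is, take the treatment-control difference in means within each stratum of \(X\), then average those differences, weighting by how common each stratum is.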
Causal Identification through Subclassification
Economists call \(Y_i(1),Y_i(0) \perp D_i |X_i\) an assumption of Selection on Observables
Controlling for what we can observe, \(X\), \(D\) is conditionally independent of Potential Outcomes
Violated if there were some other factor, \(U\) that influenced both \(D\) and \(Y\) (i.e. \(U\) is a confounder)
\(0 < Pr(D = 1|X) < 1\) is called an assumption of Common Support
There is a non-zero probability of receiving the treatment for all values of X
Violated if only one subgroup had access to the treatment (e.g. Vaccine by age group comparisons)
Example of Subclassification
We used subclassification when we compared the unconditional rates of new Covid-19 cases by face mask policy to the conditional rates of new cases by policy regime in each month of our data.
Overall rates are misleading.
Lots of things differ between January 2020 and January 2022
Subclassification by month provides a “fairer” comparison
But is it “causal”?
Limits of Subclassification
Even controlling for “month” there are other omitted variables:
Other policies in place?
Socio-economic differences between states
Others?
Trying to subclassify (stratify) comparisons on more than one or two variables gets hard
The Curse of Dimensionality
The Curse of Dimensionality
As we try to control for more factors, the number of observations per dimension declines rapidly
Men vs Women
Men, ages 20-30 vs Men ages 30-40
Men, ages 20-30 with college degrees and blue eyes vs Men ages 20-30 with college degrees and green eyes
Subclassification with more than a few variables will often produce a lack of common support:
Not enough observations to make credible counterfactual comparisons
Matching
Matching refers to a broad set of procedures that essentially try to generalize subclassification to
address the curse of dimensionality
achieve balance on a range of observable covariates between treated and control groups
Matching
Different types of matching procedures:
Exact matching: Find exact matches between treatment and control observations for all covariates \(X\)
Coarsened exact matching: Find approximate matches within ranges of values for \(X\)
Distance-metric matching: Calculate a distance metric between observations based on their values of \(X\), and match treated and control to minimize that distance
Propensity score matching: Calculate the propensity to receive treatment, using \(X\) to predict \(D\), and match treated and control observations based on their propensity scores
Matching procedures like propensity score matching allow us to match treated and control observations based on a propensity score: a predicted probability of receiving the treatment, \(D\), based on observed variables, \(X\).
\[
p(X_i) = Pr(D=1|X_i) = \pi_i
\]
Allowing us to estimate an ATE conditional on \(\pi_i\)
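As a rough illustration of the first step only, a propensity score is typically estimated with a logistic regression. The variable names below (treat, age, income) are hypothetical, and the matching itself would normally be done with a dedicated package (e.g., MatchIt) rather than by hand:

```r
# Minimal sketch: estimate each unit's propensity to receive treatment
# (treat, age, and income are hypothetical columns in df)
ps_model <- glm(treat ~ age + income, data = df, family = binomial)
df$pscore <- predict(ps_model, type = "response") # predicted Pr(D = 1 | X)
summary(df$pscore)
# Treated and control units with similar pscore values would then be matched
```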
The mechanics of matching are beyond the scope of this course
Just think of it as a generalization of subclassification when we want to condition on multiple variables
“Solves” the curse of dimensionality, creating Treatment-Control comparisons between groups that are similar on observed covariates
But no guarantee that matching produces balance on unobserved covariates.
Regression
We will spend the next two weeks talking in detail about regression, in general and linear regression in particular.
Today we’ll introduce some basic notation and simple examples
Conceptually, think of regression as
a tool to make predictions
by fitting lines to data
Theoretically, we will build towards an understanding of linear regression as a “linear estimate of the conditional expectation function” \((CEF = E[Y|X])\)
Three approaches to covariate adjustment
Subclassification
👍: Easy to implement and interpret
👎: Curse of dimensionality, Selection on observables
Matching
👍: Balance on multiple covariates, Mirrors logic of experimental design, Fewer functional form assumptions
👎: Selection on observables, Only provides balance on observed variables, Lots of technical details…
Regression
👍: Easy to implement, control for many factors (good and bad)
👎: Selection on observables, Assumes a linear functional form, easy to fit “bad” models
Simple Linear Regression
Understanding Linear Regression
Conceptual
Simple linear regression estimates “a line of best fit” that summarizes relationships between two variables
\[
y_i = \beta_0 + \beta_1x_i + \epsilon_i
\]
Practical
We estimate linear models in R using the lm() function
lm(y ~ x, data = df)
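For example, here’s a minimal, self-contained run on simulated data (df, x, and y here are made up, not from the lab):

```r
# Simulate data where the true intercept is 1 and the true slope is 2
set.seed(42)
df <- data.frame(x = rnorm(100))
df$y <- 1 + 2 * df$x + rnorm(100)

m1 <- lm(y ~ x, data = df) # fit the linear model
coef(m1)                   # estimated beta_0 (intercept) and beta_1 (slope)
summary(m1)                # fuller output: standard errors, R-squared, etc.
```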
Understanding Linear Regression
Technical/Definitional
Linear regression chooses \(\beta_0\) and \(\beta_1\) to minimize the Sum of Squared Residuals (SSR): \(SSR = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\)
Linear regression provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\)
Conceptual: Linear Regression
Conceptual: Linear Regression
Regression is a tool for describing relationships.
How does some outcome we’re interested in tend to change as some predictor of that outcome changes?
How does economic development vary with democracy?
How does economic development vary with democracy, adjusting for natural resources like oil and gas?
Conceptual: Linear Regression
More formally:
\[
y_i = f(x_i) + \epsilon
\]
Y is a function of X plus some error, \(\epsilon\)
Linear regression assumes that the relationship between an outcome and a predictor can be described by a linear function
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
Linear Regression and the Line of Best Fit
The goal of linear regression is to choose coefficients \(\beta_0\) and \(\beta_1\) that summarize the relationship between \(y\) and \(x\)
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
To accomplish this we need some sort of criteria.
For linear regression, that criterion is minimizing the error between what our model predicts, \(\hat{y_i} = \beta_0 + \beta_1 x_i\), and what we actually observe, \(y_i\)
More on this to come. But first…
Regression Notation
\(y_i\) an outcome variable or thing we’re trying to explain
AKA: The dependent variable, The response Variable, The left hand side of the model
\(x_i\) a predictor variable, or the things we think explain variation in our outcome
AKA: The independent variable, covariates, the right hand side of the model.
Cap or No Cap: I’ll use \(X\) (should be \(\mathbf{X}\)) to denote a set (matrix) of predictor variables. \(y\) vs \(Y\) can also have technical distinctions (Sample vs Population, observed value vs Random Variable, …)
\(\beta\) a set of unknown parameters that describe the relationship between our outcome \(y_i\) and our predictors \(x_i\)
\(\epsilon\) the error term representing variation in \(y_i\) not explained by our model.
Technically \(\epsilon\) refers to theoretical error inherent to the data generating process, while \(\hat{\epsilon}_i\) or \(u_i\) is used to refer to residuals, an estimated error that reflects the difference between the observed (\(y_i\)) and predicted (\(\hat{y}_i\)) values.
Linear Regression
We call this a bivariate regression, because there are only two variables
\[
y_i = \beta_0 + \beta_1 x_i + \epsilon
\]
We call this a linear regression, because \(y_i = \beta_0 + \beta_1 x_i\) is the equation for a line, where:
\(\beta_0\) corresponds to the \(y\) intercept, or the model’s prediction when \(x = 0\).
\(\beta_1\) corresponds to the slope, or how \(y\) is predicted to change as \(x\) changes.
Linear Regression
If you find this notation confusing, try plugging in substantive concepts for what \(y\) and \(x\) represent
Say we wanted to know how attitudes to transgender people varied with age in the baseline survey from Lab 03.
Q: How do we choose \(\beta_0\) and \(\beta_1\)?
P: By minimizing the sum of squared errors, in a procedure called Ordinary Least Squares (OLS) regression
Q: Ok, that’s not really that helpful…
What’s an error?
Why would we square and sum them?
How do we minimize them?
P: Good questions!
What’s an error?
An error, \(\epsilon_i\) is simply the difference between the observed value of \(y_i\) and what our model would predict, \(\hat{y_i}\) given some value of \(x_i\). So for a model:
\[y_i=\beta_0+\beta_1 x_{i} + \epsilon_i\]
We simply subtract our model’s prediction \(\beta_0+\beta_1 x_{i}\) from the observed value, \(y_i\)
In an intro stats course, we would walk through the process of finding
\[\textrm{Find } \hat{\beta_0},\,\hat{\beta_1} = \underset{\beta_0, \beta_1}{\text{arg min}} \sum_i (y_i-(\beta_0+\beta_1x_i))^2\] Which involves a little bit of calculus. The big payoff is that
\[\beta_0 = \bar{y} - \beta_1 \bar{x}\] And
\[ \beta_1 = \frac{Cov(x,y)}{Var(x)}\] Which is never quite the epiphany, I think we think it is…
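If you want to see that payoff concretely, here’s a small sketch (simulated x and y, nothing from the course data) checking the closed-form solution against lm():

```r
# Compute the OLS coefficients "by hand" and compare to lm()
set.seed(1)
x <- rnorm(200)
y <- 3 + 0.5 * x + rnorm(200)
b1_by_hand <- cov(x, y) / var(x)             # slope: Cov(x, y) / Var(x)
b0_by_hand <- mean(y) - b1_by_hand * mean(x) # intercept: ybar - b1 * xbar
c(b0_by_hand, b1_by_hand)
coef(lm(y ~ x))                              # should match
```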
The following slides walk you through the mechanics of this exercise. We’re gonna skip through them in class, but they’re there for your reference
How do we minimize \(\sum \epsilon^2\)
To understand what’s going on under the hood, you need a broad understanding of some basic calculus.
The next few slides provide a brief review of derivatives and differential calculus.
Derivatives
The derivative of \(f\) at \(x\) is its rate of change at \(x\)
For a line: the slope
For a curve: the slope of a line tangent to the curve
The chain rule: the derivative of the “outside” times the derivative of the “inside,” remembering that the derivative of the outside function is evaluated at the value of the inside function.
Finding a Local Minimum
Local minimum:
\[
f^{\prime}(x)=0 \text{ and } f^{\prime\prime}(x)>0
\]
We solve for \(\beta_0\) and \(\beta_1\), by taking the partial derivatives with respect to \(\beta_0\) and \(\beta_1\), and setting them equal to zero
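As a sketch, the two first-order conditions look like this (set each partial derivative of the sum of squared errors to zero):

\[
\begin{aligned}
\frac{\partial}{\partial \beta_0}\sum_i (y_i - \beta_0 - \beta_1 x_i)^2 &= -2\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0 \\
\frac{\partial}{\partial \beta_1}\sum_i (y_i - \beta_0 - \beta_1 x_i)^2 &= -2\sum_i x_i(y_i - \beta_0 - \beta_1 x_i) = 0
\end{aligned}
\]

Solving the first condition for \(\beta_0\) gives \(\beta_0 = \bar{y} - \beta_1\bar{x}\); substituting that into the second and rearranging gives \(\beta_1 = Cov(x,y)/Var(x)\).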
So the coefficient in a simple linear regression of \(Y\) on \(X\) is simply the ratio of the covariance between \(X\) and \(Y\) over the variance of \(X\). Neat!
Theoretical: OLS provides a linear estimate of CEF: E[Y|X]
Linear Regression is a many splendored thing
Timothy Lin provides a great overview of the various interpretations/motivations for linear regression.
Linear regression provides a linear estimate of the conditional expectation function (CEF): \(E[Y|X]\)
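One way to see that connection: when the predictor is categorical, lm()’s fitted values are exactly the conditional (group) means. A minimal sketch with simulated data (the group labels A and B are arbitrary):

```r
# Conditional means vs. lm() with a binary predictor
set.seed(7)
d <- data.frame(group = rep(c("A", "B"), each = 50))
d$y <- ifelse(d$group == "A", 10, 15) + rnorm(100)
aggregate(y ~ group, data = d, FUN = mean) # E[Y | group]
coef(lm(y ~ group, data = d))              # intercept = mean in A; slope = mean(B) - mean(A)
```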
Difference-in-Differences
Motivating Example: What causes Cholera?
In the 1800s, cholera was thought to be transmitted through the air.
John Snow (the physician, not the snack) worked to explore the origins of the disease, eventually concluding that cholera was transmitted through living organisms in water.
Leveraged a natural experiment in which one water company in London moved its pipes further upstream (reducing contamination for Lambeth), while other companies kept their pumps serving Southwark and Vauxhall in the same location.
Notation
Let’s adopt a little notation to help us think about the logic of Snow’s design:
\(D\): treatment indicator, 1 for treated neighborhoods (Lambeth), 0 for control neighborhoods (Southwark and Vauxhall)
\(T\): period indicator, 1 if post treatment (1854), 0 if pre-treatment (1849).
\(Y_{di}(t)\) the potential outcome of unit \(i\) at time \(t\) under treatment status \(d\)
\(Y_{1i}(t)\) the potential outcome of unit \(i\) when treated between the two periods
\(Y_{0i}(t)\) the potential outcome of unit \(i\) when control between the two periods
Causal Effects
The individual causal effect for unit \(i\) at time \(t\) is: \(\tau_{it} = Y_{1i}(t) - Y_{0i}(t)\)
\(D\) only equals 1 when \(T\) equals 1, so we never observe \(Y_{0i}(1)\) for the treated units.
In words, we don’t know what Lambeth’s outcome would have been in the second period, had they not been treated.
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
That is, what would have happened in Lambeth, had their water company not moved their pipes
Average Treatment on Treated
Our goal is to estimate the average effect of treatment on treated (ATT):
What we can observe is:
| | Pre-Period (T=0) | Post-Period (T=1) |
|---|---|---|
| Treated \(D_i=1\) | \(E[Y_{0i}(0)\vert D_i = 1]\) | \(E[Y_{1i}(1)\vert D_i = 1]\) |
| Control \(D_i=0\) | \(E[Y_{0i}(0)\vert D_i = 0]\) | \(E[Y_{0i}(1)\vert D_i = 0]\) |
Data
Because potential outcomes notation is abstract, let’s consider a modified description of Snow’s cholera death data from Scott Cunningham:
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
How can we estimate the effect of moving pumps upstream?
Recall, our goal is to estimate the effect of the treatment on the treated:
\[\tau_{ATT} = E[Y_{1i}(1) - Y_{0i}(1)|D=1]\]
Let’s consider some strategies Snow could take to estimate this quantity:
Before vs after comparisons:
Snow could have compared Lambeth in 1854 \((E[Y_i(1)|D_i = 1] = 19)\) to Lambeth in 1849 \((E[Y_i(0)|D_i = 1]=85)\), and claimed that moving the pumps upstream led to 66 fewer cholera deaths.
Assumes Lambeth’s pre-treatment outcomes in 1849 are a good proxy for what its outcomes would have been in 1854 if the pumps hadn’t moved \((E[Y_{0i}(1)|D_i = 1])\).
A skeptic might argue that Lambeth in 1849 \(\neq\) Lambeth in 1854
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Treatment-Control comparisons in the Post Period.
Snow could have compared outcomes between Lambeth and S&V in 1854 (\(E[Y_i(1)|D_i = 1] - E[Y_i(1)|D_i = 0]\)), concluding that the change in pump locations led to 128 fewer deaths.
Here the assumption is that the outcomes in S&V in 1854 provide a good proxy for what would have happened in Lambeth in 1854 had the pumps not been moved \((E[Y_{0i}(1)|D_i = 1])\)
Again, our skeptic could argue Lambeth \(\neq\) S&V
| Company | 1849 (T=0) | 1854 (T=1) |
|---|---|---|
| Lambeth (D=1) | 85 | 19 |
| Southwark and Vauxhall (D=0) | 135 | 147 |
Difference in Differences
To address these concerns, Snow employed what we now call a difference-in-differences design,
There are two, equivalent ways to view this design.
Difference 1: the average change over time in the treated group
Difference 2: the average change over time in the control group
Difference in Differences
You’ll see the DiD design represented both ways, but they produce the same result:
\[
\tau_{ATT} = (19-147) - (85-135) = -78
\]
\[
\tau_{ATT} = (19-85) - (147-135) = -78
\]
Identifying Assumption of a Difference in Differences Design
The key assumption in this design is what’s known as the parallel trends assumption: \(E[Y_{0i}(1) − Y_{0i}(0)|D_i = 1] = E[Y_{0i}(1) − Y_{0i}(0)|D_i = 0]\)
In words: If Lambeth hadn’t moved its pumps, it would have followed a similar path as S&V
Parallel Trends
Summary
A Difference in Differences (DiD, or diff-in-diff) design combines a pre-post comparison, with a treated and control comparison
Taking the pre-post difference removes any fixed differences between the units
Then taking the difference between treated and control differences removes any common differences over time
The key identifying assumption of a DiD design is the “assumption of parallel trends”
Absent treatment, treated and control groups would see the same changes over time.
Hard to prove, possible to test
Extensions and limitations
Diff-in-Diff is easy to estimate with linear regression (see the sketch after this list)
Generalizes to multiple periods and treatment interventions
More pre-treatment periods allow you to assess the “parallel trends” assumption
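For instance, using the four cell values from the Snow table above, the DiD estimate is just the interaction coefficient in a regression of the outcome on treated, post, and their product. This is a hand-built sketch on the aggregated counts (in practice you’d estimate it on unit-level data):

```r
# Reconstruct the 2x2 table as a tiny data frame and estimate the DiD with lm()
did_df <- data.frame(
  deaths  = c(85, 19, 135, 147),
  treated = c(1, 1, 0, 0), # Lambeth = 1; Southwark & Vauxhall = 0
  post    = c(0, 1, 0, 1)  # 1849 = 0; 1854 = 1
)
coef(lm(deaths ~ treated * post, data = did_df))
# The treated:post coefficient is the difference-in-differences estimate: -78
```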
Alternative methods
Synthetic control
Event Study Designs
What if you have multiple treatments or treatments that come and go?