POLS 1600

Probably too much Probability

Updated Jan 13, 2025

Overview

Class Plan

Announcements (5 min)
Feedback (5 min)
Class plan
- Probability (10 min)
- Conditional Probability (10 min)
- Probability Distributions (10 min)
- Expected Values and Variances (10 min)
- Standard Errors (10 min)
- Previewing Lab 8 (10 min)

Annoucements: Assignment 2

Full prompt here

A revised description of your group’s research project
A description of a linear model implied by your question
R code that loads some potentially relevant data to your question and at least one descriptive summary of that data.
Some information about your group such as:
- A group name¹
- A group color or color scheme
- A group motto, mascot, crest, etc.
- Your group’s theme song
- Your group’s astrological sign
- Anything else that you think well help you form strong ingroup bounds that facilitate collaboration

Setup: Packages for today

## Pacakges for today
the_packages <- c(
  ## R Markdown
  "kableExtra","DT","texreg","htmltools",
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes", "ggpubr", 
  "GGally", "scales", "dagitty", "ggdag", "ggforce",
  # Data 
  "COVID19","maps","mapdata","qss","tidycensus", "dataverse", 
  # Analysis
  "DeclareDesign", "easystats", "zoo"
)

## Define a function to load (and if needed install) packages

ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

## Install (if needed) and load libraries in the_packages
ipak(the_packages)

   kableExtra            DT        texreg     htmltools     tidyverse 
         TRUE          TRUE          TRUE          TRUE          TRUE 
    lubridate       forcats         haven      labelled         ggmap 
         TRUE          TRUE          TRUE          TRUE          TRUE 
      ggrepel      ggridges      ggthemes        ggpubr        GGally 
         TRUE          TRUE          TRUE          TRUE          TRUE 
       scales       dagitty         ggdag       ggforce       COVID19 
         TRUE          TRUE          TRUE          TRUE          TRUE 
         maps       mapdata           qss    tidycensus     dataverse 
         TRUE          TRUE          TRUE          TRUE          TRUE 
DeclareDesign     easystats           zoo 
         TRUE          TRUE          TRUE

Feedback

What did we like

What did we dislike

Our advice for POLS 1600

Our Fashion Advice

Option	Choice	Votes
Jacket	Tuxedo	9
Palette	Spring (Warm + Light)	10
Pant	Khakis	6
Patterns	How 'bout a fun graphic or print	7
Shoe	Crocs	9
Tie	Bow tie	12
Top	Sports jersey	8

Naive professor in class survey: "Help me with my fit"
Students: "Ok, bet. Tuxedo, crocs, and a jersey" https://t.co/NPmBZhsfen pic.twitter.com/cWlytcZ12h
— Paul Testa ((ProfPaulTesta?)) March 17, 2023

Probability

Probability describes the likelihood of an event happening.
Statistics uses probability to quantify uncertainty about estimates and hypotheses.
To do this, we will need to understand:
- Definitions (experiment, sample space, events)
- Three rules of probability (Kolmogorov axioms)
- Two interpretations interpreting probabilities (Frequentist and Bayesian)

Experiments, sample spaces, sets, and events

In probability theory, an experiment describes a repeatable process where the outcome is uncertain
- Processes where the outcomes are uncertain are called non-deterministic or stochastic
The sample space of an experiment is the set \((\Omega\) “omega”, or \(S\)) of all the possible outcomes of an experiment

Experiments, sample spaces, sets, and events

Sets can be:
- empty \(( A: \{\emptyset\})\)
- a single event \(( Coin: \{\text{Heads}\})\)
- multiple events \(( Odd\, \#s: \{\text{1,3,5}\})\)
- infinite \((\mathbb{R}: \text{ The set of real numbers}\{ -\infty \dots +\infty\}\))
An event, \((E\) or \(A)\) is a subset of outcomes in the sample space
- The sample space for a coin flip is \(\Omega = \{\text{Heads, Tails}\}\)
- The event Heads is a subset of \(\Omega\)

Subsets

Subset: Let \(D\) be the set outcomes for a 6-side die: \(D=\{1,2,3,4,5,6\}\)
- \(Primes=\{2,3,5\}\)
- \(Primes \subset D \iff \forall X \in Primes, X \in D\)

Unions, Intersections, and Complements

Unions
- \(A \cup B = \{X:X \in A \lor X \in B \}\)
- Either \(A\), \(B\) or both \(A\) and \(B\) occur
Intersections
- \(A \cap B = \{X:X \in A \land X \in B \}\)
- Both \(A\) and \(B\) occur
Complements
- \(A'=A^\complement = \{X:X\notin A\}\)
- \(A'=A^\complement\) means \(A\) does not occur
- \(\emptyset^\complement=S\) and \(S^\complement=\emptyset\)

Source

Three Rules of Probability

Probability is defined by three rules or assumptions called the Kolmogorov Axioms

Positivity: The probability of any event \(A\) is nonnegative

\[Pr(A) \geq 0 \]

Certainty: The probability that one of the outcomes in the sample space occurs is 1

\[Pr(\Omega) = 1 \]

Additivity: If events \(A\) and \(B\) are mutually exclusive, then:

\[Pr(A \text{ or } B) = Pr(A) + Pr(B)\]

The Addition Rule

For events, \(A\) and \(B\), the addition rule says we can find the probability of either \(A\) or \(B\) occurring:

\[Pr(A \cup B) = Pr(A \text{ or } B) = Pr(A) + Pr(B) - \underbrace{Pr(A \text{ and } B)}_{\text{aka } Pr(A \cap B)}\]

In words: The probability of either A or B occurring is the probability that A occurs plus the probability that B occurs - minus the probability that both occur (so that we’re not double counting…)

Source

The Law of Total Probability (Part 1)

For any event two events, \(A\) and \(B\), the probability of \(A\) \((Pr(A))\) can be decomposed into the sum of the probabilities of two mutually exclusive events:

\[Pr(A) = Pr(A \text{ and } B) + Pr(A \text{ and } B^{\complement})\]

Source

Two interpretations of probablity

Probabilities are defined by these three axioms
The are two broad ways of interpreting what probabilities mean:
- Frequentist
- Bayesian

Frequentist interpretations of probability

Probability describes how likely it is that some event happens.
- Flip a fair coin, the probability of heads is Pr(Heads) = 0.5
Frequentist: view this probability as the limit of the relative frequency of an event over repeated trials.

\[Pr(E) = \lim_{n \to \infty} \frac{n_{E}}{n} \approx \frac{ \text{# of Times E happened}}{\text{Total # of Trials}}\]

Thinking about probability as a relative frequency, requires us to know how to count the number of times an event occurred (see also)

Frequentist interpretations of probability

Probabilities from a Frequentist perspective are defined by fixed and unknown parameters
The goal of statistics for a frequentist is to learn about these parameters from data.
Frequentist statistics often ask questions like “What is the probability of observing some data \(Y\), given a hypothesis about the true value of parameter(s), \(\theta\), that generated it.

Frequentist interpretations of probability

For example, suppose we wanted to test whether a coin is “fair” \((p = Pr(Heads) = .5; q = Pr(Tails) = 1-p = .5).\) We could:

Flip a fair coin 10 times. Our estimate of the \(Pr(H)\) is the number of heads divided by 10. It could be 0.5, but also 0 or 1, or some number in between.
Flip a coin 100 times and our estimate will be closer to the true \(paramter\).
Flip a coin an \(\infty\) amount of times and the relative frequency will converge to the true parameter \((Pr(H) = \lim_{n \to \infty} \frac{n_{H}}{n} = p = 0.5 \text{ for a fair coin})\)

Bayesian interpretations of probability

Frequentist interpretations make sense for describing processes that we could easily repeat (e.g. Coin flips, Surveys, Experiments)
But feel more convoluted when trying to describe events like “the probability of that Biden wins reelection.”
Bayesian interpretations of probability view probabilities as subjective beliefs.
The task for a Bayesian statistics is to update these prior beliefs () based on a model of the likelihood of observing some data to form new beliefs after observing the data (called the posterior beliefs).

Bayesian Updating

Bayesians update their beliefs according to Bayes Rule, which says:

\[\text{posterior} \propto \text{likelihood} \times \text{prior}\] More formally:

\[\underbrace{Pr(\theta|Y)}_{\text{Posterior}} \propto \underbrace{Pr(Y|\theta)}_{\text{Likelihood}}) \times \underbrace{Pr(\theta)}_{\text{Prior}}\]

Bayesian vs Frequentists

Our two main tools for doing statistical inference in this course

Hypothesis Testing
Interval Estimation

Follow largely from frequentist interpretations of probability

Bayesian vs Frequentists

The differences between Bayesian and Frequentist frameworks, are both philosophical and technical in nature

Is probability a relative frequency or subjective belief? How do we form and use prior beliefs
Bayesian statistics relies heavily on algorhithms for Markov Chain Monte-Carlo simulations made possible by advances in computing.

For most of the questions in this course, these two frameworks will yield similar (even identical) conclusions.

Sometimes it’s helpful to think like a Bayesian, others, like a frequentist

Summary: Probability

Probability is a measure of uncertainty telling us how likely an event (or events) is (are) to occur
Probabilities are:
- Non-negative
- Unitary
- Additive
Two different interpretations of probability:
- Frequentists: Probability is a long run relative frequency
- Bayesians: Probability reflect subjective beliefs which we update upon observing data

Conditional Probability

Conditional Probability: Definition

The conditional probability that event A occurred, given that event B occurred is written as \(Pr(A|B)\) (“The probability of A given B”) and defined as:

\[Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)} = \frac{\text{Probability of Both A and B}}{\text{Probability of B}}\]

\(Pr(A \cap B)\) is the same as \(Pr(A \text{ and } B)\) is the joint probability of both events occurring
\(Pr(B)\) is the marginal probability of B occuring

Conditional Probability: Multiplication Rule

Joint probabilities are symmetrical. \(Pr(A \cap B) = Pr(B \cap A)\).

By rearranging terms:

\[Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)}\] We get the multiplication rule:

\[Pr(A \cap B) = Pr(A|B)Pr(B) = Pr(B|A)Pr(A)\]

The Law of Total Probability (Part 2)

We can use multiplication rule to derive an alternative form of the law of total probability:

\[Pr(A) = Pr(A|B)Pr(B) + Pr(A|B^\complement)Pr(B^\complement)\]

Independence

Events \(A\) and \(B\) are independent if

\[Pr(A|B) = Pr(A) \text{ and } Pr(B|A) = Pr(B)\]

Conceputally, If \(A\) and \(B\) are independent knowing whether \(B\) occurred, tells us nothing about \(A\), and so the conditional probability of \(A\) given \(B\), \(Pr(A|B)\) is equal to the unconditional, or marginal probability, \(Pr(A)\)

Independence

Formally, two events are statistically independent if and only if the joint probability is equal to product of the marginal probabilities

\[Pr(A\text{ and }B) = Pr(A)Pr(B)\]

Conditional Independence

We can extend the concept of independence to situations with more than two events:

If events \(A\), \(B\), and \(C\) are jointly independent then:

\[Pr(A \cap B \cap C) = Pr(A)Pr(B)Pr(C)\]

Joint independence implies pairwise independence and conditional independence:

\[Pr(A \cap B | C) = Pr(A|C)Pr(B|C)\] But not the reverse.

Bayes Rule

Bayes rule is theorem for how we should update our beliefs about \(A\) given that \(B\) occurred:

\[Pr(A|B) = \frac{Pr(B|A)Pr(A)}{Pr(B)} = \frac{Pr(B|A)Pr(A)}{Pr(B|A)Pr(A)+Pr(B|A^\complement)Pr(A^\complement)}\] Where

\(Pr(A)\) is called the prior probability of A (our initial belief)
\(Pr(A|B)\) is called the posterior probability of A given B (our updated belief after observing B)

What’s the probability you have Covid-19 given a positive test

\[Pr(Covid|Test +) = \frac{Pr(+|Covid)Pr(Covid)}{Pr(Test +)}\]

Possible Outcomes

Four possible outcomes

Test	Have Covid	Don't Have Covid
Positive	True Positive	False Positive
Negative	False Negative	True Negative

What’s the probability you have Covid-19 given a positive test

Let’s assume:

1 out 100 people have Covid-19
Our test correctly identifies true positives 95 percent of the time (sensitivity = True Positive Rate)
Our test correctly identifies true negatives 95 percent of the time (specificity = True Negative Rate)

In a sample of 100,000 people then:

Test	Have Covid	Don't Have Covid
Positive	950	4950
Negative	50	94050

What’s the probability you have Covid-19 given a positive test

Now we can calculate the relevant quantities for:

\[Pr(Covid|Test +) = \frac{Pr(+|Covid)Pr(Covid)}{Pr(Test +)}\]

\(Pr(+|Covid) = 950/(1000) \approx 0.95\)
\(Pr(Covid) = 1000/100000 \approx 0.01\)
\(Pr(+) = Pr(+|Covid) + Pr(+|Covid)= .95*.01 + .05*.99 \approx 0.059\)

Which yields:

\[Pr(Covid|Test +) = \frac{Pr(+|Covid)Pr(Covid)}{Pr(Test +)} = \frac{0.95 \times 0.01}{0.059} \approx 0.16\]

What if you took a second test?

We could use our updated posterior belief as our new prior:

\[Pr(Covid|2nd +) = \frac{0.95 \times 0.16}{0.16\times0.95 + (1-0.16)\times 0.95 } \approx 0.783\]

Now we’re much more confident that we have Covid-19

Random Variables and Probability Distributions

Random Variables

Random variables assign numeric values to each event in an experiment.
- Mutually exclusive and exhaustive, together cover the entire sample space.
Discrete random variables take on finite, or countably infinite distinct values.
Continuous variables can take on an uncountably infinite number of values.

Example: Toss Two Coins

\(S={TT,TH,HT,HH}\)
Let \(X\) be the number of heads
- \(X(TT)=0\)
- \(X(TH)=1\)
- \(X(HT)=1\)
- \(X(HH)=2\)

Probability Distributions

Broadly probability distributions provide mathematical descriptions of random variables in terms of the probabilities of events.

The can be represented in terms of:

Probability Mass/Density Functions
- Discrete variables have probability mass functions (PMF)
- Continuous variables have probability density functions (PDF)
Cumulative Density Functions
- Discrete: Summation of discrete probabilities
- Continuous: Integration over a range of values

Discrete distributions

Probability Mass Function (pmf): \(f(x)=p(X=x)\)
Assigns probabilities to each unique event such that Kolmogorov Axioms (Positivity, Certainty, and Additivity) still apply
Cumulative Distribution Function (cdf) \(F(x_j)=p(X\leq x)=\sum_{i=1}^{j}p(x_i)\)
- Sum of the probability mass for events less than or equal to \(x_j\)

Example: Toss Two coins

\(S={TT,TH,HT,HH}\)
Let \(X\) be the number of heads
- \(X(TT)=0\)
- \(X(TH)=1\)
- \(X(HT)=1\)
- \(X(HH)=2\)
\(f(X=0)=p(X=0)=1/4\)
\(f(X=1)=p(X=1)=1/2\)
\(F(X\leq 1) = p(X \leq 1)= 3/4\)

Rolling a die

Each side has equal probability of occurring (1/6). The probability that you roll a 2 or less P(X<=2) = 1/6 + 1/6 = 1/3

Continuous distributions

Probability Density Functions (PDF): \(f(x)\)
- Assigns probabilities to events in the sample space such that Kolmogorov Axioms still apply
- But… since their are an infinite number of values a continuous variable could take, p(X=x)=0, that is, the probability that X takes any one specific value is 0.
Cumulative Distribution Function (CDF) \(F(x)=p(X\leq x)=\int_{-\infty}^{x}f(x)dx\)
- Instead of summing up to a specific value (discrete) we integrate over all possible values up to \(x\)
- Probability of having a value less than x

Integrals

What’s the area of the rectangle?

\(base\times height\)

Integrals

How would we find the area under a curve?

Integrals

Well suppose we added up the areas of a bunch of rectangles roughly whose height’s approximated the height of the curve?

Can we do any better?

Integrals

Let’s make the rectangles smaller

What happens as the width of rectangles get even smaller, approaches 0? Our approximation get’s even better

Link between PDF and CDF

If \[F(x)=p(X\leq x)=\int_{-\infty}^{x}f(x)dx \]

Then by the fundamental theorem of calculus

\[\frac{d}{dx}F(x)=f(x)\]

In words

the PDF \((f(x))\)is the derivative (rate of change) of the CDF \((F(X))\)
the CDF describes the area under the curve defined by f(x) up to x

Properties of the CDF

\(0\leq F(x) \leq 1\)
\(F\) is non-decreasing and right continuous
\(\lim_{x\to-\infty}F(x)=0\)
\(\lim_{x\to\infty}F(x)=1\)
For all \(a,b \in \mathbb{R}\) s.t. \(a<b\)

\[p(a < X \leq b) = F(b)- F(a) = \int_a^b f(x)dx \]

Recall the PMF and CDF of a die

What’s the probability

\(p(X=1)...p(X=6) = 1/6\)
\(p( 2 < X \leq 5) = F(5)-F(2)=5/6-1/6=4/6=2/3\)

What we’ll use proability distributions for:

In this course, we’ll use probability distributions to

Model the data generating process as a function of parameters we can estimate
To perform statitical inference based on asymptotic theory (statements about how statistics would be have as our sample size approached infinity)

Please memorize these over Spring Break:

Source

Expected Values and Variances

Expected Value

A (probability) weighted average of the possible outcomes of a random variable, often labeled \(\mu\)

Discrete:

\[\mu_X=E(X)=\sum xp(x)\]

Continuous

\[\mu_X=E(X)=\int_{-\infty}^{\infty}xf(x) dx\]

Condtional Expectations:

For a continuous variable:

\[ E(X|Y=y) = \int_{-\infty}^{\infty}xf_{x|y}(x|y) dx \]

Where:

\[ f_{x|y}(x|y) = \frac{f_{x,y}(x,y)}{f_y(y)} = \frac{\text{Joint distribution of X and Y}}{\text{Marginal distribution of Y}} \]

Which follows from the law of total probability

What’s the expected value of a 1 roll of fair die?

\[\begin{align*} E(X)&=\sum_{i=1}^{6}x_ip(x_i)\\ &=1/6\times(1+2+3+4+5+6)\\ &= 21/6\\ &=3.5 \end{align*}\]

Properties of Expected Values

\(E(c)=c\)
\(E(a+bX)=a+bE[X]\)
\(E[E[X]]=X\)
\(E[E[Y|X]]=E[Y]\)
\(E[g(X)]=\int_{-\infty}^\infty g(x)f(x)dx\)
\(E[g(X_1)+\dots+g(X_n)]=E[g(X_1)]+\dots E[g(X_n)\)
\(E[XY]=E[X]E[Y]\) if \(X\) and \(Y\) are independent

How many times would you have to roll a fair die to get all six sides?

We can think of this as the sum of the expected values for a series of geometric distributions with varying probabilities of success, \(p\). The expected value of a geometric variable is:

\[\begin{align*}E(X)&=\sum_{k=1}^{\infty}kp(1-p)^{k-1} \\ &=p\sum_{k=1}^{\infty}k(1-p)^{k-1} \\ &=p\left(-\frac{d}{dp}\sum_{k=1}^{\infty}(1-p)^k\right) \text{(Chain rule)} \\ &=p\left(-\frac{d}{dp}\frac{1-p}{p}\right) \text{(Geometric Series)} \\ &=p\left(\frac{d}{dp}\left(1-\frac{1}{p}\right)\right)=p\left(\frac{1}{p^2}\right)=\frac1p\end{align*}\]

Rolling a fair die to get all six sides

For this question, we need to calculate the probability of success, p, after getting a side we need.

The probability of getting a side you need on your first role is 1. The probability of getting a side you need on the second role, is 5/6 and so the expected number of roles is 6/5, and so the expected number of rolls to get all six is:

ev <- c()
for(i in 6:1){
  ev[i] <- 6/i
  
}
# Expected rolls for each 1 through 6th side
rev(ev)

[1] 1.0 1.2 1.5 2.0 3.0 6.0

# Total 
sum(ev)

[1] 14.7

Variance

If \(X\) has a finite mean \(E[X]=\mu\), then \(E[(X-\mu)^2]\) is finite and called the variance of \(X\) which we write as \(\sigma^2\) or \(Var[X]\).

Variance

\[\begin{align*} \sigma^2=E[(X-\mu)^2]&=E[(X^2-2\mu X+\mu^2)]\\ &= E[X^2]-2\mu E[X]+\mu^2\\ &= E[X^2]-2\mu^2+\mu^2\\ &= E[X^2]-\mu^2\\ &= E[X^2]-E[X]^2 \end{align*}\]

“The variance of X is equal to the expected value of X-squared, minus the square of X’s expected value.”
\(\sigma^2=E[X^2]-E[X]^2\) is a useful identity in proofs and derivations

Standard Deviations

A standard deviation is just the square root of the variance

\[\sigma=\sqrt{Var[X]}\]

Standard deviations are useful for describing:

A typical deviation from the mean/Expected value
The width or spread of a distribution

Covariance and correlation

Covariance measures the degree to which two random variables vary together.

\(Cov[X,Y] \to +\) An increase in \(X\) tends to be larger than its mean when \(Y\) is larger than its mean

\[Cov[X,Y]=E[(X-E[X])(Y-E[Y])]=E[XY]-E[X]E[Y]\]

The correlation between \(X\) and \(Y\) is simply the covariance of \(X\) and \(Y\) divided by the standard deviation of each.

\[\rho=\frac{Cov[X,Y]}{\sigma_X\sigma_Y}\]

Normalized covariance to a scale that runs between \([-1,1]\)

Properties of Variance and Covariance

\(Cov[X,Y]=E[XY]-E[X]E[Y]\)
\(Var[X]=E[X^2]-(E[X])^2\)
\(Var[X|Y]=E[X^2|Y]-(E[X|Y])^2\)
\(Cov[X,Y]=Cov[X,E[Y|X]]\)
\(Var[X+Y]=Var[X]+Var[Y]+2Cov[X,Y]\)
\(Var[Y]=Var[E[Y|X]]+E[Var[Y|X]]\)

What you need to know (WYNK)

Honestly, for this class, you won’t need to know these properties.

They’ll show up in proofs and theorems and become important when you’re trying to evaluate properties of an estimator (isn’t unbiased, is it “efficient”, or consistent does it have minimum variance?) but that’s for another day/course.

Summary: Random Variables and Probability Distributions

Random variables assign numeric values to each event in an experiment.
Probability distributions assign probabilities to the values that a random variable can take.
- Discrete distributions are described by their pmf and cdf
- Continuous distributions by their pdf and cdf

Summary: Random Variables and Probability Distributions

Probability distributions let us describe the data generating process and encode information about the world into our models
- There are lots of distributions
- Don’t worry about memorizing formulas
- Do develop intuitions about the nature of your data generating process (Is my outcome continuous or disrecte, binary or count, etc.)
Two key features of probability distributions are their:
- Expected values probability weighted averages
- Variances which quantify variation around expected values

Standard Errors for Regression

Interpreting regressions

Regression coefficients \((\beta)\) are crucial for substantive interpretations (sign and size)
The standard errors of these coefficients \((SE(\beta))\) are the key to evaluating the statistical significance of these coefficients

What’s a standard error?

The standard error of an estimate is the standard deviation of the theoretical sampling distribution
A sampling distribution is a distribution of the estimates we would observe in repeated sampling
- Example: Re-run the 1978 CPS, we get different respondents, and thus different estimates.
Standard errors describe the width of the sampling distribution
- How much our estimates might vary from the true (population) value from sample to sample.
Standard errors can be used to construct intervals and conduct tests that quantify our uncertainty about our estimate

Standard errors of regression coefficients

For a linear regression written in matrix notation:

\[ y = X\beta + \epsilon \]

OLS yields estimates of \(\beta\), \(\hat{\beta}\) by minimizing the sum of squared residuals

\[ \hat{\beta} = (X'X)^{-1}X'y \]

Standard errors of regression coefficients

One can show under a set of assumptions that variance-covariance matrix of \(\hat{\beta}\) is

\[ \begin{aligned} E[(\hat{\beta} -\beta)-(\hat{\beta} -\beta)'] &= \sigma^2(X'X)^{-1}\\ & = \begin{bmatrix} Var(\hat\beta_0) & Cov(\hat{\beta_0},\hat{\beta_1}) & \cdots & Cov(\hat{\beta_0},\hat{\beta_k}))\\ Cov(\hat\beta_1,\hat{\beta_0}) & Var(\hat{\beta_1},) & \cdots & Cov(\hat{\beta_1},\hat{\beta_k}))\\ \vdots & \vdots & \vdots & \vdots \\ Cov(\hat\beta_k,\hat{\beta_0}) & Cov(\hat{\beta_k},\hat{\beta_1}) & \cdots & Var(\hat{\beta_k})\\ \end{bmatrix} \end{aligned} \] Where we can estimate \(\sigma^2\) with \(\hat{\sigma}^2\) the mean \((1/n-k)\) squared error \((\epsilon'\epsilon)\) of the regression

\[ \hat{\sigma}^2 = \frac{\epsilon'\epsilon}{n-k} \]

Standard errors of regression coefficients

The standard error for the \(k\)th coefficient \(\beta_k\) is simply the square root of the of the \(k\)th diagnoal element of the variance-covariance matrix

\[ \text{SE}(\beta_k) = \sqrt{Var(\beta_k)} \]

Robust Standard Errors

\(\sigma^2(X'X)^{-1}\) is a good estimate of the variance of \(\hat{\beta}\) if the errors in a regression are independent and identically distributed (iid).

These turn out to be strong assumptions that are violated when there is:

Non-constant error variance (aka heteroskedasticity)
- The variance among the treated units tends to be higher than the variance among control units
Autocorrelation
- We observe the same unit over multiple periods (Say RI in 2016, 2018, 2020)
Clustering
- Respondents in RI are more similar to each other than respondents in MA

Robust Standard Errors

Robust standard errors are ways of calculating standard errors for regressions when we think the assumption of IID errors is unrealistic
- The assumption of IID is almost always unrealistic…
We call these of estimators robust because they provide consistent estimates of the SE even when errors are not independent and identically distributed.

Robust standard errors in R

`lm_robust()`

In this weeks lab we will get practice using the lm_robust() function from the estimatr.

As you will see, lm_robust() provides a convenient way to:

calculate a variety of robust standard errors using the se_type = "stata" argument for example to get the SEs Stata uses
include fixed effects using fixed_effects = ~ st + year argument
cluster standard errors by some grouping id variable cluster=st
generate estimates quickly using the Cholesky Decomposition

Previewing Lab 8

Overview

The goals of this weeks lab are to:

Help develop your inuition behind the Two-way Fixed Effects Estimator
Learn how to estimate models with fixed effects and robust clustered standard errors using lm_robust()
Interpret the marginal effects of interaction models

Question 8 from last week’s lab asked you to recreate Figure 2 from Grumbach and Hill (2022)

# Load data
load(url("https://pols1600.paultesta.org/files/data/cps_clean.rda"))

# Calculate turnout by age group and SDR
cps %>% 
  group_by(age_group, SDR) %>% 
  summarise(
    prop_vote = mean(dv_voted, na.rm=T)
  ) -> fig2_df

# Recreate figure 2
fig2_df %>% 
  filter(!is.na(age_group)) %>% 
  ggplot(aes(age_group,prop_vote,fill = SDR))+
  geom_bar(stat = "identity",
           position = "dodge") -> fig2

Q3.1 Describing variation across states

Task
Code
Fig 3.1
Fixed Effects

Q3 will ask you to describe variation in turnout across states and years, and policy.

Let’s get a little practice calculating turnout across states

# Calculate turnout by state
cps %>% 
  group_by(st) %>% 
  summarise(
    turnout = mean(dv_voted, na.rm=T)
  ) %>% 
  mutate(
    st = fct_reorder(st, turnout)
    ) -> df_state



# Visualize turnout by state
df_state %>% 
  ggplot(aes(turnout,st))+
  geom_bar(stat = "identity") -> fig3_1

As you can see, there is considerable variation in average turnout across States

Q3.2 will ask you to describe similar variation across years.

Q3.3 will then ask you to look at variation across SDR policy within a single state.

The goal these questions is to help illustrate motivation for including fixed effects as way of generalizing the logic of a difference in differences design

References

Grumbach, Jacob M, and Charlotte Hill. 2022. “Rock the Registration: Same Day Registration Increases Turnout of Young Voters.” The Journal of Politics 84 (1): 405–17.