Probability is defined by three rules, or assumptions, called the Kolmogorov Axioms:
Positivity: The probability of any event is nonnegative
Certainty: The probability that one of the outcomes in the sample space occurs is 1
Additivity: If events A and B are mutually exclusive, then:
Pr(A \cup B) = Pr(A) + Pr(B)
The Addition Rule
For any events A and B, the addition rule gives the probability of either A or B occurring:
Pr(A \cup B) = Pr(A) + Pr(B) - Pr(A \cap B)
In words: The probability of either A or B occurring is the probability that A occurs plus the probability that B occurs, minus the probability that both occur (so that we’re not double counting).
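As a quick check of the addition rule, the sketch below enumerates one roll of a fair die. The events A (an even number) and B (three or less) are illustrative choices, not part of the original slides.

```r
# Sample space for one roll of a fair die; each outcome has probability 1/6
omega <- 1:6
pr <- function(event) length(event) / length(omega)

A <- c(2, 4, 6)   # illustrative event: the roll is even
B <- c(1, 2, 3)   # illustrative event: the roll is three or less

lhs <- pr(union(A, B))                       # Pr(A or B), counted directly
rhs <- pr(A) + pr(B) - pr(intersect(A, B))   # addition rule
c(lhs = lhs, rhs = rhs)
```

Without subtracting the overlap, the outcome 2 (which is both even and three or less) would be counted twice.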
There are two broad ways of interpreting what probabilities mean:
Frequentist
Bayesian
Frequentist interpretations of probability
Probability describes how likely it is that some event happens.
If we flip a fair coin, the probability of heads is Pr(Heads) = 0.5
Frequentists view this probability as the limit of the relative frequency of an event over repeated trials.
Pr(E) = \lim_{n \to \infty} \frac{n_{E}}{n} \approx \frac{\text{\# of Times E happened}}{\text{Total \# of Trials}}
Thinking about probability as a relative frequency requires us to know how to count the number of times an event occurred.
Probabilities from a Frequentist perspective are defined by fixed and unknown parameters
The goal of statistics for a frequentist is to learn about these parameters from data.
Frequentist statistics often asks questions like “What is the probability of observing some data, given a hypothesis about the true value of the parameter(s) that generated it?”
For example, suppose we wanted to test whether a coin is “fair.” We could:
Flip the coin 10 times. Our estimate of Pr(Heads) is the number of heads divided by 10. It could be 0.5, but also 0 or 1, or some number in between.
Flip the coin 100 times and our estimate will tend to be closer to the true Pr(Heads).
Flip the coin an infinite number of times and the relative frequency will converge to the true parameter.
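The convergence described above can be sketched with a short simulation. The true probability of 0.5, the seed, and the sample sizes are assumptions chosen for illustration.

```r
set.seed(42)     # arbitrary seed so the sketch is reproducible
p_true <- 0.5    # assumed true Pr(Heads)
flips <- rbinom(100000, size = 1, prob = p_true)   # 1 = heads, 0 = tails

# Relative frequency of heads after the first n flips
n <- c(10, 100, 1000, 10000, 100000)
rel_freq <- cumsum(flips)[n] / n
data.frame(n = n, rel_freq = rel_freq)
```

With only 10 flips the estimate can be far from 0.5; as n grows, the relative frequency settles near the true parameter.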
Bayesian interpretations of probability
Frequentist interpretations make sense for describing processes that we could easily repeat (e.g. coin flips, surveys, experiments)
But they feel more convoluted when trying to describe events like “the probability that Biden wins reelection.”
Bayesian interpretations of probability view probabilities as subjective beliefs.
The task for Bayesian statistics is to update these prior beliefs based on a model of the likelihood of observing some data, forming new beliefs after observing the data (called the posterior beliefs).
Bayesian Updating
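Bayesians update their beliefs according to Bayes Rule, which says:
posterior \propto likelihood \times prior
More formally, for a belief A and observed data B:
Pr(A | B) = \frac{Pr(B | A) Pr(A)}{Pr(B)}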
Bayesian vs Frequentists
Our two main tools for doing statistical inference in this course
Hypothesis Testing
Interval Estimation
Follow largely from frequentist interpretations of probability
The differences between Bayesian and Frequentist frameworks are both philosophical and technical in nature
Is probability a relative frequency or a subjective belief? How do we form and use prior beliefs?
Bayesian statistics relies heavily on algorithms for Markov chain Monte Carlo simulations made possible by advances in computing.
For most of the questions in this course, these two frameworks will yield similar (even identical) conclusions.
Sometimes it’s helpful to think like a Bayesian; other times, like a frequentist.
Summary: Probability
Probability is a measure of uncertainty telling us how likely an event (or events) is (are) to occur
Probabilities are:
Non-negative
Unitary
Additive
Two different interpretations of probability:
Frequentists: Probability is a long run relative frequency
Bayesians: Probability reflects subjective beliefs, which we update upon observing data
Conditional Probability
Conditional Probability: Definition
The conditional probability that event A occurred, given that event B occurred, is written as Pr(A | B) (“The probability of A given B”) and defined as:
Pr(A | B) = \frac{Pr(A \cap B)}{Pr(B)}
Pr(A \cap B), also written Pr(A, B), is the joint probability of both events occurring
Pr(B) is the marginal probability of B occurring (the definition requires Pr(B) > 0)
Conditional Probability: Multiplication Rule
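Joint probabilities are symmetrical: Pr(A \cap B) = Pr(B \cap A).
By rearranging terms in the definition of conditional probability:
Pr(A | B) Pr(B) = Pr(A \cap B)
We get the multiplication rule:
Pr(A \cap B) = Pr(A | B) Pr(B) = Pr(B | A) Pr(A)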
The Law of Total Probability (Part 2)
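We can use the multiplication rule to derive an alternative form of the law of total probability, where B^{c} denotes the event that B does not occur:
Pr(A) = Pr(A \cap B) + Pr(A \cap B^{c}) = Pr(A | B) Pr(B) + Pr(A | B^{c}) Pr(B^{c})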
Independence
Events A and B are independent if:
Pr(A | B) = Pr(A)
Conceptually, if A and B are independent, knowing whether B occurred tells us nothing about A, and so the conditional probability of A given B, Pr(A | B), is equal to the unconditional, or marginal, probability, Pr(A).
Formally, two events are statistically independent if and only if the joint probability is equal to the product of the marginal probabilities:
Pr(A \cap B) = Pr(A) Pr(B)
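To make conditional probability, the multiplication rule, the law of total probability, and independence concrete, here is a minimal sketch that enumerates two fair dice. The events A and B are illustrative assumptions, picked because they happen to be independent.

```r
# All 36 equally likely outcomes of rolling two fair dice
rolls <- expand.grid(d1 = 1:6, d2 = 1:6)

A <- rolls$d1 %% 2 == 0          # illustrative event: first die is even
B <- rolls$d1 + rolls$d2 == 7    # illustrative event: the dice sum to 7

pr_A  <- mean(A)                 # marginal probability of A
pr_B  <- mean(B)                 # marginal probability of B
pr_AB <- mean(A & B)             # joint probability of A and B
pr_A_given_B <- pr_AB / pr_B     # definition of conditional probability

# Multiplication rule: Pr(A and B) = Pr(A | B) Pr(B)
all.equal(pr_AB, pr_A_given_B * pr_B)

# Law of total probability: Pr(A) = Pr(A | B) Pr(B) + Pr(A | not B) Pr(not B)
pr_A_given_notB <- mean(A & !B) / mean(!B)
all.equal(pr_A, pr_A_given_B * pr_B + pr_A_given_notB * (1 - pr_B))

# Independence: Pr(A | B) = Pr(A), equivalently Pr(A and B) = Pr(A) Pr(B)
all.equal(pr_A_given_B, pr_A)
all.equal(pr_AB, pr_A * pr_B)
```

Knowing that the dice sum to 7 leaves the chance that the first die is even at 1/2, which is what independence means here.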
Conditional Independence
We can extend the concept of independence to situations with more than two events:
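If events A, B, and C are jointly independent then:
Pr(A \cap B \cap C) = Pr(A) Pr(B) Pr(C)
Joint independence implies pairwise independence and conditional independence:
Pr(A \cap B) = Pr(A) Pr(B), and likewise for the other pairs
Pr(A \cap B | C) = Pr(A | C) Pr(B | C)
But not the reverse: pairwise or conditional independence does not imply joint independence.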
Bayes Rule
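Bayes rule is a theorem for how we should update our beliefs about A given that B occurred:
Pr(A | B) = \frac{Pr(B | A) Pr(A)}{Pr(B)} = \frac{Pr(B | A) Pr(A)}{Pr(B | A) Pr(A) + Pr(B | A^{c}) Pr(A^{c})}
Where
Pr(A) is called the prior probability of A (our initial belief)
Pr(A | B) is called the posterior probability of A given B (our updated belief after observing B)
A^{c} is the event that A does not occur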
What’s the probability you have Covid-19 given a positive test?
Possible Outcomes
Four possible outcomes:
| Test | Have Covid | Don't Have Covid |
| --- | --- | --- |
| Positive | True Positive | False Positive |
| Negative | False Negative | True Negative |
Let’s assume:
1 out of 100 people have Covid-19
Our test correctly identifies true positives 95 percent of the time (sensitivity = True Positive Rate)
Our test correctly identifies true negatives 95 percent of the time (specificity = True Negative Rate)
In a sample of 100,000 people then:
| Test | Have Covid | Don't Have Covid |
| --- | --- | --- |
| Positive | 950 | 4950 |
| Negative | 50 | 94050 |
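Now we can calculate the relevant quantities for Bayes rule:
Pr(Covid | +) = \frac{Pr(+ | Covid) Pr(Covid)}{Pr(+ | Covid) Pr(Covid) + Pr(+ | No Covid) Pr(No Covid)}
Which yields:
Pr(Covid | +) = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} = \frac{950}{950 + 4950} \approx 0.16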
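What if you took a second test?
We could use our updated posterior belief (about 0.161) as our new prior. If the second test is also positive, and its errors are independent of the first test’s:
Pr(Covid | +, +) = \frac{0.95 \times 0.161}{0.95 \times 0.161 + 0.05 \times 0.839} \approx 0.78
Now we’re much more confident that we have Covid-19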
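The two rounds of updating can be sketched as a small function. The helper name `update_positive` is hypothetical, and the second call assumes that the two tests err independently given a person's true status.

```r
# Bayes rule for a binary condition after a positive test result
update_positive <- function(prior, sensitivity, specificity) {
  true_pos  <- sensitivity * prior               # Pr(+ | Covid) Pr(Covid)
  false_pos <- (1 - specificity) * (1 - prior)   # Pr(+ | No Covid) Pr(No Covid)
  true_pos / (true_pos + false_pos)              # Pr(Covid | +)
}

# First positive test, starting from the 1-in-100 base rate
post1 <- update_positive(prior = 0.01, sensitivity = 0.95, specificity = 0.95)
# Second positive test: the first posterior becomes the new prior
post2 <- update_positive(prior = post1, sensitivity = 0.95, specificity = 0.95)
round(c(first_test = post1, second_test = post2), 3)
```

The first posterior equals 950 / (950 + 4950) from the table of counts, so the formula and the frequency table tell the same story.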
Random Variables and Probability Distributions
Random Variables
Random variables assign numeric values to each event in an experiment.
These events are mutually exclusive and exhaustive: together they cover the entire sample space.
Discrete random variables take on a finite or countably infinite number of distinct values.
Continuous variables can take on an uncountably infinite number of values.
Example: Toss Two Coins
Let X be the number of heads: X = 0 for TT, X = 1 for HT or TH, and X = 2 for HH
Probability Distributions
Broadly, probability distributions provide mathematical descriptions of random variables in terms of the probabilities of events.
They can be represented in terms of:
Probability Mass/Density Functions
Discrete variables have probability mass functions (PMF)
Continuous variables have probability density functions (PDF)
Cumulative Distribution Functions
Discrete: Summation of discrete probabilities
Continuous: Integration over a range of values
Discrete distributions
Probability Mass Function (pmf): f(x) = Pr(X = x)
Assigns probabilities to each unique event such that the Kolmogorov Axioms (Positivity, Certainty, and Additivity) still apply
Cumulative Distribution Function (cdf): F(x) = Pr(X \le x) = \sum_{x_i \le x} f(x_i)
Sum of the probability mass for events less than or equal to x
Example: Toss Two coins
Let X be the number of heads. The pmf is f(0) = 1/4, f(1) = 1/2, f(2) = 1/4, and the cdf is F(0) = 1/4, F(1) = 3/4, F(2) = 1
Rolling a die
Each side has equal probability of occurring (1/6). The probability that you roll a 2 or less is Pr(X \le 2) = 1/6 + 1/6 = 1/3
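Both discrete examples can be reproduced in a few lines. This sketch treats the number of heads in two fair tosses as a Binomial(2, 0.5) variable, which is a modeling assumption the slides imply but do not state.

```r
# Two coin tosses: X = number of heads, so X ~ Binomial(size = 2, prob = 0.5)
x_coins <- 0:2
pmf_coins <- dbinom(x_coins, size = 2, prob = 0.5)   # Pr(X = x)
cdf_coins <- cumsum(pmf_coins)                       # Pr(X <= x)
data.frame(x = x_coins, pmf = pmf_coins, cdf = cdf_coins)

# One roll of a fair die: each side has probability 1/6
pmf_die <- rep(1 / 6, 6)
cdf_die <- cumsum(pmf_die)
cdf_die[2]   # Pr(X <= 2)
```

The cdf is just the running sum of the pmf, which is why `cumsum()` is enough here.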
Continuous distributions
Probability Density Functions (PDF):
Assigns probability density to values in the sample space such that the Kolmogorov Axioms still apply
But… since there are an uncountably infinite number of values a continuous variable could take, Pr(X = x) = 0; that is, the probability that X takes any one specific value is 0.
Cumulative Distribution Function (CDF)
Instead of summing up to a specific value (discrete), we integrate the density f over all possible values up to x:
F(x) = Pr(X \le x) = \int_{-\infty}^{x} f(t) dt
Probability of having a value less than or equal to x
Integrals
What’s the area of the rectangle?
How would we find the area under a curve?
Well, suppose we added up the areas of a bunch of rectangles whose heights roughly approximated the height of the curve?
Can we do any better?
Let’s make the rectangles smaller
What happens as the width of the rectangles gets even smaller and approaches 0? Our approximation gets even better
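The rectangle argument can be sketched numerically. The curve used here, the standard normal density between -1 and 1, is an assumption for illustration (the slides' figure is not available), and `riemann` is a hypothetical helper.

```r
# Approximate the area under f between a and b using n rectangles
# whose heights are the value of f at each rectangle's midpoint
riemann <- function(f, a, b, n) {
  width <- (b - a) / n
  mids  <- a + width * (seq_len(n) - 0.5)
  sum(f(mids) * width)
}

n_rect <- c(2, 10, 100, 1000)
approx_area <- sapply(n_rect, function(n) riemann(dnorm, -1, 1, n))
exact_area  <- pnorm(1) - pnorm(-1)   # area from R's built-in normal CDF
data.frame(n_rect, approx_area, error = approx_area - exact_area)
```

As the rectangles get narrower, the error shrinks toward 0, which is the idea behind defining the CDF of a continuous variable as an integral.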
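Expected Value
A (probability) weighted average of the possible outcomes of a random variable, often labeled \mu
Discrete: E[X] = \sum_{x} x Pr(X = x)
Continuous: E[X] = \int_{-\infty}^{\infty} x f(x) dx
Conditional Expectations: E[Y | X = x] = \sum_{y} y Pr(Y = y | X = x)
For a continuous variable: E[Y | X = x] = \int_{-\infty}^{\infty} y f(y | x) dy
Where: f(y | x) = \frac{f(x, y)}{f(x)} and f(x) = \int_{-\infty}^{\infty} f(x, y) dy
Which follows from the law of total probability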
What’s the expected value of one roll of a fair die?
E[X] = \frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5
Properties of Expected Values
E[a + bX] = a + bE[X]
E[X + Y] = E[X] + E[Y]
E[XY] = E[X]E[Y] if X and Y are independent
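A minimal sketch of the die example, which also checks two standard properties of expected values: linearity, E[a + bX] = a + bE[X], and the product rule for independent variables. The constants `a` and `b` are arbitrary illustrative values.

```r
x <- 1:6             # faces of a fair die
p <- rep(1 / 6, 6)   # probability of each face

ev_x <- sum(x * p)   # E[X], the probability-weighted average

# Linearity: E[a + bX] should equal a + b * E[X]
a <- 2; b <- 3
ev_linear <- sum((a + b * x) * p)

# Two independent dice: the joint pmf is the product of the marginal pmfs
joint <- outer(p, p)
ev_xy <- sum(outer(x, x) * joint)   # E[XY]

c(ev_x = ev_x, ev_linear = ev_linear, a_plus_b_ev = a + b * ev_x,
  ev_xy = ev_xy, ev_x_times_ev_y = ev_x * ev_x)
```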
How many times would you have to roll a fair die to get all six sides?
We can think of this as the sum of the expected values for a series of geometric distributions with varying probabilities of success, p. The expected value of a geometric variable (the number of trials up to and including the first success) is:
E[X] = \frac{1}{p}
Rolling a fair die to get all six sides
For this question, we need to calculate the probability of success, p, after getting a side we need.
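The probability of getting a side you need on your first roll is 1. Once you have one side, the probability that any given roll shows a side you still need is 5/6, and so the expected number of rolls to get a second side is 6/5. Continuing in this way, the expected number of rolls to get all six is:
\frac{6}{6} + \frac{6}{5} + \frac{6}{4} + \frac{6}{3} + \frac{6}{2} + \frac{6}{1} = 14.7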
ev <- numeric(6)
for (i in 6:1) {
  ev[i] <- 6 / i
}
# Expected rolls for each 1 through 6th side
rev(ev)
[1] 1.0 1.2 1.5 2.0 3.0 6.0
# Total
sum(ev)
[1] 14.7
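The analytical answer can be checked by simulation. This is a sketch under assumed settings: the seed and the number of replications are arbitrary, and `rolls_to_see_all` is a hypothetical helper.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Roll a fair die until all six sides have appeared; return the number of rolls
rolls_to_see_all <- function(sides = 6) {
  seen <- logical(sides)
  n_rolls <- 0
  while (!all(seen)) {
    seen[sample.int(sides, 1)] <- TRUE
    n_rolls <- n_rolls + 1
  }
  n_rolls
}

sims <- replicate(20000, rolls_to_see_all())
mean(sims)        # simulated average number of rolls
sum(6 / (6:1))    # analytical expected value
```

The simulated mean should land close to the analytical value, with the gap shrinking as the number of replications grows.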
Variance
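If X has a finite mean E[X] = \mu, then E[(X - \mu)^2] is called the variance of X, which we write as \sigma^2 or Var(X). The variance is nonnegative, and can be infinite even when the mean is finite.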
“The variance of X is equal to the expected value of X-squared, minus the square of X’s expected value.”
Var(X) = E[X^2] - (E[X])^2 is a useful identity in proofs and derivations
Standard Deviations
A standard deviation is just the square root of the variance
Standard deviations are useful for describing:
A typical deviation from the mean/Expected value
The width or spread of a distribution
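As a worked example, the sketch below computes the variance of one roll of a fair die two ways, from the definition and from the identity Var(X) = E[X^2] - (E[X])^2, and then takes the square root to get the standard deviation.

```r
x <- 1:6             # faces of a fair die
p <- rep(1 / 6, 6)   # probability of each face

mu <- sum(x * p)                       # E[X]
var_def      <- sum((x - mu)^2 * p)    # definition: E[(X - mu)^2]
var_identity <- sum(x^2 * p) - mu^2    # identity: E[X^2] - (E[X])^2
sd_x <- sqrt(var_def)                  # standard deviation

c(var_def = var_def, var_identity = var_identity, sd = sd_x)
```

A typical roll is therefore a bit under two pips away from the expected value of 3.5.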
Covariance and correlation
Covariance measures the degree to which two random variables vary together:
Cov(X, Y) = E[(X - E[X])(Y - E[Y])]
When the covariance is positive, X tends to be larger than its mean when Y is larger than its mean
The correlation between X and Y is simply the covariance of X and Y divided by the standard deviation of each:
Cor(X, Y) = \frac{Cov(X, Y)}{\sqrt{Var(X)} \sqrt{Var(Y)}}
This normalizes the covariance to a scale that runs between -1 and 1
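A short sketch of the sample versions of these quantities, computed by hand and compared with R's built-ins. The simulated data, the slope of 0.5, and the seed are assumptions for illustration only.

```r
set.seed(1)  # arbitrary seed
n <- 1000
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)   # y is built to covary positively with x

# Sample analogues of the definitions (dividing by n - 1, as cov() and sd() do)
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cor_xy <- cov_xy / (sd(x) * sd(y))

c(cov_by_hand = cov_xy, cov_builtin = cov(x, y),
  cor_by_hand = cor_xy, cor_builtin = cor(x, y))
```

The covariance depends on the units of x and y, while the correlation does not, which is why the correlation is usually easier to interpret.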
Properties of Variance and Covariance
What you need to know (WYNK)
Honestly, for this class, you won’t need to know these properties.
They’ll show up in proofs and theorems and become important when you’re trying to evaluate properties of an estimator (is it unbiased? is it consistent? is it “efficient,” meaning does it have minimum variance?) but that’s for another day/course.
Summary: Random Variables and Probability Distributions
Random variables assign numeric values to each event in an experiment.
Probability distributions assign probabilities to the values that a random variable can take.
Discrete distributions are described by their pmf and cdf
Continuous distributions by their pdf and cdf
Probability distributions let us describe the data generating process and encode information about the world into our models
There are lots of distributions
Don’t worry about memorizing formulas
Do develop intuitions about the nature of your data generating process (Is my outcome continuous or discrete, binary or count, etc.)
Two key features of probability distributions are their:
Expected values: probability-weighted averages
Variances: measures of variation around expected values
Standard Errors for Regression
Interpreting regressions
Regression coefficients are crucial for substantive interpretations (sign and size)
The standard errors of these coefficients are the key to evaluating the statistical significance of these coefficients
What’s a standard error?
The standard error of an estimate is the standard deviation of its theoretical sampling distribution
A sampling distribution is a distribution of the estimates we would observe in repeated sampling
Example: If we re-ran the 1978 CPS, we would get different respondents, and thus different estimates.
Standard errors describe the width of the sampling distribution
How much our estimates might vary from the true (population) value from sample to sample.
Standard errors can be used to construct intervals and conduct tests that quantify our uncertainty about our estimate
Standard errors of regression coefficients
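For a linear regression written in matrix notation:
y = X\beta + \epsilon
OLS yields estimates of \beta by minimizing the sum of squared residuals:
\hat{\beta} = (X'X)^{-1}X'y
If the errors are independent and identically distributed (IID) with variance \sigma^2, then Var(\hat{\beta} | X) = \sigma^2 (X'X)^{-1}, and the standard errors are the square roots of its diagonal elements (with \sigma^2 estimated from the residuals)
The IID assumption can fail in several ways:
Heteroskedasticity
The variance among the treated units tends to be higher than the variance among control units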
Autocorrelation
We observe the same unit over multiple periods (say, RI in 2016, 2018, 2020)
Clustering
Respondents in RI are more similar to each other than to respondents in MA
Robust Standard Errors
Robust standard errors are ways of calculating standard errors for regressions when we think the assumption of IID errors is unrealistic
The assumption of IID is almost always unrealistic…
We call these estimators robust because they provide consistent estimates of the SE even when errors are not independent and identically distributed.
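To show what a robust estimator changes, here is a minimal base R sketch that compares classical standard errors with a heteroskedasticity-robust "sandwich" estimate. It uses the HC1 variant, which is one common choice and not necessarily the one used in the course; the simulated data, in which treated units are noisier than controls, are an assumption for illustration.

```r
set.seed(2024)  # arbitrary seed
n <- 500
treat <- rbinom(n, 1, 0.5)
# Heteroskedastic errors: treated units are noisier than control units
y <- 1 + 2 * treat + rnorm(n, sd = ifelse(treat == 1, 3, 1))

X <- cbind(intercept = 1, treat = treat)   # design matrix
k <- ncol(X)
XtX_inv  <- solve(crossprod(X))            # (X'X)^{-1}
beta_hat <- XtX_inv %*% crossprod(X, y)    # OLS coefficients
resid    <- as.vector(y - X %*% beta_hat)

# Classical variance: sigma^2 (X'X)^{-1}, valid under IID errors
vcov_classical <- sum(resid^2) / (n - k) * XtX_inv

# HC1 sandwich: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}, scaled by n / (n - k)
meat     <- crossprod(X * resid)           # X' diag(e^2) X
vcov_hc1 <- n / (n - k) * XtX_inv %*% meat %*% XtX_inv

cbind(classical = sqrt(diag(vcov_classical)), hc1 = sqrt(diag(vcov_hc1)))
```

The coefficient estimates are identical under both approaches; only the estimated uncertainty around them changes.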
Robust standard errors in R
lm_robust()
In this week’s lab we will get practice using the lm_robust() function from the estimatr package.
As you will see, lm_robust() provides a convenient way to:
calculate a variety of robust standard errors using the se_type argument, for example se_type = "stata" to get the SEs Stata uses
include fixed effects using the fixed_effects = ~ st + year argument
cluster standard errors by some grouping id variable using the clusters = st argument
generate estimates quickly using the Cholesky Decomposition
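Putting these arguments together, a call might look like the sketch below. It assumes the estimatr package is installed. The data frame `df` and its `turnout` and `sdr` columns are hypothetical, made-up data rather than the lab's data; `st` and `year` follow the argument examples above.

```r
library(estimatr)  # assumes the estimatr package is installed

set.seed(8)  # arbitrary seed
# Hypothetical state-year panel with a made-up same day registration indicator
df <- expand.grid(st = c("MA", "RI", "CT", "NH", "VT", "ME"),
                  year = c(2016, 2018, 2020))
df$sdr <- as.integer((df$st == "RI" & df$year >= 2018) |
                     (df$st == "VT" & df$year >= 2020))
df$turnout <- 50 + 3 * df$sdr + rnorm(nrow(df), sd = 5)

fit <- lm_robust(
  turnout ~ sdr,
  data = df,
  fixed_effects = ~ st + year,   # absorb state and year fixed effects
  clusters = st,                 # cluster standard errors by state
  se_type = "stata"              # Stata-style clustered standard errors
)
summary(fit)
```

Because the fixed effects are absorbed rather than estimated as dummy coefficients, the summary reports only the coefficient on `sdr`.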
Previewing Lab 8
Overview
The goals of this week’s lab are to:
Help develop your intuition behind the Two-way Fixed Effects Estimator
Learn how to estimate models with fixed effects and robust clustered standard errors using lm_robust()
Interpret the marginal effects of interaction models
As you can see, there is considerable variation in average turnout across states
Q3.2 will ask you to describe similar variation across years.
Q3.3 will then ask you to look at variation across same day registration (SDR) policy within a single state.
The goal of these questions is to help illustrate the motivation for including fixed effects as a way of generalizing the logic of a difference-in-differences design
References
Grumbach, Jacob M, and Charlotte Hill. 2022. “Rock the Registration: Same Day Registration Increases Turnout of Young Voters.” The Journal of Politics 84 (1): 405–17.