Data and Measurement
Updated Mar 3, 2025
If it’s your first time here you’ll need to work through Software Setup to follow along today
If you’re still on the waitlist on CAB, speak to me after class
Once you’ve done the following
You can see the available problem sets by running the following code in your console:
And start a specific tutorial by running:
Important
Please upload tutorials 00-intro and 01-measurement1 to Canvas by Friday
It’s cousin Nick’s fault…
Funny icebreaker, but lots of assumptions…
You’re not a murderer
You don’t know someone who’s committed a murder or been murdered
You’ve got a mom and dad
How might we make this question better?
What questions we ask and how we ask them matters
What are we excited about?
What are we worried about?
R, R Studio and Quarto
Getting set up to work in R
Basic Programming in R
R is an open source statistical programming language (cheatsheet)
R Studio is an integrated development environment (IDE) that makes working in R much easier (cheatsheet)
Quarto is a publishing system that allows us to write and present code in different formats (cheatsheet)
Go to class content for current week
Open slides in browser
Open R Studio
Create a .qmd file titled wk01-notes.qmd and save it in your course folder
Get set up to work
Take notes and follow along
R is an interpreter (>)
Everything that exists in R is an object
Everything that happens in R is the result of a function
Data come in different types, shapes, and sizes
Packages extend what R can do
install.packages("package_name") once to download a package
library("package_name") every session to load it
Enter commands line-by-line in the console
The > means R is ready for a command
The + means your last command isn’t complete
If you’re stuck at +, use your escape key!
Send code from a .qmd file to the console:
ctrl + Enter (PC) | cmd + Return (Mac) -> run current line
ctrl + shift + Enter (PC) | cmd + shift + Return (Mac) -> run all code in current chunk
Operator | Description | Usage |
---|---|---|
+ | addition | x + y |
- | subtraction | x - y |
* | multiplication | x * y |
/ | division | x / y |
^ | raised to the power of | x ^ y |
abs | absolute value | abs(x) |
%/% | integer division | x %/% y |
%% | remainder after division | x %% y |
Operator | Description | Usage |
---|---|---|
& | and | x & y |
| | or | x | y |
xor | exactly x or y | xor(x, y) |
! | not | !x |
Operator | Description | Usage |
---|---|---|
< | less than | x < y |
<= | less than or equal to | x <= y |
> | greater than | x > y |
>= | greater than or equal to | x >= y |
== | exactly equal to | x == y |
!= | not equal to | x != y |
%in% | group membership* | x %in% y |
is.na | is missing | is.na(x) |
!is.na | is not missing | !is.na(x) |
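A few of the operators in the tables above can be tried directly in the console. This is a minimal sketch; the vector x and its values are made up for illustration.

```r
# Hypothetical example vector with one missing value
x <- c(3, 8, NA, 12)

# Arithmetic: integer division and remainder
int_div <- 7 %/% 2    # 3
remainder <- 7 %% 2   # 1

# Comparison: returns one TRUE/FALSE (or NA) per element
bigger <- x > 5                   # FALSE TRUE NA TRUE

# Group membership: is each element of x one of these values?
in_set <- x %in% c(3, 12)         # TRUE FALSE FALSE TRUE

# Missingness checks
missing <- is.na(x)               # FALSE FALSE TRUE FALSE

# Combine conditions with & (and); NA & FALSE evaluates to FALSE
bigger_and_observed <- x > 5 & !is.na(x)   # FALSE TRUE FALSE TRUE

# xor() is TRUE when exactly one of its inputs is TRUE
one_only <- xor(TRUE, FALSE)      # TRUE
```

Note that a comparison with a missing value returns NA, which is why it is often paired with !is.na().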
Tip
One common portal of discovery is that a function in R expects data of one type (e.g. numeric) but actually gets data of a different type (e.g. character).
The class() function is a useful base R function for troubleshooting such errors.
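As a small sketch of this kind of troubleshooting, the made-up vector below looks numeric but is stored as text. Checking its class first explains why a numeric function like mean() would not work on it.

```r
# Hypothetical data: numbers that were read in as text
x <- c("1", "2", "3")

# class() reveals the problem
class(x)          # "character"

# mean(x) would return NA with a warning, because x is not numeric

# Convert to numeric, then the calculation works
x_num <- as.numeric(x)
class(x_num)      # "numeric"
mean(x_num)       # 2
```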
Source: Gaurav Tiwari
Name | “Size” | Type of Data | R code |
---|---|---|---|
scalar | 1 | numeric, character, factor, logical | x <- 5 |
vector | N elements: length(x) | all the same | v <- c(1, 2, T, "false") |
matrix | N rows by K columns: dim(x) | all the same | m <- matrix(y,2,2) |
array | N rows by K columns by J dimensions: dim(x) | all the same | a <- array(m,c(2,2,3)) |
data frames | N rows by K columns | can be different | d <- data.frame(x=x, y=y) |
tibbles | N rows by K columns | can be different | d <- tibble(x=x, y=y) |
lists | can vary | can be different | l <- list(x,y,m,a,d) |
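The vector row in the table says its elements are "all the same" type, yet the example mixes numbers, a logical, and text. A short sketch of what R does in that case (it coerces everything to a common type):

```r
# Mixed inputs are coerced to the most flexible type, here character
v <- c(1, 2, TRUE, "false")
class(v)   # "character"
v          # "1" "2" "TRUE" "false"

# Without the text element, TRUE is coerced to the number 1
n <- c(1, 2, TRUE)
class(n)   # "numeric"
sum(n)     # 4

# A list, by contrast, keeps each element's own type
l <- list(1, TRUE, "false")
sapply(l, class)   # "numeric" "logical" "character"
```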
<-
is the assignment operator that assigns a value to a name
c() is the combine function that combines elements together
install.packages() installs packages
library() loads packages you’ve installed so you can use functions and data that are part of that package
Three sources of functions:
<-; mean(x); library("package_name")
install.packages("packageName)"
remotes::intall_github("user/repository")
my_function <- function(x){x^2}
Tip
Can you spot the portal of discovery in the code above?
They have:
names
ingredients (inputs)
steps that tell you what to do with the ingredients (statements/code)
tasty results from applying those steps to given ingredients (outputs)
can_x_kick_it <- function(x){
  # Determine if x can kick it
  # If x in A Tribe Called Quest
  if(x %in% c("Q-Tip","Phife Dawg",
              "Ali Shaheed Muhammad",
              "Jarobi White")){
    return("Yes you can")
  }else{
    return("Before this, did you really know what live was?")
  }
}
can_x_kick_it("Q-Tip")
[1] "Yes you can"
[1] "Before this, did you really know what live was?"
Each time you start a project in R, you will want to:
Set your working directory in R Studio
Load (and if needed, install) the R packages you will use
Set any “global” options you want
Load the data you’ll be using
Install packages once with install.packages("package_name")
Load packages every session with library("package_name")
Let’s install the tidyverse and COVID19 packages.
Keyboard Shortcuts to toggle # comments
macOS: CMD + SHIFT + C
PC: CTRL + SHIFT + C
tidyverse and COVID19 packages
There are three ways to load data.
Load a pre-existing dataset
data("dataset") will load the dataset named “dataset”
data() will list all datasets
Load a .Rdata/.rda file using load("dataset.rda")
Read data of a different format (.csv, .dta, .spss) into R using specific functions from packages like haven and readr
Loading data into R
Looking at your data
Cleaning and transforming your data
Use the haven and readr packages to read data from the web and stored locally on your computer
The code below downloads three years of daily state-level Covid data:
Please run the following
country = "US" tells the function we want data for the US
start = "2020-01-01" sets the start date for the data
end = "2022-12-31" sets the end date for the data
level = 2 tells the function we want state-level data
verbose = F tells the function not to print other stuff
covid <- COVID19::covid19(
  country = "US",
  start = "2020-01-01",
  end = "2022-12-31",
  level = 2,
  verbose = F
)
The first time you load a dataset into R, you should try to get a high-level overview (HLO) of the data
Tip
This is an iterative, dynamic process. It’s something you do “live” but won’t necessarily save in the final code you submit. Over time you’ll develop intuitions about what to look for, what questions to ask of your data, and what functions and code will help you answer these questions.
For example, I might want to know how many unique values the variable school_closing takes in the covid dataset.
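One way to answer a question like this is sketched below on a tiny made-up data frame. The name toy and its values are hypothetical stand-ins, not the real covid data; the same calls should work on covid$school_closing once the data are loaded.

```r
# Hypothetical stand-in for a policy variable with one missing value
toy <- data.frame(school_closing = c(0, 1, 1, 2, NA, 2, 3))

# The distinct values, in order of first appearance (NA counts as a value here)
unique(toy$school_closing)

# Number of distinct non-missing values
n_values <- length(unique(na.omit(toy$school_closing)))
n_values   # 4

# How often each value occurs, including missing values
table(toy$school_closing, useNA = "ifany")
```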
dim(data) gives you the dimensions (# of rows and columns)
View(data) opens data in a separate pane
print(data); data will display a truncated view of data in your console
glimpse(data) will show a transposed (columns and rows switched) version of data with information on variable type
head(data) shows you the first 6 rows
tail(data) shows you the last 6 rows
data$variable extracts variable from data
table(data$variable) creates a frequency table
table(data$variable1, data$variable2) creates a “crosstab” or contingency table
summary(data$variable) gives summary statistics
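These overview functions can be tried on any data frame. The sketch below uses mtcars, a dataset that ships with R, rather than the covid data, so it runs without downloading anything.

```r
# mtcars is a built-in dataset: one row per car model
data("mtcars")

# Dimensions: rows, then columns
dim(mtcars)                 # 32 11

# head() and tail() return 6 rows by default
first_rows <- head(mtcars)
nrow(first_rows)            # 6

# Extract one variable and tabulate it
cyl_table <- table(mtcars$cyl)
cyl_table                   # counts of cars with 4, 6, and 8 cylinders

# Crosstab of two variables: cylinders by transmission type
table(mtcars$cyl, mtcars$am)

# Summary statistics for a numeric variable
summary(mtcars$mpg)
```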
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
For more check out R for Data Science
Tidy data is a standard way of mapping the meaning of a dataset to its structure.
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
dplyr functions for data wrangling
Today and this week we will begin learning some tools for selecting and transforming data:
select() to select columns from a dataframe
filter() to select rows from a dataframe when some statement is TRUE
mutate() to create new columns
case_when() to recode values when some statement is TRUE
summarise() to transform many values into one value
group_by() to create a grouped table so that other functions are applied separately to each group and then combined
%>% (“pipe” operator)
The %>% lets us chain functions together so we can read left to right
Becomes
Keyboard Shortcuts for %>%
macOS: CMD + SHIFT + M
PC: CTRL + SHIFT + M
n.b. |> is the base R or native pipe. It’s similar in function but also subtly different in execution to the %>% in ways that won’t matter for us right now.
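To see what reading left to right buys us, here is a minimal sketch using the native pipe so that no packages are needed (this assumes R 4.1 or later). In a simple chain like this one, %>% would read the same way once the tidyverse is loaded. The vector x is made up.

```r
x <- c(1, 4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read left to right, one step at a time
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

# Both give the same answer
nested                     # 2.5
identical(nested, piped)   # TRUE
```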
When social scientists talk about descriptive inference, we’re trying to summarize our data and make claims about what’s typical of our data
Here are some common ways of summarizing data and how to calculate them with R
Description | Usage |
---|---|
sum | sum(x) |
minimum | min(x) |
maximum | max(x) |
range | range(x) |
mean | mean(x) |
median | median(x) |
percentile | quantile(x) |
variance | var(x) |
standard deviation | sd(x) |
rank | rank(x) |
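Most of these functions have an argument called na.rm, which defaults to FALSE. If your data have missing values, you’ll need to set na.rm=T (e.g. mean(x, na.rm=T)); rank() is the exception and handles missing values through its na.last argument.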
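A quick sketch of why na.rm matters, using a made-up vector with one missing value:

```r
# Hypothetical data with a missing value
x <- c(2, 4, NA, 10)

# By default, one missing value makes the whole summary missing
mean_default <- mean(x)
mean_default                       # NA

# Dropping missing values first gives the mean of 2, 4, and 10
mean_clean <- mean(x, na.rm = TRUE)

# The same argument works for sum()
sum_clean <- sum(x, na.rm = TRUE)  # 16

# It is worth checking how many values were dropped
n_missing <- sum(is.na(x))         # 1
```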
Let’s spend the rest of class exploring what seems like a simple question:
On average, did states that adopted mask mandates have lower rates of new cases?
Get a high level overview of our data
Subset the data to just U.S. States
Recode our data to get a measure of new Covid cases and what face mask policy was in place
Summarize the average number of new cases by face mask policy.
#| label:"HLO"
#
confirmed
facial_coverings
take?
Goal: Subset our Covid data to include only the 50 states + DC
Steps:
Create a vector of the territories we don’t want
Use the filter() command to “filter” out these territories
Goal: We need new variables that describe:
the number of new Covid-19 cases on a given date
the face mask policy in place
Steps:
Use mutate(), group_by() and lag() to calculate new_cases from total confirmed cases
Use mutate(), case_when() and abs() to turn numeric facial_coverings into a categorical factor variable
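The first step can be sketched on a toy example. The data frame toy, its day column, and its values are made up for illustration; the sketch assumes rows are sorted by date within each state, which is why it calls arrange() first.

```r
library(dplyr)

# Hypothetical cumulative case counts for two made-up states
toy <- data.frame(
  state = c("A", "A", "A", "B", "B", "B"),
  day = c(1, 2, 3, 1, 2, 3),
  confirmed = c(10, 15, 25, 0, 4, 4)
)

toy_new <- toy %>%
  arrange(state, day) %>%     # make sure rows are in date order
  group_by(state) %>%         # so lag() never reaches across states
  mutate(new_cases = confirmed - lag(confirmed)) %>%
  ungroup()

# The first day in each state has no previous day, so new_cases is NA
toy_new$new_cases
# NA  5 10 NA  4  0
```

Without group_by(), the first day of state B would be compared to the last day of state A.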
Please run and comment the following code:
covid_us %>%
  mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    )
  ) -> covid_us
levels(factor(covid_us$face_masks))
[1] "No policy" "Recommended" "Required all times"
[4] "Required shared places" "Some requirements"
Make face_masks a factor to reflect the order of policies
covid_us %>%
  mutate(
    face_masks = factor(
      face_masks,
      levels = c(
        "No policy",
        "Recommended",
        "Some requirements",
        "Required shared places",
        "Required all times"
      )
    )
  ) -> covid_us
levels(covid_us$face_masks)
[1] "No policy" "Recommended" "Some requirements"
[4] "Required shared places" "Required all times"
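The same idea in miniature: by default factor() sorts levels alphabetically, and the levels argument lets us impose a meaningful order. The sizes vector here is a made-up example.

```r
# Hypothetical categories with a natural order
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
f_default <- factor(sizes)
levels(f_default)    # "large" "medium" "small"

# Explicit levels: the order we actually mean
f_ordered <- factor(sizes, levels = c("small", "medium", "large"))
levels(f_ordered)    # "small" "medium" "large"

# The underlying integer codes follow the level order
codes <- as.integer(f_ordered)
codes                # 1 3 2 1
```

Level order matters later on because tables and plots display categories in that order.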
Goal: On average, did states that adopted mask mandates have lower rates of new cases?
Steps: use filter(), group_by(), summarise() and mean() to calculate the average number of cases for each level of the face_masks policy variable
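Before running this on the full data, it can help to see the pattern on a toy example. The data frame toy, the column names, and the values below are hypothetical.

```r
library(dplyr)

# Hypothetical data: a policy label and new cases per capita
toy <- data.frame(
  policy = c("No policy", "No policy", "Required", "Required", "Required", NA),
  new_cases_pc = c(10, 20, 30, NA, 50, 5)
)

toy_summary <- toy %>%
  filter(!is.na(policy)) %>%      # drop rows with no recorded policy
  group_by(policy) %>%            # one group per policy level
  summarise(
    avg_new_cases = mean(new_cases_pc, na.rm = TRUE)
  )

toy_summary
# No policy averages 15; Required averages 40 (the NA is ignored)
```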
What should we conclude?
What’s wrong with this simple comparison?
What’s a better comparison? (Thursday)
Face Mask Policy | Average No. of New Cases |
---|---|
No policy | 10.26 |
Recommended | 16.61 |
Some requirements | 36.18 |
Required shared places | 29.38 |
Required all times | 32.18 |
# ---- Libraries ----
## Uncomment to install
# install.packages("tidyverse")
# install.packages("COVID19")
library("tidyverse")
library("COVID19")

# ---- Load data ----
load(url("https://pols1600.paultesta.org/files/data/covid.rda"))

# ---- Subset to US states and DC ----
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
)
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)

## Check subsetting
dim(covid)[1] > dim(covid_us)[1]

# ---- Recode covid_us ----
covid_us %>%
  mutate(
    state = administrative_area_level_2
  ) %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000
  ) %>%
  mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    )
  ) %>%
  mutate(
    face_masks = factor(
      face_masks,
      levels = c(
        "No policy",
        "Recommended",
        "Some requirements",
        "Required shared places",
        "Required all times"
      )
    )
  ) -> covid_us

# ---- Calculate new cases per capita by facemask policy ----
covid_us %>%
  filter(!is.na(face_masks)) %>%
  group_by(face_masks) %>%
  summarize(
    `Average No. of New Cases` = round(mean(new_cases_pc, na.rm = TRUE), 2)
  ) %>%
  rename(
    "Face Mask Policy" = face_masks
  ) -> face_mask_summary

face_mask_summary
After today, you should have a better sense of
How to write R code using Quarto and R Markdown
How to install packages and load libraries
Some of the different types and shapes of data
How to get a high level overview of your data
How to transform, recode, and summarise data using dplyr and the tidyverse
How to describe typical values and variation in data
How to explore substantive questions using these typical values
We covered A LOT
It’s OK to feel overwhelmed
Don’t worry if everything didn’t make sense.
POLS 1600