Data and Measurement
Updated May 31, 2024
If it’s your first time here you’ll need to work through Software Setup to follow along today
If you’re still on the waitlist on CAB, speak to me after class
“Uh ohhh, the Cavs are playing playoff basketball” pic.twitter.com/WrOrzeuEtW
— Bottlegate (@Bottlegate) April 21, 2017
Once you’ve done the following
You can see the available problem sets by running the following code in your console:
And start a specific tutorial by running:
Please upload tutorials 00-intro and 01-measurement1 to Canvas by Friday
It’s cousin Nick’s fault…
Funny icebreaker, but lots of assumptions…
You’re not a murderer
You don’t know someone who’s committed a murder or been murdered
You’ve got a mom and dad
How might we make this question better?
What questions we ask and how we ask them matters
What are we excited about?
What are we worried about?
R, R Studio and Quarto
Getting set up to work in R
Basic Programming in R
R is an open source statistical programming language (cheatsheet)
R Studio is an integrated development environment (IDE) that makes working in R much easier (cheatsheet)
Quarto is a publishing system that allows us to write and present code in different formats (cheatsheet)
Go to class content for current week
Open slides in browser
Open R Studio
Create a .qmd file titled wk01-notes.qmd and save it in your course folder
Get set up to work
Take notes and follow along
R is an interpreter (>)
“Everything that exists in R is an object”
“Everything that happens in R is the result of a function”
Data come in different types, shapes, and sizes
Packages make R great
Enter commands line-by-line in the console
The > means R is ready for a command
The + means your last command isn’t complete
If you get stuck at the +, use your escape key!
Send code from .qmd file to the console:
Ctrl + Enter (PC) | Cmd + Return (Mac) -> run current line
Ctrl + Shift + Enter (PC) | Cmd + Shift + Return (Mac) -> run all code in current chunk
Operator | Description | Usage |
---|---|---|
+ | addition | x + y |
- | subtraction | x - y |
* | multiplication | x * y |
/ | division | x / y |
^ | raised to the power of | x ^ y |
abs | absolute value | abs(x) |
%/% | integer division | x %/% y |
%% | remainder after division | x %% y |
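To see the arithmetic operators in action, here is a minimal sketch you can paste into the console. The values of x and y are arbitrary choices for illustration.

```r
# Two arbitrary numbers to experiment with
x <- 17
y <- 5

x + y    # addition
x - y    # subtraction
x * y    # multiplication
x / y    # division
x ^ 2    # x raised to the power of 2
abs(-x)  # absolute value

quotient  <- x %/% y  # integer division: how many whole times y fits into x (3)
remainder <- x %% y   # what is left over after integer division (2)

# Together, the two pieces recover the original number
y * quotient + remainder == x
```

Integer division and the remainder are handy for things like turning a day count into weeks plus leftover days.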
Operator | Description | Usage |
---|---|---|
& | and | x & y |
| | or | x | y |
xor | exactly x or y | xor(x, y) |
! | not | !x |
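The logical operators work element by element on vectors. A small sketch, using two made-up logical vectors (passed_midterm and passed_final are hypothetical names):

```r
# Hypothetical results for four students
passed_midterm <- c(TRUE, TRUE, FALSE, FALSE)
passed_final   <- c(TRUE, FALSE, TRUE, FALSE)

both     <- passed_midterm & passed_final      # TRUE only if both are TRUE
either   <- passed_midterm | passed_final      # TRUE if at least one is TRUE
one_only <- xor(passed_midterm, passed_final)  # TRUE if exactly one is TRUE
not_mid  <- !passed_midterm                    # flips TRUE and FALSE

both
either
one_only
not_mid
```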
Operator | Description | Usage |
---|---|---|
< | less than | x < y |
<= | less than or equal to | x <= y |
> | greater than | x > y |
>= | greater than or equal to | x >= y |
== | exactly equal to | x == y |
!= | not equal to | x != y |
%in% | group membership* | x %in% y |
is.na | is missing | is.na(x) |
!is.na | is not missing | !is.na(x) |
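Comparisons return TRUE or FALSE for each element, which is what makes them useful for filtering later on. A short sketch with made-up values (ages and states are hypothetical):

```r
# Hypothetical data with one missing value
ages   <- c(18, 25, NA, 67)
states <- c("RI", "MA", "PR")

at_least_25 <- ages >= 25  # comparing against NA gives NA, not FALSE
exactly_25  <- ages == 25

# %in% asks, for each element on the left, whether it appears on the right
new_england    <- c("CT", "ME", "MA", "NH", "RI", "VT")
in_new_england <- states %in% new_england

missing_age <- is.na(ages)

# Keep only the non-missing ages before averaging
mean_known_age <- mean(ages[!is.na(ages)])

at_least_25
in_new_england
missing_age
mean_known_age
```

Note that `ages == NA` does not test for missingness; use is.na() instead.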
Source: Gaurav Tiwari
Name | “Size” | Type of Data | R code |
---|---|---|---|
scalar | 1 | numeric, character, factor, logical | x <- 5 |
vector | N elements: length(x) | all the same | v <- c(1, 2, T, "false") |
matrix | N rows by K columns: dim(x) | all the same | m <- matrix(y,2,2) |
array | N rows by K columns by J dimensions: dim(x) | all the same | a <- array(m,c(2,2,3)) |
data frames | N row by K column matrix | can be different | d <-data.frame(x=x, y=y) |
tibbles | N row by K column matrix | can be different | d <-tibble(x=x, y=y) |
lists | can vary | can be different | l <-list(x,y,m,a,d) |
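The table's examples can be run more or less as written. The sketch below fills in a value for y (the table leaves it undefined, so 1:4 is an assumption) and uses made-up column names for the data frame.

```r
x <- 5                        # scalar: really a vector of length 1

# Vectors hold one type, so mixing types coerces everything to character
v <- c(1, 2, TRUE, "false")
class(v)

y <- 1:4                      # assumed values, just for illustration
m <- matrix(y, nrow = 2, ncol = 2)
dim(m)                        # 2 rows, 2 columns

# The 4 values of m are recycled to fill three 2 x 2 layers
a <- array(m, dim = c(2, 2, 3))
dim(a)

# Data frames let each column have its own type
d <- data.frame(id = 1:2, name = c("a", "b"))
sapply(d, class)

# Lists can hold anything, including other structures
l <- list(x, v, m, a, d)
length(l)
```

The vector v is a good reminder to check class() when numbers refuse to behave like numbers.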
<- is the assignment operator that assigns a value to a name
c() is the combine function that combines elements together
install.packages() installs packages
library() loads packages you’ve installed so you can use functions and data that are part of that package
Three sources of functions:
Base R: <-; mean(x); library("package_name")
Packages: install.packages("packageName"); remotes::install_github("user/repository")
Functions you write yourself: my_function <- function(x){x^2}
They have:
names
ingredients (inputs)
steps that tell you what to do with the ingredients (statements/code)
tasty results from applying those steps to given ingredients (outputs)
can_x_kick_it <- function(x){
  # Determine if x can kick it
  # If x in A Tribe Called Quest
  if(x %in% c("Q-Tip","Phife Dawg",
              "Ali Shaheed Muhammad",
              "Jarobi White")){
    return("Yes you can")
  }else{
    return("Before this, did you really know what live was?")
  }
}
can_x_kick_it("Q-Tip")
[1] "Yes you can"
[1] "Before this, did you really know what live was?"
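The function above is written for one name at a time: if() expects a single TRUE or FALSE, and recent versions of R throw an error when the condition has more than one element. As a sketch of one way to check several names at once, the hypothetical can_each_kick_it() below swaps if/else for the vectorized ifelse(). This is an illustrative variation, not part of the original slides.

```r
# Members of A Tribe Called Quest, as in the function above
members <- c("Q-Tip", "Phife Dawg", "Ali Shaheed Muhammad", "Jarobi White")

# Hypothetical vectorized variant: returns one answer per element of x
can_each_kick_it <- function(x) {
  ifelse(
    x %in% members,
    "Yes you can",
    "Before this, did you really know what live was?"
  )
}

answers <- can_each_kick_it(c("Q-Tip", "Paul"))
answers
```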
Each time you start a project in R, you will want to:
Set your working directory
Load (and if needed, install) the R packages you will use
Set any “global” options you want
Load the data you’ll be using
Install packages once with install.packages("package_name")
Load packages every session with library("package_name")
Let’s install the tidyverse and COVID19 packages.
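One way to honor "install once, load every session" is to install only when a package is missing. The helper below, ensure_package(), is a hypothetical name and a minimal sketch rather than a standard function. It is demonstrated on the built-in stats package so that nothing is actually downloaded.

```r
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is now available
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so this call installs nothing
ensure_package("stats")

# For this class you might run the following once, then load as usual:
# ensure_package("tidyverse")
# ensure_package("COVID19")
# library("tidyverse")
# library("COVID19")
```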
Keyboard shortcuts to toggle # comments:
macOS: CMD + SHIFT + C
PC: CTRL + SHIFT + C
Load the tidyverse and COVID19 packages
Here are the global options for these slides:
# Options for these slides
knitr::opts_chunk$set(
  warning = FALSE,      # Don't display warnings
  message = FALSE,      # Don't display messages
  comment = NA,         # No prefix before line of text
  dpi = 300,            # Figure resolution
  fig.align = "center", # Figure alignment
  out.width = "80%",    # Figure width
  cache = FALSE         # Don't cache code chunks
)
Loading data into R
Looking at your data
Cleaning and transforming your data
There are three ways to load data.
data("dataset") will load the dataset named “dataset”
data() will list all datasets
load("dataset.rda") will load a dataset saved as an .rda file
Use the haven and readr packages to read data from the web or stored locally on your computer
The code below downloads two years of daily state-level Covid data:
Please run the following:
country = "US" tells the function we want data for the US
start = "2020-01-01" sets the start date for the data
end sets the end date for the data
level = 2 tells the function we want state-level data
verbose = F tells the function not to print other stuff
Anytime you load data into R, try some combination of the following to get a high-level overview (HLO) of the data
dim(data) gives you the dimensions (# of rows and columns)
View(data) opens data in a separate pane
print(data); data will display a truncated view of data in your console
glimpse(data) will show a transposed (switch columns and rows) version of data with information on variable type
head(data) shows you the first 6 rows
tail(data) shows you the last 6 rows
data$variable extracts variable from data
table(data$variable) creates a frequency table
summary(data$variable) gives summary statistics (including the number of NAs)
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
For more check out R for Data Science
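The high-level overview functions listed earlier can be tried on any data frame. As a sketch, here they are applied to mtcars, a small dataset that ships with R. Using mtcars rather than the Covid data is a choice made for illustration so the example runs without downloading anything. View() and glimpse() are left out because they need RStudio and dplyr respectively, and str() stands in as a base R relative of glimpse().

```r
data("mtcars")        # load a built-in dataset

dim(mtcars)           # 32 rows, 11 columns
head(mtcars)          # first 6 rows by default
tail(mtcars, n = 3)   # last 3 rows; n changes how many you see
str(mtcars)           # compact overview of each variable and its type

cylinders <- mtcars$cyl   # extract a single variable
cyl_table <- table(cylinders)
cyl_table                 # frequency table of cylinder counts

summary(mtcars$mpg)       # min, quartiles, mean, max
```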
Tidy data is a standard way of mapping the meaning of a dataset to its structure.
A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
dplyr functions for data wrangling
Today and this week we will begin learning some tools for selecting and transforming data:
select() to select columns from a dataframe
filter() to select rows from a dataframe when some statement is TRUE
mutate() to create new columns
case_when() to recode values when some statement is TRUE
summarise() to transform many values into one value
group_by() to create a grouped table so that other functions are applied separately to each group and then combined
%>% (“pipe” operator)
The %>% lets us chain functions together so we can read left to right
Becomes
Keyboard shortcuts for %>%:
macOS: CMD + SHIFT + M
PC: CTRL + SHIFT + M
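To make the dplyr verbs and the pipe concrete, here is a minimal sketch on a made-up data frame. The data frame toy and its columns are hypothetical, and the example assumes the dplyr package is installed. It computes the same quantity twice, once with nested function calls and once with %>%, so you can compare how each reads.

```r
library(dplyr)

# Hypothetical toy data: cases and population for three states
toy <- data.frame(
  state = c("RI", "RI", "MA", "MA", "CT"),
  year  = c(2020, 2021, 2020, 2021, 2021),
  cases = c(10, 30, 40, 80, 50),
  pop   = c(100, 100, 400, 400, 250)
)

# Nested calls: you have to read from the inside out
nested <- summarise(
  mutate(filter(select(toy, state, year, cases, pop), year == 2021),
         rate = cases / pop),
  avg_rate = mean(rate)
)

# Piped calls: read top to bottom, one step per line
piped <- toy %>%
  select(state, year, cases, pop) %>%  # keep these columns
  filter(year == 2021) %>%             # keep rows where the statement is TRUE
  mutate(rate = cases / pop) %>%       # create a new column
  summarise(avg_rate = mean(rate))     # collapse many values into one

# group_by() makes summarise() work separately within each state
by_state <- toy %>%
  group_by(state) %>%
  summarise(total_cases = sum(cases))

nested
piped
by_state
```

Both versions give the same answer; the piped version just puts the steps in the order you would describe them out loud.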
When social scientists talk about descriptive inference, we’re trying to summarize our data and make claims about what’s typical of our data
Here are some common ways of summarizing data and how to calculate them with R
Description | Usage |
---|---|
sum | sum(x) |
minimum | min(x) |
maximum | max(x) |
range | range(x) |
mean | mean(x) |
median | median(x) |
percentile | quantile(x) |
variance | var(x) |
standard deviation | sd(x) |
rank | rank(x) |
Most of these functions (all except rank()) have an argument called na.rm, which defaults to FALSE. If your data have missing values, you’ll need to set na.rm=T (e.g. mean(x, na.rm=T))
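A quick sketch of why na.rm matters, using a made-up vector with one missing value:

```r
# Hypothetical data with one missing value
x <- c(2, 4, NA, 10)

mean_with_na    <- mean(x)                # NA: R won't guess the missing value
mean_without_na <- mean(x, na.rm = TRUE)  # drops the NA first: 16 / 3
total           <- sum(x, na.rm = TRUE)   # 16
n_missing       <- sum(is.na(x))          # how many values are missing: 1

mean_with_na
mean_without_na
total
n_missing
```

T and F work as shorthand for TRUE and FALSE, but they are ordinary names that can be overwritten, so spelling out TRUE and FALSE is the safer habit.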
Let’s spend the rest of class exploring what seems like a simple question:
On average, did states that adopted mask mandates have lower rates of new cases?
Get a high level overview of our data
Subset the data to just U.S. States
Recode our data to get a measure of new Covid cases and what face mask policy was in place
Summarize the average number of new cases by face mask policy.
#| label:"HLO"
#
confirmed
facial_coverings
take?
Goal: Subset our Covid data to include only the 50 states + DC
Steps:
Create a vector of the territories we don’t want
Use the filter() command to “filter” out these territories
Goal: We need new variables that describe:
the number of new Covid-19 cases on a given date
the face mask policy in place
Steps:
Use mutate(), group_by() and lag() to calculate new_cases from total confirmed cases
Use mutate(), case_when() and abs() to turn numeric facial_coverings into a categorical factor variable
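The first step is easier to follow on a tiny example. The sketch below uses a made-up data frame (toy, with hypothetical cumulative counts for two states) and assumes dplyr is installed. It shows why group_by() has to come before lag(): without grouping, the first day for the second state would be compared to the last day of the first.

```r
library(dplyr)

# Hypothetical cumulative case counts, three days for each of two states
toy <- data.frame(
  state     = c("RI", "RI", "RI", "MA", "MA", "MA"),
  date      = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), 2)),
  confirmed = c(1, 3, 6, 10, 15, 15)
)

# Grouped: lag() looks back one row *within* each state
grouped <- toy %>%
  group_by(state) %>%
  mutate(new_cases = confirmed - dplyr::lag(confirmed)) %>%
  ungroup()

# Ungrouped: MA's first day is compared to RI's last day (10 - 6 = 4)
ungrouped <- toy %>%
  mutate(new_cases = confirmed - dplyr::lag(confirmed))

grouped$new_cases    # NA 2 3 NA 5 0
ungrouped$new_cases  # NA 2 3  4 5 0
```

The NA on each state's first day is expected: there is no previous day to subtract. This also assumes the rows are already sorted by date within each state.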
Please run and comment the following code:
covid_us %>%
  mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times",
    )
  ) -> covid_us
levels(factor(covid_us$face_masks))
[1] "No policy" "Recommended" "Required all times"
[4] "Required shared places" "Some requirements"
Make face_masks a factor to reflect the order of policies
covid_us %>%
  mutate(
    face_masks = factor(
      face_masks,
      levels = c(
        "No policy",
        "Recommended",
        "Some requirements",
        "Required shared places",
        "Required all times"
      )
    )
  ) -> covid_us
levels(covid_us$face_masks)
[1] "No policy" "Recommended" "Some requirements"
[4] "Required shared places" "Required all times"
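The same idea in miniature: by default factor() sorts levels alphabetically, which is rarely the substantive order. A base R sketch with a made-up character vector (policy is hypothetical and uses only three of the five labels):

```r
# Hypothetical policy labels
policy <- c("Some requirements", "No policy", "Required all times",
            "Some requirements")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(policy))
default_levels

# Explicit levels: ordered from least to most restrictive
ordered_policy <- factor(
  policy,
  levels = c("No policy", "Some requirements", "Required all times")
)
levels(ordered_policy)

# Tables and plots follow the level order, not the alphabet
table(ordered_policy)

# Behind the scenes each label is stored as its position in the levels
policy_codes <- as.integer(ordered_policy)
policy_codes
```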
Goal: On average, did states that adopted mask mandates have lower rates of new cases?
Steps: use filter(), group_by(), summarise() and mean() to calculate the average number of cases for each level of the face_masks policy variable
What should we conclude?
What’s wrong with this simple comparison?
What’s a better comparison? (Thursday)
Face Mask Policy | Average No. of New Cases |
---|---|
No policy | 10.26 |
Recommended | 16.61 |
Some requirements | 36.18 |
Required shared places | 29.38 |
Required all times | 32.18 |
# ---- Libraries ----
## Uncomment to install
# install.packages("tidyverse")
# install.packages("COVID19")
library("tidyverse")
library("COVID19")
# ---- Load data ----
load(url("https://pols1600.paultesta.org/files/data/covid.rda"))
# ---- Subset to US states and DC ----
territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
)
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)
## Check subsetting
dim(covid)[1] > dim(covid_us)[1]
# ---- Recode covid_us ----
covid_us %>%
  mutate(
    state = administrative_area_level_2,
  ) %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population * 100000
  ) %>%
  mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    )
  ) %>%
  mutate(
    face_masks = factor(
      face_masks,
      levels = c(
        "No policy",
        "Recommended",
        "Some requirements",
        "Required shared places",
        "Required all times"
      )
    )
  ) -> covid_us
# ---- Calculate new cases per capita by facemask policy ----
covid_us %>%
  filter(!is.na(face_masks)) %>%
  group_by(face_masks) %>%
  summarize(
    `Average No. of New Cases` = round(mean(new_cases_pc, na.rm = T), 2)
  ) %>%
  rename(
    "Face Mask Policy" = face_masks
  ) -> face_mask_summary
face_mask_summary
After today, you should have a better sense of
How to write R code using Quarto and R Markdown
How to install packages and load libraries
Some of the different types and shapes of data
How to get a high level overview of your data
How to transform, recode, and summarise data using dplyr and the tidyverse
How to describe typical values and variation in data
How to explore substantive questions using these typical values
We covered A LOT
It’s OK to feel overwhelmed
Don’t worry if everything didn’t make sense.
POLS 1600