Submit tutorials from last week for full credit by this Sunday.
Groups for the course will be assigned next week.
Setup
Setup for today
Libraries
This week we’ll use the following libraries.
the_packages <- c(
  ## R Markdown
  "tinytex", "kableExtra",
  ## Tidyverse
  "tidyverse", "lubridate", "forcats", "haven", "labelled",
  ## Extensions for ggplot
  "ggmap", "ggrepel", "ggridges", "ggthemes", "ggpubr", "GGally",
  # Data
  "COVID19", "maps", "mapdata", "DT"
)
the_packages
R may ask you to install a package’s dependencies (other packages your package needs). Try entering the number 1 into your console.
R may tell you that you need to restart R. Try saying yes. If it doesn’t start downloading, say no.
R may then ask if you want to compile some packages from source. Type Y into your console. If this doesn’t work, try again, but this time type N when asked.
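One way to avoid reinstalling everything is to install only what is missing. The sketch below uses a shortened, illustrative subset of the package list; the names pkgs and missing_pkgs are made up for this example, and the install and load steps are commented out so nothing downloads by accident.

```r
# A shortened, illustrative subset of the packages listed above
pkgs <- c("tidyverse", "lubridate", "ggthemes")

# Which of these are not yet installed on this machine?
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
missing_pkgs

# Install only what is missing (uncomment to run; needs an internet connection)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

# Load each package; character.only = TRUE lets library() accept a string
# invisible(lapply(pkgs, library, character.only = TRUE))
```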
Loading the Covid-19 Data
Let’s load the Covid-19 data we worked with last week:
Copy and paste the “general part” of the error message into Google.
Knit your document after each completed code chunk
This will run the code from top to bottom, and stop when it encounters an error
Try commenting out the whole chunk, and then uncommenting successive lines of code
Be patient. Don’t be hard on yourself. Remember, errors are portals of discovery.
Semantic Errors
Your code runs, but doesn’t produce what you expected.
Less common; can be harder to identify and fix
One example: two packages have functions with the same name that do different things
# dplyr::summarize
# Hmisc::summarize
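A different pair with the same problem, chosen here because stats ships with R: once dplyr is loaded, filter() means dplyr::filter(), not stats::filter(). This is a small sketch with made-up data, not the summarize() case above.

```r
library(dplyr)  # after this, filter() refers to dplyr::filter(), masking stats::filter()

df <- data.frame(x = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(df, x > 3)
kept

# stats::filter() applies a linear filter (here, a 3-point moving average)
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing the package name explicitly makes it clear which of the two you mean, whatever order the packages were loaded in.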
Semantic Errors
Some general solutions/practices to avoid semantic errors:
Specify the package and the function you want: package_name::function_name()
Write helpful comments in your code.
Include “sanity” checks in your code.
If a function should produce an output that’s a data.frame, check to see if it is a data frame
# Here's some pseudo code:
# I expect my_function produces a data frame
x <- my_function(y)
# Check to see if x is a data frame
# If x is not a data frame, return an Error
stopifnot(is.data.frame(x))
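Here is a runnable version of that idea. my_function is a hypothetical stand-in defined only for this sketch; the point is the stopifnot() calls that follow it.

```r
# Hypothetical function that should always return a data frame
my_function <- function(n) {
  data.frame(id = seq_len(n), value = seq_len(n)^2)
}

x <- my_function(3)

# Sanity checks: R stops with an error if any expectation is FALSE
stopifnot(is.data.frame(x))
stopifnot(nrow(x) == 3, "value" %in% names(x))
```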
Data Wrangling in R
Why do we need to “wrangle” data?
Rarely, if ever, do we get data in the exact format we need.
Instead, before we can get to work, we often need to transform our data in various ways
Sometimes called:
Data cleaning/recoding
Data wrangling
Data carpentry
The end goal is the same: make messy data tidy
Tidy data
Every column is a variable.
Every row is an observation.
Every cell is a single value
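A small sketch of what "tidy" means in practice, using made-up numbers and tidyr's pivot_longer() (tidyr is part of the tidyverse). In the messy version the year is hidden in the column names; in the tidy version each row is one state-year observation.

```r
library(tidyr)

# Made-up "messy" data: one column per year
messy <- data.frame(
  state = c("Rhode Island", "New York"),
  `2020` = c(10, 200),
  `2021` = c(15, 250),
  check.names = FALSE
)

# Tidy version: one row per state-year, one column per variable
tidy <- pivot_longer(
  messy,
  cols = c(`2020`, `2021`),
  names_to = "year",
  values_to = "cases"
)
tidy
```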
Tools for transforming our data
Last week we used the following functions:
read_csv() and data() to read and load data in R
logical and comparison operators like &, |, %in%, ==, !=, >, >=, <, <= to make comparisons
the pipe command %>% to “pipe” the output of one function into another
filter() to pick observations (rows) by their values
arrange() to reorder rows
select() to pick variables by their names
mutate() and case_when() to create new variables in our data set
summarise() to collapse many values into a single value (like a mean or median)
group_by() to apply functions like mutate() and summarise() on a group-by-group basis
Common functions for transforming data
All of these “verb” functions from the dplyr package (e.g. filter(), mutate()) follow a similar format:
Their first argument is a data frame
The subsequent arguments tell R what to do with the data frame, using the variable names (without quotes)
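To see that shared format, here is a sketch on a made-up data frame called toy (not the Covid-19 data). Each verb takes the data frame first and refers to columns by bare name.

```r
library(dplyr)

# Made-up data for illustration
toy <- data.frame(
  state = c("A", "A", "B", "B"),
  cases = c(10, 30, 5, 15)
)

# Data frame first, then instructions that use column names without quotes
filter(toy, cases > 8)
mutate(toy, cases_doubled = cases * 2)

# Same pattern with the pipe: the data frame is passed in as the first argument
by_state <- toy %>%
  group_by(state) %>%
  summarise(mean_cases = mean(cases))
by_state
```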
To work with the Covid-19 data we did the following:
Subsetted/Filtered the data to exclude US Territories
Created new variables from existing variables in the data to use in our final analysis
Wrangling the Covid-19 data
Specifically, we did the following:
Created an object called territories that is a vector containing the names of U.S. territories
Created a new dataframe, called covid_us, by filtering out observations from the U.S. territories
Created a state variable that is a copy of the administrative_area_level_2 variable
Created a variable called new_cases from the confirmed variable
Created a variable called new_cases_pc that is the number of new Covid-19 cases per 100,000 citizens
Created a variable called face_masks from the facial_coverings variable.
Calculated the average number of new cases, by different levels of face_masks
Let’s take some time to make sure we understand everything that was happening.
Use the filter() command to select only the rows where the administrative_area_level_2 is not (!) in (%in%) the territories object
# - 2. Create covid_us data frame
# How many rows and columns in covid
dim(covid)
[1] 58809 47
# Filter out obs from US territories
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)
# covid_us should have fewer rows than covid
dim(covid_us)
Take the difference between a given day’s value of confirmed and yesterday’s value of confirmed to create a measure of new_cases on a given date for each state
Note
Use lag() to shift values in a column down one row in the data
Use group_by() to respect the state-date structure of the data
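A minimal sketch of this step on made-up data, assuming a cumulative confirmed column and a population column as in the Covid-19 data. The toy values and the object names toy and toy_new are illustrative only.

```r
library(dplyr)

# Made-up cumulative counts for two hypothetical states
toy <- data.frame(
  state = rep(c("A", "B"), each = 3),
  date = rep(as.Date("2021-01-01") + 0:2, times = 2),
  confirmed = c(100, 110, 125, 50, 50, 70),
  population = c(1e6, 1e6, 1e6, 5e5, 5e5, 5e5)
)

toy_new <- toy %>%
  group_by(state) %>%                      # lag within each state, not across states
  arrange(date, .by_group = TRUE) %>%      # make sure rows are in date order
  mutate(
    new_cases = confirmed - lag(confirmed),          # today minus yesterday
    new_cases_pc = new_cases / population * 100000   # per 100,000 people
  )
toy_new
```

The first date in each state has no previous day, so new_cases is NA there. Without group_by(), the first row of state B would be compared with the last row of state A.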
# Check recoding
covid_us %>%
  # Look at two states
  filter(state == "Rhode Island" | state == "New York") %>%
  # In a small date range
  filter(date > "2021-01-01" & date < "2021-01-05") %>%
  # Select only the columns we want
  select(state, date, new_cases, new_cases_pc) -> hlo_df
# save to object hlo_df
hlo_df
# A tibble: 6 × 4
# Groups:   state [2]
  state        date       new_cases new_cases_pc
  <chr>        <date>         <int>        <dbl>
1 Rhode Island 2021-01-02         0          0
2 Rhode Island 2021-01-03         0          0
3 Rhode Island 2021-01-04      4759        449.
4 New York     2021-01-02     15849         81.5
5 New York     2021-01-03     12232         62.9
6 New York     2021-01-04     11242         57.8
Create a variable called face_masks from the facial_coverings that describes the face mask policy experienced by most people in a given state on a given date.
Note
Use case_when() inside of mutate() to create a variable that takes certain values when certain logical statements are true
Setting the levels = c(value1, value2, etc.) argument in factor() lets us control the ordering of categorical/character data.
Recall that the facial_coverings variable took on a range of substantive values from 0 to 4, but empirically could take both positive and negative values
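A sketch of how case_when(), abs(), and factor() can fit together on made-up values. The category labels and cut points below are illustrative assumptions, not necessarily the exact recoding used for the course data.

```r
library(dplyr)

# Made-up policy codes, including negative values and a missing value
toy <- data.frame(facial_coverings = c(0, 1, 2, -3, -4, NA))

toy <- toy %>%
  mutate(
    # abs() treats -4 and 4 as the same policy level
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) %in% c(2, 3) ~ "Some requirements",
      abs(facial_coverings) == 4 ~ "Required all times"
    ),
    # levels = sets the order of the categories, from least to most strict
    face_masks = factor(face_masks, levels = c(
      "No policy", "Recommended", "Some requirements", "Required all times"
    ))
  )
toy
```

Rows that match none of the conditions, such as the NA here, stay NA.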
covid_us %>%
  filter(state == "Illinois", date > "2020-9-28") %>%
  select(state, date, facial_coverings, face_masks) %>%
  slice(1:5)
# A tibble: 5 × 4
# Groups:   state [1]
  state    date       facial_coverings face_masks
  <chr>    <date>                <int> <fct>
1 Illinois 2020-09-29                2 Some requirements
2 Illinois 2020-09-30                2 Some requirements
3 Illinois 2020-10-01               -4 Required all times
4 Illinois 2020-10-02               -4 Required all times
5 Illinois 2020-10-03               -4 Required all times
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
At a minimum, a graphic contains three core components:
data: the dataset containing the variables of interest.
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
In R, we’ll implement this grammar of graphics using the ggplot2 package
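The three core components map directly onto code. This is a minimal sketch with made-up data; the data frame toy and its columns are illustrative only.

```r
library(ggplot2)

# data: a made-up data frame with the variables of interest
toy <- data.frame(
  day = 1:5,
  new_cases = c(3, 8, 5, 12, 9)
)

# aes: map day to the x position and new_cases to the y position
# geom: draw each observation as a point
p <- ggplot(data = toy, mapping = aes(x = day, y = new_cases)) +
  geom_point()

print(p)  # draws the plot
```

Swapping geom_point() for geom_line() keeps the same data and aesthetics but changes the geometric object.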
Let’s take a look at your feedback to last week’s survey and see how we can visualize some of the information you provided
Feedback
What we liked
What we disliked
Building that figure
Look at the raw data
Recode the raw data
Make a basic plot, telling R the data, aesthetics, geometries, and statistics I want it to plot
Tinker with the data and plot’s scales, coordinates, labels and theme to make the figure look better
1. Look at the raw data
df$trip
<labelled<double>[12]>: You're on a road trip with friends. Who controls the music?
[1] NA 3 1 2 3 2 2 2 1 2 2 NA
Labels:
 value label
     1 The driver, duh.
     2 The front seat, of course
     3 That jerk in the back who you don't even know but seems to have really strong feelings about Billy Joel's "Only the good die young"
# A tibble: 12 × 1
Playist
<fct>
1 <NA>
2 "That jerk in the back who you don't even know but seems to have really stro…
3 "The driver, duh."
4 "The front seat, of course"
5 "That jerk in the back who you don't even know but seems to have really stro…
6 "The front seat, of course"
7 "The front seat, of course"
8 "The front seat, of course"
9 "The driver, duh."
10 "The front seat, of course"
11 "The front seat, of course"
12 <NA>
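One way to get from the numeric codes to the labelled factor shown above, sketched in base R. The codes are copied from the printed output; the object names and the shortened third label are choices made for this example. For a labelled vector, haven::as_factor() does a similar conversion.

```r
# Numeric codes as printed above (NA = no response)
trip <- c(NA, 3, 1, 2, 3, 2, 2, 2, 1, 2, 2, NA)

# Value labels; the third label is shortened here for readability
trip_labels <- c(
  "The driver, duh.",
  "The front seat, of course",
  "That jerk in the back"
)

# Recode: each code 1-3 becomes its text label, NA stays NA
playlist <- factor(trip, levels = 1:3, labels = trip_labels)

# Count responses per category, including missing values
table(playlist, useNA = "ifany")
```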