POLS 1600

Data Visualization

Updated May 31, 2024

Overview

Class Plan

  • Announcements
  • Setup (5 minutes)
  • Review
    • Troubleshooting Errors (5 min)
    • Data wrangling in R (20 min)
    • Descriptive Statistics (10 min)
  • Data Visualization (40 min)
    • The grammar of graphics
    • Basic plots to describe:
      • Distributions
      • Associations

Announcements

  • Manuel’s office hours today

  • Paul’s office hours on Thursday

  • Submit tutorials from last week for full credit by this Sunday.

  • Groups for the course assigned next week

Setup

Setup for today

Libraries

This week we’ll use the following libraries.

the_packages <- c(
  ## R Markdown
  "tinytex", "kableExtra",
  
  ## Tidyverse
  "tidyverse","lubridate", "forcats", "haven","labelled",
  
  ## Extensions for ggplot
  "ggmap","ggrepel", "ggridges", "ggthemes","ggpubr",
  "GGally",
  
  # Data 
  "COVID19","maps","mapdata","DT"
)
the_packages
 [1] "tinytex"    "kableExtra" "tidyverse"  "lubridate"  "forcats"   
 [6] "haven"      "labelled"   "ggmap"      "ggrepel"    "ggridges"  
[11] "ggthemes"   "ggpubr"     "GGally"     "COVID19"    "maps"      
[16] "mapdata"    "DT"        

Installing and loading new packages

Next we’ll create a function called ipak (thanks Steven) which:

  • Takes a list of packages (pkg)
  • Checks to see if these packages are installed
  • Installs any new packages
  • Loads all the packages so we can use them
ipak <- function(pkg){
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg)) 
        install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}

Again, run this code on your machines

Installing and loading new packages

Finally, let’s use ipak to install and load the_packages

What should we replace some_function and some_input with to do this?

some_function(some_input)
ipak(the_packages)
   tinytex kableExtra  tidyverse  lubridate    forcats      haven   labelled 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
     ggmap    ggrepel   ggridges   ggthemes     ggpubr     GGally    COVID19 
      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
      maps    mapdata         DT 
      TRUE       TRUE       TRUE 
  • R may ask you to install a package’s dependencies (other packages your package needs). Try entering the number 1 into your console.
  • R may tell you that you need to restart R. Try saying yes. If it doesn’t start downloading, say no.
  • R may then ask if you want to compile some packages from source. Type Y into your console. If this doesn’t work, try again, but this time type N when asked.

Loading the Covid-19 Data

Let’s load the Covid-19 data we worked with last week:

load(url("https://pols1600.paultesta.org/files/data/covid.rda"))

Troubleshooting Errors

XKCD

Two kinds of errors:

  • Syntactic
    • R doesn’t understand how to run your code
    • Most common, easy to fix (eventually…)
  • Semantic
    • R runs your code but doesn’t give you the expected result
    • Less common, harder to fix

Most errors happen because R is looking for something that isn’t there.

More discussion here and here

Common Syntactic Errors

  • Unmatched parentheses or brackets

  • Misspelled a name

  • Forgot a comma

  • Forgot to install a package or load a library

  • Forgot to set the working directory/path to a file you want R to use.

  • Tried to select a column or row that doesn’t exist
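
To see what “R is looking for something that isn’t there” looks like in practice, the sketch below triggers two of these errors on purpose and captures the messages with tryCatch() so the script keeps running. The names my_data, toy, and not_a_column are made up for this example.

```r
# Both my_data and not_a_column are hypothetical and deliberately do not exist.

# 1. A misspelled or never-created object
msg_object <- tryCatch(
  print(my_data),
  error = function(e) conditionMessage(e)
)
msg_object

# 2. Selecting a column that doesn't exist
toy <- data.frame(x = 1:3)
msg_column <- tryCatch(
  toy[, "not_a_column"],
  error = function(e) conditionMessage(e)
)
msg_column
```

In both cases the message names the thing R could not find, which is usually the best clue for where to look.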

Fixing Syntactic Errors

  • RStudio’s script editor will show a red circle with a white x next to a line of code it thinks has an error in it.

  • Have someone else look at your code (fresh eyes, pair programming)

  • Copy and paste the “general part” of error message into Google.

  • Knit your document after each completed code chunk

    • This will run the code from top to bottom, and stop when it encounters an error
    • Try commenting out the whole chunk, and then uncommenting successive lines of code
  • Be patient. Don’t be hard on yourself. Remember, errors are portals of discovery.

Semantic Errors

  • Your code runs, but doesn’t produce what you expected.
  • Less common; can be harder to identify and fix
  • One example: Two packages have functions with the same name that do different things
# dplyr::summarize
# Hmisc::summarize
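
A name clash you can try without installing anything extra is filter(): stats::filter() applies a linear filter (here a moving average) to a numeric series, while dplyr::filter() picks rows of a data frame. The sketch below uses a made-up data frame called toy and writes the package name in front of each function so there is no ambiguity about which one runs.

```r
library(dplyr)

toy <- data.frame(x = 1:5)  # made-up data for illustration

# dplyr's filter(): keep rows where the condition is TRUE
rows_kept <- dplyr::filter(toy, x > 3)
rows_kept

# stats' filter(): a 3-point moving average of a numeric vector
moving_avg <- stats::filter(toy$x, rep(1 / 3, 3))
moving_avg
```

Same function name, very different results: one returns a two-row data frame, the other a smoothed series with missing values at the ends.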

Semantic Errors

  • Some general solutions/practices to avoid semantic errors:
    • Specify the package and the function you want: package_name::function_name()
    • Write helpful comments in your code.
    • Include “sanity” checks in your code.
    • If a function should produce an output that’s a data.frame, check to see if it is a data frame
# Here's some pseudo code:

# I expect my_function produces a data frame
x <- my_function(y) 

# Check to see if x is a data frame
# If x is not a data frame, return an Error
stopifnot(is.data.frame(x))

Data Wrangling in R

Why do we need to “wrangle” data?

  • Rarely, if ever, do we get data in the exact format we need.

  • Instead, before we can get to work, we often need to transform our data in various ways

  • Sometimes called:

    • Data cleaning/recoding
    • Data wrangling
    • Data carpentry
  • The end goal is the same: make messy data tidy

Tidy data

  • Every column is a variable.

  • Every row is an observation.

  • Every cell is a single value
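
To make these three rules concrete, here is a small sketch with invented numbers. The messy table hides a variable (the year) in its column names, so it is not tidy. One possible fix, assuming the tidyr package (part of the tidyverse), is pivot_longer(), which reshapes the data so each row is one state-year observation. The object names and values are made up for illustration and are not course data.

```r
library(tidyr)

# Made-up "messy" data: the year variable is hiding in the column names
messy <- data.frame(
  state = c("Rhode Island", "New York"),
  cases_2020 = c(10, 200),
  cases_2021 = c(15, 250)
)

# Tidy version: one row per state-year, one column per variable
tidy <- pivot_longer(
  messy,
  cols = c(cases_2020, cases_2021),
  names_to = "year",
  names_prefix = "cases_",
  values_to = "cases"
)
tidy
```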

Tools for transforming our data

Last week we used the following functions:

  • read_csv() and data() to read and load data in R

  • comparison and logical operators like &, |, %in%, ==, !=, >, >=, <, <= to make comparisons

  • the pipe command %>% to “pipe” the output of one function into another

  • filter() to pick observations (rows) by their values

  • arrange() to reorder rows

  • select() to pick variables by their names

  • mutate() and case_when() to create new variables in our data set

  • summarise() to collapse many values into a single value (like a mean or median)

  • group_by() to apply functions like mutate() and summarise() on a group-by-group basis

Common functions for transforming data

All of these “verb” functions from the dplyr package (e.g. filter(),mutate()) follow a similar format:

  1. Their first argument is a data frame
  2. The subsequent arguments tell R what to do with the data frame, using the variable names (without quotes)
  3. The output is a new data frame
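
A quick sketch of that shared format, using a made-up data frame called toy: each call takes a data frame first, refers to columns by their bare names, and returns a new data frame while leaving toy itself unchanged.

```r
library(dplyr)

toy <- data.frame(id = 1:4, score = c(50, 80, 65, 90))  # invented data

# 1. data frame first, 2. bare column names, 3. a data frame comes back
high <- filter(toy, score > 60)
rescaled <- mutate(toy, score_prop = score / 100)

is.data.frame(high)
is.data.frame(rescaled)
names(toy)  # the original still has only id and score
```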

More

You trying to get the %>%?

The pipe command %>%

  • The pipe command is a way of “chaining” lines of code together, piping the results of one tidyverse function into the next function.

  • The pipe command works because these functions always expect a data frame as their first argument, and always produce a data frame as their output.

The pipe command %>%

summarise(
  .data = df,
  mean = mean(var1, na.rm = T),
  median = median(var1, na.rm = T)
 )
# Rewrite with a pipe:

df %>% 
  summarize(
    mean = mean(var1, na.rm = T),
    median = median(var1, na.rm = T)    
  )

Wrangling the Covid-19 data

To work with the Covid-19 data we did the following:

  • Subsetted/Filtered the data to exclude US Territories
  • Created new variables from existing variables in the data to use in our final analysis

Wrangling the Covid-19 data

Specifically, we did the following:

  1. Created an object called territories that is a vector containing the names of U.S. territories
  2. Created a new dataframe, called covid_us, by filtering out observations from the U.S. territories
  3. Created a state variable that is a copy of the administrative_area_level_2 variable
  4. Created a variable called new_cases from the confirmed variable, and a variable called new_cases_pc that is the number of new Covid-19 cases per 100,000 residents
  5. Created a variable called face_masks from the facial_coverings variable.
  6. Calculated the average number of new cases, by different levels of face_masks

Let’s take some time to make sure we understand everything that was happening.

Created an object called territories

# - 1. Create territories object

territories <- c(
  "American Samoa",
  "Guam",
  "Northern Mariana Islands",
  "Puerto Rico",
  "Virgin Islands"
  )
  • The object territories now exists in our environment.

Created a new dataframe, called covid_us

  • Use the filter() command to select only the rows where the administrative_area_level_2 is not (!) in (%in%) the territories object
# - 2. Create covid_us data frame
# How many rows and columns in covid
dim(covid)
[1] 58809    47
# Filter out obs from US territories
covid_us <- covid %>%
  filter(!administrative_area_level_2 %in% territories)

# covid_us should have fewer rows than covid
dim(covid_us)
[1] 53678    47
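
If the !…%in% combination feels opaque, this small base R sketch shows each piece on its own. The vectors states and territories_toy are invented for the example.

```r
states <- c("Guam", "Rhode Island", "Puerto Rico", "Texas")  # made-up vector
territories_toy <- c("Guam", "Puerto Rico")

# TRUE where the element appears in territories_toy
is_territory <- states %in% territories_toy
is_territory

# ! flips TRUE and FALSE, so TRUE now marks the non-territories
!is_territory

# Keep only the elements that are not territories
kept <- states[!states %in% territories_toy]
kept
```

filter() does the same thing row by row: it keeps the rows where the logical statement is TRUE.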

Created a variable called state

Copy administrative_area_level_2 into a new variable called state

Note

Note that we have to save the output of mutate() back into covid_us for state to exist as a new column in covid_us

dim(covid_us)
[1] 53678    47
covid_us %>%
  mutate(
    state = administrative_area_level_2
  ) -> covid_us
dim(covid_us)
[1] 53678    48
names(covid_us)[48]
[1] "state"
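
Here is a minimal sketch of why the save step matters, using a made-up data frame called toy. Running mutate() without assigning the result prints a data frame with the new column, but toy itself does not change.

```r
library(dplyr)

toy <- data.frame(a = 1:3)  # invented data

# Not saved: the result is printed, then thrown away
toy %>% mutate(b = a * 2)
cols_before <- ncol(toy)
cols_before

# Saved: now the new column is part of toy
toy <- toy %>% mutate(b = a * 2)
cols_after <- ncol(toy)
cols_after
```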

Created a variable called state

Now there’s a new column in covid_us called state, that we can access by calling covid_us$state

covid_us$state[1:5] # Just show first 5 observations
[1] "Minnesota" "Minnesota" "Minnesota" "Minnesota" "Minnesota"

We could have done the same thing in “Base” R

covid_us$state <- covid_us$administrative_area_level_2

Why didn’t we?

  • Consistent preference for tidyverse > base R
  • Saves time when recoding lots of variables
  • mutate() plays nicely with functions like group_by()

Create a variable called new_cases from the confirmed variable

The confirmed variable contains a running total of confirmed cases in a given state on a given day.

Visualizing data helps us understand how we might need to transform our data

Visualize confirmed variable for Rhode Island

options(scipen = 999) # No scientific notation
covid_us %>% 
  filter(state == "Rhode Island") %>% 
  ggplot(aes(
    x = date,
    y = confirmed
  ))+
  geom_point()+
  theme_bw() +
  labs(title = "Total Covid-19 cases in Rhode Island",
       y = "Total Cases",
       x = "Date") -> fig_ri_covid

Create a variable called new_cases from the confirmed variable

Take the difference between a given day’s value of confirmed and yesterday’s value of confirmed to create a measure of new_cases on a given date for each state

Note

  • Use lag() to shift values in a column down one row in the data
  • Use group_by() to respect the state-date structure of the data
covid_us %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed)
  ) -> covid_us
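
To see what lag() and group_by() are doing, here is a toy version with two invented states and three days each. It assumes, as in the Covid data, that rows are already sorted by date within each state.

```r
library(dplyr)

# Made-up running totals for two hypothetical states
toy <- data.frame(
  state = c("A", "A", "A", "B", "B", "B"),
  confirmed = c(1, 3, 6, 10, 10, 15)
)

toy_new <- toy %>%
  group_by(state) %>%
  mutate(
    previous = dplyr::lag(confirmed),   # yesterday's total within the same state
    new_cases = confirmed - previous    # today minus yesterday
  ) %>%
  ungroup()
toy_new
```

The first row of each state is NA because there is no previous day to subtract. Without group_by(), the first row of state B would be compared with the last row of state A.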

Create a variable called new_cases_pc

  • Scale new_cases by population to create a per capita measure (new_cases_pc)

Note

We can create multiple variables in a single mutate() by separating lines of code with a ,

covid_us %>%
  mutate(
    state = administrative_area_level_2,
  ) %>%
  dplyr::group_by(state) %>%
  mutate(
    new_cases = confirmed - lag(confirmed),
    new_cases_pc = new_cases / population *100000
    ) ->covid_us
# Check recoding
covid_us %>% 
  # Look at two states
  filter(state == "Rhode Island" | state == "New York") %>% 
  # In a small date range
  filter(date > "2021-01-01" & date < "2021-01-05") %>% 
  # Select only the columns we want
  select(state, date, new_cases, new_cases_pc) -> hlo_df
# save to object hlo_df
hlo_df
# A tibble: 6 × 4
# Groups:   state [2]
  state        date       new_cases new_cases_pc
  <chr>        <date>         <int>        <dbl>
1 Rhode Island 2021-01-02         0          0  
2 Rhode Island 2021-01-03         0          0  
3 Rhode Island 2021-01-04      4759        449. 
4 New York     2021-01-02     15849         81.5
5 New York     2021-01-03     12232         62.9
6 New York     2021-01-04     11242         57.8

Created a variable called face_masks

Create a variable called face_masks from the facial_coverings variable that describes the face mask policy experienced by most people in a given state on a given date.

Note

  • Use case_when() inside of mutate() to create a variable that takes certain values when certain logical statements are true
  • Setting the levels = c(value1, value2, etc.) argument in factor() lets us control the ordering of categorical/character data.

Recall that the facial_coverings variable took on a range of substantive values from 0 to 4, but empirically could take both positive and negative values

table(covid_us$facial_coverings)

   -4    -3    -2    -1     0     1     2     3     4 
  410  5897  7362   275  3893  8604 17424  9191   622 
covid_us %>%
mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times",
    ) %>% factor(.,
      levels = c("No policy","Recommended",
                 "Some requirements",
                 "Required shared places",
                 "Required all times")
    ) 
    ) -> covid_us
covid_us%>%
  filter(state == "Illinois", date > "2020-9-28") %>%
  select(state, date, facial_coverings, face_masks) %>% 
  slice(1:5)
# A tibble: 5 × 4
# Groups:   state [1]
  state    date       facial_coverings face_masks        
  <chr>    <date>                <int> <fct>             
1 Illinois 2020-09-29                2 Some requirements 
2 Illinois 2020-09-30                2 Some requirements 
3 Illinois 2020-10-01               -4 Required all times
4 Illinois 2020-10-02               -4 Required all times
5 Illinois 2020-10-03               -4 Required all times
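
The same recoding pattern in miniature, with invented codes and shortened labels: abs() folds negative and positive codes together, case_when() assigns a label for each case, and factor() fixes the order of the labels.

```r
library(dplyr)

toy <- data.frame(code = c(0, 2, -2, 1, -1))  # invented policy codes

toy <- toy %>%
  mutate(
    policy = case_when(
      code == 0 ~ "None",
      abs(code) == 1 ~ "Recommended",
      abs(code) == 2 ~ "Required"
    ),
    # Without levels =, the categories would be ordered alphabetically
    policy = factor(policy, levels = c("None", "Recommended", "Required"))
  )

table(toy$policy)
```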

Additional recoding

In last week’s lab, we also added the following

covid_us %>%
  mutate(
    year = year(date),
    month = month(date),
    year_month = paste(
      year, 
      str_pad(month, width = 2, pad=0), 
      sep = "-"
      ),
    percent_vaccinated = people_fully_vaccinated/population*100  
    ) -> covid_us

Working with dates

R treats dates differently

covid_us$date[1:3]
[1] "2020-01-01" "2020-01-02" "2020-01-03"
class(covid_us$date)
[1] "Date"

If R knows a variable is a date, we can extract components of that date, using functions from the lubridate package

year(covid_us$date[1:3])
[1] 2020 2020 2020
month(covid_us$date[1:3])
[1] 1 1 1

The str_pad() and paste() functions

  • The str_pad() function lets us ‘pad’ strings so that they’re all the same width
month(covid_us$date[1:3])
[1] 1 1 1
str_pad(month(covid_us$date[1:3]), width=2, pad = 0)
[1] "01" "01" "01"
  • The paste function lets us paste objects together.
paste(year(covid_us$date[1:3]),
      str_pad(month(covid_us$date[1:3]), width=2, pad = 0),
      sep = "-"
      )
[1] "2020-01" "2020-01" "2020-01"
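
As an aside, base R can build the same year-month string in one step with format(), as long as the variable is stored as a Date. This is an alternative sketch, not what we did in the lab, and the dates below are made up.

```r
dates <- as.Date(c("2020-01-15", "2021-11-03"))  # made-up dates

# %Y is the four-digit year, %m is the zero-padded month
year_month_alt <- format(dates, "%Y-%m")
year_month_alt
```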

Summarizing the average number of new_cases by face_mask policy

Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect

Note

  • The group_by() command will do each calculation inside of summarise() for each level of the grouping variable
covid_us %>%
  filter(!is.na(face_masks)) %>%
  group_by(face_masks) %>%
  summarize(
    new_cases_pc = mean(new_cases_pc, na.rm=T)
  ) -> face_mask_summary
face_mask_summary
# A tibble: 5 × 2
  face_masks             new_cases_pc
  <fct>                         <dbl>
1 No policy                      10.3
2 Recommended                    16.6
3 Some requirements              36.2
4 Required shared places         29.4
5 Required all times             32.2

Summarizing the average number of new_cases by face_mask policy by month

Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect for each year_month in our dataset

Note

  • The group_by() command can group on multiple variables
covid_us %>%
  group_by(face_masks, year_month) %>%
  summarize(
    new_cases_pc = mean(new_cases_pc, na.rm=T)
  ) -> cases_by_month_and_policy
cases_by_month_and_policy
# A tibble: 102 × 3
# Groups:   face_masks [5]
   face_masks year_month new_cases_pc
   <fct>      <chr>             <dbl>
 1 No policy  2020-01        0.000463
 2 No policy  2020-02        0.00188 
 3 No policy  2020-03        1.70    
 4 No policy  2020-04        6.50    
 5 No policy  2022-04       19.8     
 6 No policy  2022-05       20.4     
 7 No policy  2022-06       37.6     
 8 No policy  2022-07       36.2     
 9 No policy  2022-08       35.7     
10 No policy  2022-09       19.0     
# ℹ 92 more rows
# In base R:
mean(
  covid_us$new_cases_pc[
    covid_us$face_masks == "No policy" &
      covid_us$year_month == "2020-01"], na.rm = T)
[1] 0.0004626161

Concept check

Suppose you want to do the following. What function or functions would you use?

  • Read data into R
  • Look at the data to get a high level overview of its structure
  • Subset or filter the data to include just observations with certain values
  • Select specific columns from data
  • Add new columns to the data
  • Summarize multiple values by collapsing them into a single value
  • Apply some function group-by-group

Concept check

Suppose you want to do the following. What function or functions would you use?

  • Read data into R
    • read_xxx() (tidy), read.xxx() (base)
  • Look at the data to get a high level overview of its structure
    • head(), tail(), glimpse(), table(), summary(), View()
  • Subset the data to include just observations with certain values
    • data %>% filter(x > 0), data[data$x > 0, ], subset(data, x > 0)
  • Select specific columns from data
    • data$variable, data %>% select(variable1, variable2), data[,c("x1","x2")]
  • Add new columns to the data
    • data %>% mutate(x = y/10), data$x <- data$y/10
  • Summarize multiple values by collapsing them into a single value
    • data %>% summarise(x_mn = mean(x, na.rm=T))
  • Apply some function group-by-group
    • data %>% group_by(g) %>% summarise(x_mn = mean(x, na.rm=T))

Concept check

Should you know exactly how to do all of this?

NO! Of course not. For Pete’s sake, Paul, it’s only the second week

Will you learn how to do much of this?

Maybe, but I’m feeling pretty overwhelmed…

How will you learn how to do these things?

With lots of practice, patience, and repetition motivated by a sense that these skills will help me learn about things I care about

Advice on learning how to code

  • It takes lots of practice and lots of errors
    • Break long blocks of code into individual steps to see what’s happening
  • Create code chunks and FAFO
    • Just clean up when you’re done…
  • The only dumb question is the one you don’t ask
  • Google, Stack Exchange are your friends
  • Try writing out in comments what you want to do in code
  • Learn to recognize patterns in the questions/tasks I give you:
    • Copy and paste code I give
    • Change one thing
    • Fix the error
    • Adapt code from class to do a similar thing
  • Learning to code is much less painful when you have a reason to do it
    • Let me know what interests you

Descriptive Statistics

Descriptive Statistics

When social scientists talk about descriptive inference, we’re trying to summarize our data and make claims about what’s typical of our data

  • What’s a typical value
    • Measures of central tendency
    • mean, median, mode
  • How do our data vary around typical values
    • Measures of dispersion
    • variance, standard deviation, range, percentile ranges
  • How does variation in one variable relate to variation in another
    • Measures of association
    • covariance, correlation

Using R to Summarize Data

Here are some common ways of summarizing data and how to calculate them with R

Description          Usage
sum                  sum(x)
minimum              min(x)
maximum              max(x)
range                range(x)
mean                 mean(x)
median               median(x)
percentile           quantile(x)
variance             var(x)
standard deviation   sd(x)
rank                 rank(x)

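Most of these functions have an argument called na.rm that defaults to FALSE. If your data have missing values, you’ll need to set na.rm = TRUE (e.g. mean(x, na.rm=T)); otherwise the result is typically NA. The exception is rank(), which handles missing values through its na.last argument instead.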
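
A quick sketch of what missing values do to a summary, using a made-up vector x:

```r
x <- c(2, 4, NA, 10)  # invented values with one missing observation

# With the default na.rm = FALSE, one NA makes the whole mean NA
mean_default <- mean(x)
mean_default

# Dropping the NA first gives the mean of 2, 4, and 10
mean_no_na <- mean(x, na.rm = TRUE)
mean_no_na
```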

What you need to know for POLS 1600

Measures of typical values

  • Means (mean()) all the time
  • Medians (median()) useful for describing distributions of variables particularly those with extreme values
  • Mode useful for characterizing categorical data

What you need to know for POLS 1600

Measures of typical variation

  • var() important for quantifying uncertainty, but rarely will you be calculating this directly
  • sd() a good summary of how far observations typically fall from the mean.
  • range(), min(), max() useful for exploring data, detecting outliers and potential values that need to be recoded

What you need to know for POLS 1600

Measures of association

  • Covariance (cov()) central to describing relationships but generally not something you’ll calculate or interpret directly
  • Correlation (cor()) useful for describing [bivariate] relationships (positive or negative relationships).
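
To see how the two measures relate, the sketch below computes both for two made-up vectors and then checks that the correlation is just the covariance rescaled by the two standard deviations.

```r
# Invented data for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov_xy <- cov(x, y)  # in the units of x times the units of y
cor_xy <- cor(x, y)  # unit-free, always between -1 and 1

# Correlation is covariance divided by the product of the standard deviations
cor_by_hand <- cov_xy / (sd(x) * sd(y))
all.equal(cor_xy, cor_by_hand)
```

Because the correlation has no units, it is easier to compare across pairs of variables than the covariance.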

What you don’t really need to know for POLS 1600

We won’t spend much time on the formal definitions, math, and proofs

\[ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i \]

\[ M_x = m : \int_{-\infty}^{m} f_X(x)\,dx = \int_{m}^{\infty} f_X(x)\,dx = \frac{1}{2} \]

Useful eventually. Not necessary right now.
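
If you are curious, the formula for the mean is easy to check by hand in R: add up the values and divide by how many there are. For a small sample, the median is the middle of the sorted values (here, the average of the two middle values). The vector x is made up.

```r
x <- c(2, 4, 4, 10)  # invented values
n <- length(x)

# The mean formula: sum the x_i and divide by n
mean_by_hand <- sum(x) / n
mean_by_hand
mean(x)

# The median: half the values fall at or below it, half at or above it
median_x <- median(x)
median_x
```

Note how the one large value (10) pulls the mean above the median.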

Data Visualization: The Grammar of Graphics

Data visualization

Data visualization is an incredibly valuable tool that helps us to

  • Explore data, uncovering new relationships, as well as potential problems
  • Communicate our results clearly and precisely

Take a look at how the BBC uses R to produce its graphics

Data visualization

Today, we will:

  • Introduce the grammar of graphics
  • Learn how to apply this grammar with ggplot()
  • Introduce basic plots to describe
    • Univariate distributions
    • Bivariate relations

The Grammar of Graphics

Inspired by Wilkinson (2005)

A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.

At a minimum, a graphic contains three core components:

  • data: the dataset containing the variables of interest.
  • aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
  • geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.

Ismay and Kim (2022)
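
Here is a minimal sketch of those three components in code. It uses mtcars, a dataset that ships with R, as a stand-in for course data; the object name fig_sketch is made up.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the course data here
fig_sketch <- ggplot(
  data = mtcars,                  # data: the dataset
  mapping = aes(x = wt, y = mpg)  # aes: map variables to x/y position
) +
  geom_point()                    # geom: draw each row as a point
fig_sketch
```

Swapping geom_point() for another geom, such as geom_line(), changes the object that gets drawn while the data and aesthetic mappings stay the same.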

Seven Layers of Graphics

Kesari (2018)

The grammar of graphics in R

In R, we’ll implement this grammar of graphics using the ggplot2 package

  • Let’s take a look at your feedback to last week’s survey and see how we can visualize some of the information you provided

Feedback

What we liked

What we disliked

Building that figure

  1. Look at the raw data
  2. Recode the raw data
  3. Make a basic plot, telling R the data, aesthetics, geometries, and statistics I want it to plot
  4. Tinker with the data and plot’s scales, coordinates, labels and theme to make the figure look better

1. Look at the raw data

df$trip
<labelled<double>[12]>: You're on a road trip with friends. Who controls the music?
 [1] NA  3  1  2  3  2  2  2  1  2  2 NA

Labels:
 value
     1
     2
     3
                                                                                                                               label
                                                                                                                    The driver, duh.
                                                                                                           The front seat, of course
 That jerk in the back who you don't even know but seems to have really strong feelings about Billy Joel's "Only the good die young"

2. Recode the raw data

df %>%
  mutate(
    Playlist = forcats::as_factor(trip)
 )%>%
  select(Playlist)
# A tibble: 12 × 1
   Playlist                                                                     
   <fct>                                                                        
 1  <NA>                                                                        
 2 "That jerk in the back who you don't even know but seems to have really stro…
 3 "The driver, duh."                                                           
 4 "The front seat, of course"                                                  
 5 "That jerk in the back who you don't even know but seems to have really stro…
 6 "The front seat, of course"                                                  
 7 "The front seat, of course"                                                  
 8 "The front seat, of course"                                                  
 9 "The driver, duh."                                                           
10 "The front seat, of course"                                                  
11 "The front seat, of course"                                                  
12  <NA>                                                                        

3. Make a basic plot

df %>% # Raw data
  mutate(
    Playlist =forcats::as_factor(trip)
  ) %>% # Transformed data
  ggplot(aes(x = Playlist, # Aesthetics
             fill = Playlist))+
  geom_bar( # Geometry
    stat = "count" # Statistic
    ) -> fig_roadtrip
fig_roadtrip

4.1 Tinker with data

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count") -> fig_roadtrip

4.2 Tinker with fill aesthetic

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count")+
  scale_fill_brewer() -> fig_roadtrip

4.3 Tinker with coordinates

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count")+
  scale_fill_brewer() +
  coord_flip() -> fig_roadtrip

4.4 Tinker with labels

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count")+
  scale_fill_brewer(guide="none")+
  coord_flip()+
  labs(title = "Who controls the playlist",
       x= "",
       y = "")-> fig_roadtrip

4.5 Tinker with theme

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count")+
   scale_fill_brewer(guide="none")+
  coord_flip()+
  labs(title = "Who controls the playlist",
       x= "",
       y = "")+
  theme_bw() -> fig_roadtrip

The final code

df %>%
  filter(!is.na(trip)) %>% 
  mutate(
    Playlist =str_wrap(forcats::as_factor(trip),20)
  ) %>%
  ggplot(aes(x = Playlist,
             fill = Playlist))+
  geom_bar(stat = "count")+
   scale_fill_brewer(guide="none")+
  coord_flip()+
  labs(title = "Who controls the playlist",
       x= "",
       y = "")+
  theme_bw() -> fig_roadtrip

DataViz: Describing Distributions and Associations

Describing Distributions and Associations

  • In the remaining slides, we’ll see how to visualize some distributions and associations in the Covid data using:

    • barplots
    • histograms
    • density plots
    • boxplots
    • line plots
    • scatter plots

General advice for making figures

  • Think through conceptually how you want the figure to look
    • Draw it out by hand
  • Make a basic plot and iterate
  • Use summarize() and other data wrangling skills to transform data for plotting
  • Use factor() and related functions to control order of labels on axis
  • Use Google to figure out arcane options of ggplot()
  • Don’t let the perfect be the enemy of the good
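
On the factor() point: by default R orders the levels of a factor alphabetically, and ggplot() uses that order for axis labels. A small sketch with a made-up vector shows the difference that setting levels makes.

```r
sizes <- c("medium", "low", "high", "low")  # invented categories

# Default: levels are sorted alphabetically
default_order <- factor(sizes)
levels(default_order)

# Explicit: levels follow the order we supply
custom_order <- factor(sizes, levels = c("low", "medium", "high"))
levels(custom_order)
```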

Barplots

What was the most common face mask policy in the data?

covid_us %>% 
  ggplot(aes(x=face_masks))+
  geom_bar(stat = "count")

covid_us %>%
  ungroup() %>% 
  mutate(
    face_masks = forcats::fct_infreq(face_masks)
  ) %>% 
  ggplot(aes(x=face_masks,
             fill = face_masks))+
  geom_bar()+
  geom_text(stat='count', aes(label=after_stat(count)), 
            hjust=.5,vjust=-.5)+
  guides(fill = "none")+
  theme_bw()+
  labs(
    x = "Face Mask Policy ",
    title = ""
  ) -> fig_barplot

Histogram

What does the distribution of new Covid-19 cases look like in June 2021?

covid_us %>% 
  filter(year_month == "2021-06") %>% 
  ggplot(aes(x=new_cases))+
  geom_histogram() -> fig_hist1
fig_hist1

covid_us %>%
  filter(year_month == "2021-06") %>% 
  filter(new_cases > 0) %>% 
  ggplot(aes(x=new_cases))+
  geom_histogram() +
  labs(
    title = "Exclude Negative Values"
  ) -> fig_hist2a

covid_us %>%
  filter(year_month == "2021-06") %>% 
  filter(new_cases > 0) %>% 
  ggplot(aes(x=new_cases))+
  geom_histogram() +
  scale_x_log10()+
  labs(
    title = "Exclude Negative Values & Use log scale"
  ) -> fig_hist2b

fig_hist2 <- ggarrange(fig_hist2a, fig_hist2b)
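
Why filter to new_cases > 0 before using a log scale? The logarithm is only defined for positive numbers, as this quick check with made-up values shows.

```r
x <- c(-5, 0, 1, 10, 100)  # invented values

# Negative values give NaN (with a warning) and zero gives -Inf
logged_all <- suppressWarnings(log10(x))
logged_all

# Keeping only positive values avoids both problems
x_pos <- x[x > 0]
logged_pos <- log10(x_pos)
logged_pos
```

On a log scale, each step along the axis is a tenfold increase, which spreads out the many small values and pulls in the few very large ones.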

Density Plots

What does the distribution of Covid-19 deaths look like?

covid_us %>% 
  mutate(
    new_deaths = deaths - lag(deaths),
    # per 100,000 residents, matching new_cases_pc
    new_deaths_pc = new_deaths / population * 100000
  ) %>% 
  filter(new_deaths > 0) %>% 
  ggplot(aes(x=new_deaths_pc))+
  geom_density() -> fig_density1
fig_density1

covid_us %>% 
  mutate(
    new_deaths = deaths - lag(deaths),
    # per 100,000 residents, matching new_cases_pc
    new_deaths_pc = new_deaths / population * 100000,
    year_f = factor(year)
  ) %>% 
  filter(new_deaths > 0) %>% 
  ggplot(aes(x=new_deaths_pc,
             col = year_f))+
  geom_density() +
  geom_rug() +
  scale_x_log10() +
    facet_wrap(~month)+
  theme(legend.position = "bottom")-> 
  fig_density2

Box plots

How did the distribution of Covid-19 cases vary by face mask policy?

covid_us %>%
  filter(new_cases_pc > 0) %>% 
  ggplot(aes(x= face_masks, y=new_cases_pc))+
  scale_y_log10()+
  geom_boxplot() -> fig_boxplot1
fig_boxplot1

covid_us %>%
  mutate(
    Month = lubridate::month(date, label = T)
  ) %>% 
  filter(new_cases_pc > 0) %>% 
  filter(year == 2020) %>% 
 ggplot(aes(x= face_masks, 
            y=new_cases_pc,
            col = face_masks))+
  scale_y_log10()+
  coord_flip() +
  geom_boxplot()  +
    facet_wrap(~Month) +
  theme(
    legend.position = "bottom"
  )-> fig_boxplot2

Line graphs

How did vaccination rates vary by state?

covid_us %>%
  ggplot(
    aes(x= date,
        y=percent_vaccinated,
        group = state
        ))+
  geom_line() -> fig_line1
fig_line1

covid_us %>%
  ungroup() %>%
  mutate(
    Label = case_when(
      date == max(date) & percent_vaccinated == max(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      date == max(date) & percent_vaccinated == median(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      date == max(date) & percent_vaccinated == min(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      TRUE ~ NA_character_
    ),
    line_alpha = case_when(
      state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ 1,
      T ~ .3
    ),
    line_col = case_when(
      state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ "black",
      T ~ "grey"
    )
  ) %>%
  ggplot(
    aes(x= date,
        y=percent_vaccinated,
        group = state
        ))+
  geom_line(
    aes(alpha = line_alpha,
        col =line_col)) +
  geom_text_repel(aes(label = Label),
                  direction = "x",
                  nudge_y = 2) +
  guides(
    alpha = "none",
    col = "none"
  )+
  xlim(ym("2021-01"), ym("2023-01")) +
  labs(
    y = "Percent Vaccinated",
    x = "Date"
  ) +
  theme_bw()-> fig_line2

Scatterplots

What’s the relationship between vaccination rates and new cases of Covid-19?

covid_us %>%
  ggplot(
    aes(x= percent_vaccinated,
        y=new_cases_pc,
        ))+
  geom_point() -> fig_scatter1
fig_scatter1

covid_us %>%
  filter(year > 2020) %>%
  filter(month == 6) %>%
  filter(new_cases_pc > 0) %>%
  ggplot(
    aes(x= percent_vaccinated,
        y=new_cases_pc,
        ))+
  geom_point() +
  geom_smooth(method = "lm")+
  facet_wrap(~year_month,ncol =1,
             scales = "free_y")-> fig_scatter2

Summary

Summary

  • The grammar of graphics provides a language for translating data into figures

  • At a minimum, figures made with ggplot() require three things:

    • data
    • aesthetic mappings
    • geometries
  • To produce a figure:

    • think about what the end product will look like
    • transform your data
    • map variables onto corresponding aesthetics
    • tell R what to do with these aesthetic mappings
    • Revise and iterate!
  • Learning to code is hard, but the more errors you make now, the easier your life will be in the future