Data Visualization
Updated Apr 22, 2025
Suppose you want to do the following, what function or functions would you use:
Submit tutorials from last week for full credit by this Sunday.
Groups for the course assigned next week
Once you’ve done the following
You can see the available problem sets by running the following code in your console:
And start a specific tutorial by running:
Important
Please upload tutorials 00-intro and 01-measurement1 to Canvas by Friday
This week we’ll use the following libraries.
the_packages <- c(
## R Markdown
"tinytex", "kableExtra",
## Tidyverse
"tidyverse","lubridate", "forcats", "haven","labelled",
## Extensions for ggplot
"ggmap","ggrepel", "ggridges", "ggthemes","ggpubr",
"GGally",
# Data
"maps","mapdata","DT"
)
the_packages
[1] "tinytex" "kableExtra" "tidyverse" "lubridate" "forcats"
[6] "haven" "labelled" "ggmap" "ggrepel" "ggridges"
[11] "ggthemes" "ggpubr" "GGally" "maps" "mapdata"
[16] "DT"
Next we’ll create a function called ipak (thanks Steven) which takes a vector of package names (pkg), installs any that are not yet installed, and loads them all.
Again, run this code on your machines.
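The body of ipak is not reproduced in this text. Below is a minimal sketch of such a helper, assuming it follows the common "install if missing, then load" pattern; the implementation is an assumption and may differ from the version used in class.

```r
# Sketch of an install-and-load helper (assumed implementation)
ipak <- function(pkg) {
  # Which of the requested packages are not installed yet?
  new_pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
  # Install only the missing ones, along with their dependencies
  if (length(new_pkg) > 0) {
    install.packages(new_pkg, dependencies = TRUE)
  }
  # Load every package; returns a named logical vector (TRUE = loaded)
  sapply(pkg, require, character.only = TRUE)
}

# Try it on packages that ship with R, so nothing needs to be downloaded
loaded <- ipak(c("stats", "utils"))
loaded
```

The named vector of TRUE values it returns has the same shape as the output shown for the_packages.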
Finally, let’s use ipak to install and load the_packages
What should we replace some_function and some_input with to do this?
tinytex kableExtra tidyverse lubridate forcats haven labelled
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ggmap ggrepel ggridges ggthemes ggpubr GGally maps
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mapdata DT
TRUE TRUE
R may ask you to install a package’s dependencies (other packages your package needs). Try entering the number 1 into your console.
R may tell you that you need to restart R. Try saying yes. If it doesn’t start downloading, say no.
R may then ask if you want to compile some packages from source. Type Y into your console. If this doesn’t work, try again, but this time type N when asked.
Let’s load the Covid-19 data we worked with last week:
Unmatched parentheses or brackets
Misspelled a name
Forgot a comma
Forgot to install a package or load a library
Forgot to set the working directory/path to a file you want R to use.
Tried to select a column or row that doesn’t exist
RStudio’s script editor will show a red circle with a white x next to a line of code it thinks has an error in it.
Have someone else look at your code (Fresh eyes, paired programming)
Copy and paste the “general part” of error message into Google/ChatGPT
Knit your document after each completed code chunk
Be patient. Don’t be hard on yourself. Remember, errors are portals of discovery.
package_name::function_name()
Rarely, if ever, do we get data in the exact format we need.
Instead, before we can get to work, we often need to transform our data in various ways
Sometimes called:
The end goal is the same: make messy data tidy
Every column is a variable.
Every row is an observation.
Every cell is a single value
Last week we used the following functions:
read_csv() and data() to read and load data in R
logical operators like &, |, %in%, ==, !=, >, >=, <, <= to make comparisons
the pipe command %>% to “pipe” the output of one function into another
filter() to pick observations (rows) by their values
arrange() to reorder rows
select() to pick variables by their names
mutate() and case_when() to create new variables in our data set
summarise() to collapse many values into a single value (like a mean or median)
group_by() to apply functions like mutate() and summarise() on a group-by-group basis
All of these “verb” functions from the dplyr package (e.g. filter(),mutate()) follow a similar format:
The pipe command %>% is a way of “chaining” lines of code together, piping the results of one tidyverse function into the next function.
The pipe command works because these functions always expect a data frame as their first argument, and always produce a data frame as their output.
# Data frame supplied as the first argument:
summarise(
  df,
  mean = mean(var1, na.rm = T),
  median = median(var1, na.rm = T)
)
# Rewrite with a pipe:
df %>%
  summarize(
    mean = mean(var1, na.rm = T),
    median = median(var1, na.rm = T)
  )
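To check that the two forms really are equivalent, here is a small sketch on a made-up data frame (df and var1 are illustrative stand-ins, not the course data):

```r
library(dplyr)

# Made-up data frame for illustration only
df <- tibble(var1 = c(1, 2, 3, NA, 10))

# Data frame passed directly as the first argument
without_pipe <- summarise(df, mean = mean(var1, na.rm = TRUE))

# Same call, with df piped in as the first argument
with_pipe <- df %>% summarise(mean = mean(var1, na.rm = TRUE))

# Both approaches return the same one-row data frame
identical(without_pipe, with_pipe)
```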
To work with the Covid-19 data we did the following:
Specifically, we did the following:
Create an object called territories that is a vector containing the names of U.S. territories
Create a new data frame called covid_us by filtering out observations from the U.S. territories
Create a state variable that is a copy of the administrative_area_level_2
Create a variable called new_cases from the confirmed variable. Create a variable called new_cases_pc that is the number of new Covid-19 cases per 100,000 citizens
Create a variable called face_masks from the facial_coverings variable.
Let’s take some time to make sure we understand everything that was happening.
territories
territories now exists in our environment.
covid_us
Use the filter() command to select only the rows where the administrative_area_level_2 is not (!) in (%in%) the territories object
state
Copy administrative_area_level_2 into a new variable called state
Note
Note that we have to save the output of mutate() back into covid_us for state to exist as a new column in covid_us
state
Now there’s a new column in covid_us called state that we can access by calling covid_us$state
[1] "Minnesota" "Minnesota" "Minnesota" "Minnesota" "Minnesota"
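The filtering and copying steps can be tried on a tiny made-up data frame. This is a sketch only: toy, toy_territories, and the values in them are hypothetical stand-ins for the real Covid-19 data.

```r
library(dplyr)

# Made-up stand-in for the Covid-19 data
toy <- tibble(
  administrative_area_level_2 = c("Minnesota", "Guam", "Ohio", "Puerto Rico"),
  confirmed = c(10, 2, 7, 5)
)

# Hypothetical (partial) vector of territory names
toy_territories <- c("Guam", "Puerto Rico")

toy_us <- toy %>%
  # Keep rows whose area is NOT in toy_territories
  filter(!administrative_area_level_2 %in% toy_territories) %>%
  # Copy the area name into a new column called state
  mutate(state = administrative_area_level_2)

# The same two steps in base R
toy_base <- toy[!toy$administrative_area_level_2 %in% toy_territories, ]
toy_base$state <- toy_base$administrative_area_level_2

toy_us$state
```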
We could have done the same thing in “Base” R
Why didn’t we?
tidyverse > base R
mutate() plays nicely with functions like group_by()
new_cases from the confirmed variable
The confirmed variable contains a running total of confirmed cases in a given state on a given day.
Visualizing data helps us understand how we might need to transform our data
confirmed variable for Rhode Island
new_cases from the confirmed variable
Take the difference between a given day’s value of confirmed and yesterday’s value of confirmed to create a measure of new_cases on a given date for each state
Note
Use lag() to shift values in a column down one row in the data
Use group_by() to respect the state-date structure of the data
new_cases_pc
Divide new_cases by population to create a per capita measure (new_cases_pc)
Note
We can create multiple variables in a single mutate() by separating lines of code with a ,
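A minimal sketch of the lag() and group_by() logic on made-up data (the states "A" and "B", the counts, and the populations are invented for illustration). Grouping by state keeps the first day of state B from being differenced against the last day of state A.

```r
library(dplyr)

# Made-up running totals for two hypothetical states
toy <- tibble(
  state = rep(c("A", "B"), each = 3),
  date = rep(as.Date("2020-03-01") + 0:2, times = 2),
  confirmed = c(0, 2, 5, 10, 10, 14),
  population = c(1000, 1000, 1000, 2000, 2000, 2000)
)

toy <- toy %>%
  group_by(state) %>%                  # difference within each state
  arrange(date, .by_group = TRUE) %>%  # make sure days are in order
  mutate(
    # Today's running total minus yesterday's running total
    new_cases = confirmed - lag(confirmed),
    # Two variables in one mutate(), separated by a comma
    new_cases_pc = new_cases / population * 100000
  ) %>%
  ungroup()

toy
```

The first day in each state is NA because there is no previous day to subtract.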
face_masks
Create a variable called face_masks from the facial_coverings variable that describes the face mask policy experienced by most people in a given state on a given date.
Note
Use case_when() inside of mutate() to create a variable that takes certain values when certain logical statements are true
The levels = c(value1, value2, etc.) argument in factor() lets us control the ordering of categorical/character data.
Recall that the facial_coverings variable took on a range of substantive values from 0 to 4, but empirically could take both positive and negative values
covid_us %>%
  mutate(
    face_masks = case_when(
      facial_coverings == 0 ~ "No policy",
      abs(facial_coverings) == 1 ~ "Recommended",
      abs(facial_coverings) == 2 ~ "Some requirements",
      abs(facial_coverings) == 3 ~ "Required shared places",
      abs(facial_coverings) == 4 ~ "Required all times"
    ) %>% factor(.,
      levels = c("No policy", "Recommended",
                 "Some requirements",
                 "Required shared places",
                 "Required all times")
    )
  ) -> covid_us
covid_us %>%
  filter(state == "Illinois", date > "2020-9-28") %>%
  select(state, date, facial_coverings, face_masks) %>%
  slice(1:5)
# A tibble: 5 × 4
# Groups: state [1]
state date facial_coverings face_masks
<chr> <date> <int> <fct>
1 Illinois 2020-09-29 2 Some requirements
2 Illinois 2020-09-30 2 Some requirements
3 Illinois 2020-10-01 -4 Required all times
4 Illinois 2020-10-02 -4 Required all times
5 Illinois 2020-10-03 -4 Required all times
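What the levels argument to factor() does can be seen on a small made-up vector. This is a base R sketch; policy and policy_order are names introduced here for illustration.

```r
# Made-up vector of policy labels, in no particular order
policy <- c("Some requirements", "No policy", "Required all times",
            "Recommended", "Required shared places")

# Without levels, factor() sorts the labels alphabetically
default_order <- levels(factor(policy))

# With levels, we choose an order that reflects policy strictness
policy_order <- c("No policy", "Recommended", "Some requirements",
                  "Required shared places", "Required all times")
ordered_policy <- factor(policy, levels = policy_order)

default_order
levels(ordered_policy)

# Each value's position in policy_order
as.integer(ordered_policy)
```

Tables and plot axes follow the order of the levels, which is why the ordering matters later when plotting.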
In last week’s lab, we also added the following
R treats dates differently
If R knows a variable is a date, we can extract components of that date, using functions from the lubridate package
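As a sketch of how this might look in practice, the example below pulls the year and month out of a few made-up dates and combines them into a year-month label like the year_month values that appear later. The exact code used in the lab is not shown here, so treat the steps as an assumption.

```r
library(lubridate)
library(stringr)

# A few made-up dates
dates <- as.Date(c("2020-01-15", "2021-06-01", "2022-11-30"))

yr <- year(dates)   # numeric year
mo <- month(dates)  # numeric month, 1 to 12

# Pad single-digit months with a leading zero so all are two characters wide
mo_padded <- str_pad(mo, width = 2, pad = "0")

# Paste year and padded month together with a dash
year_month <- paste(yr, mo_padded, sep = "-")
year_month
```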
str_pad() and paste() function
The str_pad() function lets us ‘pad’ strings so that they’re all the same width
[1] 1 1 1
[1] "01" "01" "01"
The paste function lets us paste objects together.
new_cases by face_mask policy
Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect
Note
The group_by() command will do each calculation inside of summarise() for each level of the grouping variable
new_cases by face_mask policy by month
Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect for each year_month in our dataset
Note
The group_by() command can group on multiple variables
# A tibble: 102 × 3
# Groups: face_masks [5]
face_masks year_month new_cases_pc
<fct> <chr> <dbl>
1 No policy 2020-01 0.000463
2 No policy 2020-02 0.00188
3 No policy 2020-03 1.70
4 No policy 2020-04 6.50
5 No policy 2022-04 19.8
6 No policy 2022-05 20.4
7 No policy 2022-06 37.6
8 No policy 2022-07 36.2
9 No policy 2022-08 35.7
10 No policy 2022-09 19.0
# ℹ 92 more rows
# In base R:
mean(
  covid_us$new_cases_pc[
    covid_us$face_masks == "No policy" &
      covid_us$year_month == "2020-01"], na.rm = T)
[1] 0.0004626161
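The same comparison between a grouped summary and a base R subset can be run on a small made-up data frame. This is a sketch; the values in toy are invented.

```r
library(dplyr)

# Made-up data with two grouping variables
toy <- tibble(
  face_masks = c("No policy", "No policy", "Recommended",
                 "Recommended", "Recommended"),
  year_month = c("2020-01", "2020-01", "2020-01", "2020-02", "2020-02"),
  new_cases_pc = c(1, 3, 2, NA, 6)
)

# One mean per combination of face_masks and year_month
by_policy_month <- toy %>%
  group_by(face_masks, year_month) %>%
  summarise(new_cases_pc = mean(new_cases_pc, na.rm = TRUE),
            .groups = "drop")

# Base R: subset to one combination by hand, then take the mean
base_value <- mean(
  toy$new_cases_pc[toy$face_masks == "No policy" &
                     toy$year_month == "2020-01"],
  na.rm = TRUE
)

by_policy_month
base_value
```

The base R approach needs one such subset per group, whereas group_by() handles every combination in a single call.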
Suppose you want to do the following, what function or functions would you use:
read_xxx() (tidy), read.xxx() (base)
head(), tail(), glimpse(), table(), summary(), View()
data %>% filter(x > 0), data[data$x > 0, ], subset(data, x > 0)
data$variable, data %>% select(variable1, variable2), data[,c("x1","x2")]
data %>% mutate(x = y/10), data$x <- data$y/10
data %>% summarise(x_mn = mean(x, na.rm=T))
data %>% group_by(g) %>% summarise(x_mn = mean(x, na.rm=T))
Should you know exactly how to do all of this?
NO! Of course not. For Pete’s sake, Paul, it’s only the second week
Will you learn how to do much of this?
Maybe, but I’m feeling pretty overwhelmed…
How will you learn how to do these things?
With lots of practice, patience, and repetition motivated by a sense that these skills will help me learn about things I care about
When social scientists talk about descriptive inference, we’re trying to summarize our data and make claims about what’s typical of our data
Here are some common ways of summarizing data and how to calculate them with R
| Description | Usage |
|---|---|
| sum | sum(x) |
| minimum | min(x) |
| maximum | max(x) |
| range | range(x) |
| mean | mean(x) |
| median | median(x) |
| percentile | quantile(x) |
| variance | var(x) |
| standard deviation | sd(x) |
| rank | rank(x) |
All of these functions except rank() have an argument called na.rm, which defaults to FALSE. If your data have missing values, you’ll need to set na.rm = TRUE (e.g. mean(x, na.rm=T))
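A short base R sketch of what happens with and without na.rm, using a made-up vector with one missing value:

```r
# Made-up vector with one missing value
x <- c(2, 4, NA, 10)

# With the default, a single NA makes the result NA
mean_default <- mean(x)

# Dropping missing values first gives a usable answer
mean_narm <- mean(x, na.rm = TRUE)
median_narm <- median(x, na.rm = TRUE)
range_narm <- range(x, na.rm = TRUE)

mean_default
mean_narm
median_narm
range_narm
```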
Measures of typical values
Means (mean()) all the time
Medians (median()) useful for describing distributions of variables, particularly those with extreme values
Measures of typical variation
Variance (var()) important for quantifying uncertainty, but rarely will you be calculating this directly
Standard deviation (sd()) a good summary of a typical change in the data.
range(), min(), max() useful for exploring data, detecting outliers and potential values that need to be recoded
Measures of association
Covariance (cov()) central to describing relationships but generally not something you’ll calculate or interpret directly
Correlation (cor()) useful for describing [bivariate] relationships (positive or negative relationships).
We won’t spend much time on the formal definitions, math, and proofs
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$M_x = X_i : \int_{-\infty}^{x_i} f_x(X)\,dx = \int_{x_i}^{\infty} f_x(X)\,dx = 1/2$$
Useful eventually. Not necessary right now.
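For those who are curious, the definitions above can be checked numerically. This is a base R sketch on a made-up vector with one extreme value; it also shows why the median is less sensitive to extremes than the mean.

```r
# Made-up data with one extreme value
x <- c(1, 2, 3, 4, 100)
n <- length(x)

# The mean by hand: add everything up and divide by n
manual_mean <- sum(x) / n
manual_mean == mean(x)

# The median splits the data: equal shares fall below and above it
med <- median(x)
share_below <- mean(x < med)
share_above <- mean(x > med)

manual_mean
med
```

The single value of 100 pulls the mean far above most of the data, while the median stays in the middle.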
Data visualization is an incredibly valuable tool that helps us to
Take a look at how the BBC uses R to produce its graphics
Today, we will:
Introduce the grammar of graphics
Apply it using ggplot()
Inspired by Wilkinson (2005)
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
At a minimum, a graphic contains three core components:
data: the dataset containing the variables of interest.
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
In R, we’ll implement this grammar of graphics using the ggplot2 package
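The three components can be seen in a minimal sketch on made-up survey answers (toy and answer are hypothetical names, not the class survey data):

```r
library(ggplot2)

# data: a made-up data frame of survey answers
toy <- data.frame(
  answer = c("Driver", "Front seat", "Front seat",
             "Back seat", "Front seat", "Driver")
)

# aes: map the answer variable to the x position
# geom: draw bars, whose heights default to the count of each answer
p <- ggplot(data = toy, mapping = aes(x = answer)) +
  geom_bar()

p
```

Swapping the geom or adding aesthetics such as fill changes the figure without changing the data.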
data, aesthetics, geometries, and statistics I want it to plot
<labelled<double>[12]>: You're on a road trip with friends. Who controls the music?
[1] NA 3 1 2 3 2 2 2 1 2 2 NA
Labels:
 value label
     1 The driver, duh.
     2 The front seat, of course
     3 That jerk in the back who you don't even know but seems to have really strong feelings about Billy Joel's "Only the good die young"
# A tibble: 12 × 1
Playist
<fct>
1 <NA>
2 "That jerk in the back who you don't even know but seems to have really stro…
3 "The driver, duh."
4 "The front seat, of course"
5 "That jerk in the back who you don't even know but seems to have really stro…
6 "The front seat, of course"
7 "The front seat, of course"
8 "The front seat, of course"
9 "The driver, duh."
10 "The front seat, of course"
11 "The front seat, of course"
12 <NA>
df %>%
  filter(!is.na(trip)) %>%
  mutate(
    Playlist = str_wrap(forcats::as_factor(trip), 20)
  ) %>%
  ggplot(aes(x = Playlist, fill = Playlist)) +
  geom_bar(stat = "count") -> fig_roadtrip

fill aesthetic
df %>%
  filter(!is.na(trip)) %>%
  mutate(
    Playlist = str_wrap(forcats::as_factor(trip), 20)
  ) %>%
  ggplot(aes(x = Playlist, fill = Playlist)) +
  geom_bar(stat = "count") +
  scale_fill_brewer(guide = "none") +
  coord_flip() +
  labs(title = "Who controls the playlist", x = "", y = "") -> fig_roadtrip

In the remaining slides, we’ll see how to visualize some distributions and associations in the Covid data using:
summarize() and other data wrangling skills to transform data for plotting
factor() and related functions to control order of labels on axis
ggplot
What was the most common face mask policy in the data?
covid_us %>%
  ungroup() %>%
  mutate(
    face_masks = forcats::fct_infreq(face_masks)
  ) %>%
  ggplot(aes(x = face_masks,
             fill = face_masks)) +
  geom_bar() +
  # after_stat(count) replaces the deprecated ..count.. notation
  geom_text(stat = "count", aes(label = after_stat(count)),
            hjust = .5, vjust = -.5) +
  guides(fill = "none") +
  theme_bw() +
  labs(
    x = "Face Mask Policy",
    title = ""
  ) -> fig_barplot
What does the distribution of new Covid-19 cases look like in June 2021?
covid_us %>%
  filter(year_month == "2021-06") %>%
  filter(new_cases > 0) %>%
  ggplot(aes(x = new_cases)) +
  geom_histogram() +
  labs(
    title = "Exclude Negative Values"
  ) -> fig_hist2a
covid_us %>%
  filter(year_month == "2021-06") %>%
  filter(new_cases > 0) %>%
  ggplot(aes(x = new_cases)) +
  geom_histogram() +
  scale_x_log10() +
  labs(
    title = "Exclude Negative Values & Use log scale"
  ) -> fig_hist2b
fig_hist2 <- ggarrange(fig_hist2a, fig_hist2b)
What does the distribution of Covid-19 deaths look like?
covid_us %>%
  mutate(
    new_deaths = deaths - lag(deaths),
    # Per 100,000 residents, assuming the same population column used for new_cases_pc
    new_deaths_pc = new_deaths / population * 100000,
    year_f = factor(year)
  ) %>%
  filter(new_deaths > 0) %>%
  ggplot(aes(x = new_deaths_pc,
             col = year_f)) +
  geom_density() +
  geom_rug() +
  scale_x_log10() +
  facet_wrap(~month) +
  theme(legend.position = "bottom") ->
  fig_density2
How did the distribution of Covid-19 cases vary by face mask policy?
covid_us %>%
  mutate(
    Month = lubridate::month(date, label = T)
  ) %>%
  filter(new_cases_pc > 0) %>%
  filter(year == 2020) %>%
  ggplot(aes(x = face_masks,
             y = new_cases_pc,
             col = face_masks)) +
  scale_y_log10() +
  coord_flip() +
  geom_boxplot() +
  facet_wrap(~Month) +
  theme(
    legend.position = "bottom"
  ) -> fig_boxplot2
How did vaccination rates vary by state?
covid_us %>%
  ungroup() %>%
  mutate(
    Label = case_when(
      date == max(date) & percent_vaccinated == max(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      date == max(date) & percent_vaccinated == median(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      date == max(date) & percent_vaccinated == min(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
      TRUE ~ NA_character_
    ),
    line_alpha = case_when(
      state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ 1,
      T ~ .3
    ),
    line_col = case_when(
      state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ "black",
      T ~ "grey"
    )
  ) %>%
  ggplot(
    aes(x = date,
        y = percent_vaccinated,
        group = state
    )) +
  geom_line(
    aes(alpha = line_alpha,
        col = line_col)) +
  geom_text_repel(aes(label = Label),
                  direction = "x",
                  nudge_y = 2) +
  guides(
    alpha = "none",
    col = "none"
  ) +
  xlim(ym("2021-01"), ym("2023-01")) +
  labs(
    y = "Percent Vaccinated",
    x = "Date"
  ) +
  theme_bw() -> fig_line2
What’s the relationship between vaccination rates and new cases of Covid-19?
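The code for this figure is not included in this text. One common way to explore a relationship between two numeric variables is a scatter plot with a fitted line. The sketch below uses invented numbers (toy is not the Covid-19 data) and should not be read as the figure from the slides or as a finding about vaccination and cases.

```r
library(ggplot2)

# Invented values for illustration only
toy <- data.frame(
  percent_vaccinated = c(40, 45, 50, 55, 60, 65, 70, 75),
  new_cases_pc = c(30, 28, 31, 24, 22, 25, 18, 17)
)

p_scatter <- ggplot(toy, aes(x = percent_vaccinated, y = new_cases_pc)) +
  geom_point() +                           # one point per observation
  geom_smooth(method = "lm", se = FALSE) + # straight line of best fit
  labs(x = "Percent Vaccinated", y = "New cases per 100,000") +
  theme_bw()

p_scatter
```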

The grammar of graphics provides a language for translating data into figures
At a minimum, figures with ggplot() require three things: data, aesthetic mappings, and geometries.
To produce a figure:
Learning to code is hard, but the more errors you make now, the easier your life will be in the future

POLS 1600