Data Visualization
Updated Mar 3, 2025
Suppose you want to do the following, what function or functions would you use:
R
Submit tutorials from last week for full credit by this Sunday.
Groups for the course assigned next week
Once you’ve done the following
You can see the available problem sets by running the following code in your console:
And start a specific tutorial by running:
Important
Please upload tutorials 00-intro and 01-measurement1 to Canvas by Friday
This week we’ll use the following libraries.
the_packages <- c(
## R Markdown
"tinytex", "kableExtra",
## Tidyverse
"tidyverse","lubridate", "forcats", "haven","labelled",
## Extensions for ggplot
"ggmap","ggrepel", "ggridges", "ggthemes","ggpubr",
"GGally",
# Data
"maps","mapdata","DT"
)
the_packages
[1] "tinytex" "kableExtra" "tidyverse" "lubridate" "forcats"
[6] "haven" "labelled" "ggmap" "ggrepel" "ggridges"
[11] "ggthemes" "ggpubr" "GGally" "maps" "mapdata"
[16] "DT"
Next we’ll create a function called ipak (thanks Steven), which takes a vector of package names, pkg.
Again, run this code on your machines
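The definition of ipak did not survive on this slide. A minimal sketch of what such an install-and-load helper could look like is shown below; the body is an assumption based on the TRUE/FALSE output shown later, not necessarily the exact code used in class.

```r
# Sketch of an install-and-load helper (assumed implementation).
# Installing missing packages requires a configured CRAN mirror.
ipak <- function(pkg) {
  # Which requested packages are not yet installed?
  new_pkg <- pkg[!(pkg %in% rownames(installed.packages()))]
  # Install only the missing ones, along with their dependencies
  if (length(new_pkg) > 0) {
    install.packages(new_pkg, dependencies = TRUE)
  }
  # Load every package; returns a named logical vector (TRUE = loaded)
  sapply(pkg, require, character.only = TRUE)
}

# Example with packages that ship with R, so nothing is downloaded
res <- ipak(c("stats", "utils"))
res
```

The named logical vector it returns has the same shape as the TRUE/TRUE output printed for the course packages.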
Finally, let’s use ipak to install and load the_packages
What should we replace some_function and some_input with to do this?
tinytex kableExtra tidyverse lubridate forcats haven labelled
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
ggmap ggrepel ggridges ggthemes ggpubr GGally maps
TRUE TRUE TRUE TRUE TRUE TRUE TRUE
mapdata DT
TRUE TRUE
R may ask you to install a package’s dependencies (other packages your package needs). Try entering the number 1 into your console.
R may tell you that you need to restart R. Try saying yes. If it doesn’t start downloading, say no.
R may then ask if you want to compile some packages from source. Type Y into your console. If this doesn’t work, try again, but this time type N when asked.
Let’s load the Covid-19 data we worked with last week:
Unmatched parentheses or brackets
Misspelled a name
Forgot a comma
Forgot to install a package or load a library
Forgot to set the working directory/path to a file you want R to use.
Tried to select a column or row that doesn’t exist
RStudio’s script editor will show a red circle with a white x next to a line of code it thinks has an error in it.
Have someone else look at your code (Fresh eyes, paired programming)
Copy and paste the “general part” of error message into Google/ChatGPT
Knit your document after each completed code chunk
Be patient. Don’t be hard on yourself. Remember, errors are portals of discovery.
package_name::function_name()
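As a small illustration of this syntax, the sketch below calls functions from packages that ship with R by their full name. The same pattern lets you say which package you mean when two packages share a function name, for example dplyr::filter() versus stats::filter().

```r
# Call a function by package and name without attaching the package first
middle <- stats::median(c(1, 2, 10))
middle

# Works for any installed package, here utils
first_three <- utils::head(letters, n = 3)
first_three
```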
Rarely, if ever, do we get data in the exact format we need.
Instead, before we can get to work, we often need to transform our data in various ways
Sometimes called:
The end goal is the same: make messy data tidy
Every column is a variable.
Every row is an observation.
Every cell is a single value
Last week we used the following functions:
read_csv() and data() to read and load data in R
logical operators like &, |, and %in%, and comparison operators like ==, !=, >, >=, <, and <= to make comparisons
the pipe command %>% to “pipe” the output of one function into another
filter() to pick observations (rows) by their values
arrange() to reorder rows
select() to pick variables by their names
mutate() and case_when() to create new variables in our data set
summarise() to collapse many values into a single value (like a mean or median)
group_by() to apply functions like mutate() and summarise() on a group-by-group basis
All of these “verb” functions from the dplyr package (e.g. filter(), mutate()) follow a similar format:
%>%
?%>%
The pipe command is a way of “chaining” lines of code together, piping the results of one tidyverse function into the next function.
The pipe command works because these functions always expect a data frame as their first argument, and always produce a data frame as their output.
%>%
# Without a pipe, the data frame is the first argument:
summarise(
  df,
  mean = mean(var1, na.rm = T),
  median = median(var1, na.rm = T)
)
# Rewrite with a pipe:
df %>%
  summarize(
    mean = mean(var1, na.rm = T),
    median = median(var1, na.rm = T)
  )
To work with the Covid-19 data we did the following:
Specifically, we did the following:
Create an object called territories that is a vector containing the names of U.S. territories
Create a new data frame called covid_us, by filtering out observations from the U.S. territories
Create a state variable that is a copy of the administrative_area_level_2
Create a variable called new_cases from the confirmed variable. Create a variable called new_cases_pc that is the number of new Covid-19 cases per 100,000 citizens
Create a variable called face_masks from the facial_coverings variable.
Calculate the mean number of new_cases for each level of face_masks
Let’s take some time to make sure we understand everything that was happening.
territories
territories now exists in our environment.
covid_us
Use the filter() command to select only the rows where the administrative_area_level_2 is not (!) in (%in%) the territories object
state
Copy administrative_area_level_2 into a new variable called state
Note
Note that we have to save the output of mutate() back into covid_us for our state variable to exist as a new column in covid_us
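The filtering and copying steps can be sketched on a toy data frame. The data values and the short territories vector here are illustrative assumptions, not the course data.

```r
library(dplyr)

# Toy stand-in for the course data (values are made up)
toy <- tibble(
  administrative_area_level_2 = c("Minnesota", "Guam", "Rhode Island", "Puerto Rico"),
  confirmed = c(10, 2, 5, 7)
)

# Illustrative vector of territories to drop
territories <- c("Guam", "Puerto Rico")

toy_us <- toy %>%
  # keep rows whose area is NOT in territories
  filter(!administrative_area_level_2 %in% territories)

# mutate() returns a new data frame, so save it back to keep the new column
toy_us <- toy_us %>%
  mutate(state = administrative_area_level_2)

toy_us$state
```

Without the assignment back into toy_us, the new column would be printed once and then lost.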
state
Now there’s a new column in covid_us called state, that we can access by calling covid_us$state
[1] "Minnesota" "Minnesota" "Minnesota" "Minnesota" "Minnesota"
We could have done the same thing in “Base” R
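A minimal sketch of the base R version, using a toy data frame with made-up values rather than the course data:

```r
# Toy data frame (illustrative values)
toy <- data.frame(
  administrative_area_level_2 = c("Minnesota", "Rhode Island"),
  stringsAsFactors = FALSE
)

# Base R: create the new column with $ and the assignment arrow
toy$state <- toy$administrative_area_level_2

toy$state
```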
Why didn’t we?
tidyverse > base R
mutate() plays nicely with functions like group_by()
new_cases from the confirmed variable
The confirmed variable contains a running total of confirmed cases in a given state on a given day.
Visualizing data helps us understand how we might need to transform our data
confirmed variable for Rhode Island
new_cases from the confirmed variable
Take the difference between a given day’s value of confirmed and yesterday’s value of confirmed to create a measure of new_cases on a given date for each state
Note
Use lag() to shift values in a column down one row in the data
Use group_by() to respect the state-date structure of the data
new_cases_pc
Divide new_cases by population to create a per capita measure (new_cases_pc)
Note
We can create multiple variables in a single mutate() by separating lines of code with a ,
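The lag-within-group logic can be sketched on a toy data set. The numbers are made up, and the rows are assumed to be sorted by date within each state.

```r
library(dplyr)

# Toy running totals for two states (numbers are made up)
toy <- tibble(
  state      = rep(c("A", "B"), each = 3),
  date       = rep(as.Date("2020-03-01") + 0:2, times = 2),
  confirmed  = c(1, 3, 6, 10, 10, 15),
  population = c(1000, 1000, 1000, 2000, 2000, 2000)
)

toy <- toy %>%
  group_by(state) %>%   # lag within each state, not across states
  mutate(
    new_cases    = confirmed - lag(confirmed),      # today minus yesterday
    new_cases_pc = new_cases / population * 100000  # per 100,000 people
  ) %>%
  ungroup()

toy$new_cases
```

The first day for each state is NA because there is no previous day to subtract. Without group_by(), the first row of state B would instead be compared with the last row of state A.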
face_masks
Create a variable called face_masks from the facial_coverings variable that describes the face mask policy experienced by most people in a given state on a given date.
Note
Use case_when() inside of mutate() to create a variable that takes certain values when certain logical statements are true
The levels = c(value1, value2, etc.) argument in factor() lets us control the ordering of categorical/character data.
Recall that the facial_coverings variable took on a range of substantive values from 0 to 4, but empirically could take both positive and negative values
covid_us %>%
mutate(
face_masks = case_when(
facial_coverings == 0 ~ "No policy",
abs(facial_coverings) == 1 ~ "Recommended",
abs(facial_coverings) == 2 ~ "Some requirements",
abs(facial_coverings) == 3 ~ "Required shared places",
abs(facial_coverings) == 4 ~ "Required all times",
) %>% factor(.,
levels = c("No policy","Recommended",
"Some requirements",
"Required shared places",
"Required all times")
)
) -> covid_us
covid_us%>%
filter(state == "Illinois", date > "2020-9-28") %>%
select(state, date, facial_coverings, face_masks) %>%
slice(1:5)
# A tibble: 5 × 4
# Groups: state [1]
state date facial_coverings face_masks
<chr> <date> <int> <fct>
1 Illinois 2020-09-29 2 Some requirements
2 Illinois 2020-09-30 2 Some requirements
3 Illinois 2020-10-01 -4 Required all times
4 Illinois 2020-10-02 -4 Required all times
5 Illinois 2020-10-03 -4 Required all times
In last week’s lab, we also added the following
R treats dates differently
If R knows a variable is a date, we can extract components of that date, using functions from the lubridate
package
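A sketch of how date components can be extracted with lubridate and combined into a zero-padded year-month label of the kind that appears as year_month in these slides. The example dates are made up, and the exact way the course built year_month is an assumption.

```r
library(lubridate)
library(stringr)

# Three illustrative dates
dates <- as.Date(c("2020-01-15", "2020-01-16", "2021-11-02"))

yr <- year(dates)    # numeric year of each date
mo <- month(dates)   # numeric month of each date

# Pad the month to width 2 with leading zeros, then paste year and month
mo_padded  <- str_pad(as.character(mo), width = 2, side = "left", pad = "0")
year_month <- paste(yr, mo_padded, sep = "-")

year_month
```

Padding matters because text sorts character by character: without the leading zero, "2020-11" would sort before "2020-2".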
The str_pad() and paste() functions
The str_pad() function lets us ‘pad’ strings so that they’re all the same width
[1] 1 1 1
[1] "01" "01" "01"
The paste() function lets us paste objects together.
new_cases by face_mask policy
Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect
Note
The group_by() command will do each calculation inside of summarise() for each level of the grouping variable
new_cases by face_mask policy by month
Calculate the mean (average) number of new_cases of Covid-19 when each type of face_mask policy was in effect for each year_month in our dataset
Note
The group_by() command can group on multiple variables
# A tibble: 102 × 3
# Groups: face_masks [5]
face_masks year_month new_cases_pc
<fct> <chr> <dbl>
1 No policy 2020-01 0.000463
2 No policy 2020-02 0.00188
3 No policy 2020-03 1.70
4 No policy 2020-04 6.50
5 No policy 2022-04 19.8
6 No policy 2022-05 20.4
7 No policy 2022-06 37.6
8 No policy 2022-07 36.2
9 No policy 2022-08 35.7
10 No policy 2022-09 19.0
# ℹ 92 more rows
# In base R:
mean(
covid_us$new_cases_pc[
covid_us$face_masks == "No policy" &
covid_us$year_month == "2020-01"], na.rm = T)
[1] 0.0004626161
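The grouped means above can be sketched on a small made-up data frame. The column names follow the slides, but the values are illustrative only.

```r
library(dplyr)

toy <- tibble(
  face_masks   = c("No policy", "No policy", "Recommended", "Recommended"),
  year_month   = c("2020-03", "2020-04", "2020-04", "2020-04"),
  new_cases_pc = c(1, 3, 10, NA)
)

# One mean per policy
by_policy <- toy %>%
  group_by(face_masks) %>%
  summarise(new_cases_pc = mean(new_cases_pc, na.rm = TRUE))

# One mean per policy-month combination
by_policy_month <- toy %>%
  group_by(face_masks, year_month) %>%
  summarise(new_cases_pc = mean(new_cases_pc, na.rm = TRUE), .groups = "drop")

by_policy
by_policy_month
```

Grouping on one variable gives one row per policy. Grouping on two gives one row per combination that actually occurs in the data.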
Suppose you want to do the following, what function or functions would you use:
R
read_xxx() (tidy), read.xxx() (base)
head(), tail(), glimpse(), table(), summary(), View()
data %>% filter(x > 0), data[data$x > 0, ], subset(data, x > 0)
data$variable, data %>% select(variable1, variable2), data[,c("x1","x2")]
data %>% mutate(x = y/10), data$x <- data$y/10
data %>% summarise(x_mn = mean(x, na.rm=T))
data %>% group_by(g) %>% summarise(x_mn = mean(x, na.rm=T))
Should you know exactly how to do all of this?
NO! Of course not. For Pete’s sake, Paul, It’s only the second week
Will you learn how to do much of this?
Maybe, but I’m feeling pretty overwhelmed…
How will you learn how to do these things?
With lots of practice, patience, and repetition motivated by a sense that these skills will help me learn about things I care about
When social scientists talk about descriptive inference, we’re trying to summarize our data and make claims about what’s typical of our data
Here are some common ways of summarizing data and how to calculate them with R
Description | Usage |
---|---|
sum | sum(x) |
minimum | min(x) |
maximum | max(x) |
range | range(x) |
mean | mean(x) |
median | median(x) |
percentile | quantile(x) |
variance | var(x) |
standard deviation | sd(x) |
rank | rank(x) |
Most of these functions have an argument called na.rm, which defaults to FALSE (rank() uses na.last instead). If your data have missing values, you’ll need to set na.rm=T (e.g. mean(x, na.rm=T))
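A short sketch of how missing values affect these summaries, using a made-up vector:

```r
x <- c(2, 4, NA, 10)

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # drops the NA before averaging
sd(x, na.rm = TRUE)
range(x, na.rm = TRUE)
```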
Measures of typical values
Mean (mean()): all the time
Median (median()): useful for describing distributions of variables, particularly those with extreme values
Measures of typical variation
var(): important for quantifying uncertainty, but rarely will you be calculating this directly
sd(): a good summary of a typical change in the data.
range(), min(), max(): useful for exploring data, detecting outliers and potential values that need to be recoded
Measures of association
Covariance (var()): central to describing relationships but generally not something you’ll calculate or interpret directly
Correlation (cor()): useful for describing [bivariate] relationships (positive or negative relationships).
We won’t spend much time on the formal definitions, math, and proofs
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$M_x = X_i : \int_{-\infty}^{x_i} f_x(X)\,dx = \int_{x_i}^{\infty} f_x(X)\,dx = 1/2$$
Useful eventually. Not necessary right now.
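For intuition rather than proof, a small made-up example shows how the mean and median definitions above behave when one value is extreme:

```r
x <- c(1, 2, 3, 4, 100)   # one extreme value

x_mean   <- mean(x)       # sum of the values divided by how many there are
x_median <- median(x)     # middle value of the sorted data

# The mean by hand, following the formula for x-bar
by_hand <- sum(x) / length(x)

c(mean = x_mean, median = x_median, by_hand = by_hand)
```

The single large value pulls the mean well above most of the data, while the median stays at the middle observation.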
Data visualization is an incredibly valuable tool that helps us to
Take a look at how the BBC uses R to produce its graphics
Today, we will:
grammar of graphics
ggplot()
Inspired by Wilkinson (2005)
A statistical graphic is a mapping of data variables to aesthetic attributes of geometric objects.
At a minimum, a graphic contains three core components:
data: the dataset containing the variables of interest.
aes: aesthetic attributes of the geometric object. For example, x/y position, color, shape, and size. Aesthetic attributes are mapped to variables in the dataset.
geom: the geometric object in question. This refers to the type of object we can observe in a plot. For example: points, lines, and bars.
In R, we’ll implement this grammar of graphics using the ggplot2 package
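A minimal sketch of the three core components in code, using the mtcars data set that ships with R rather than the course data. The object name fig_sketch is just for illustration.

```r
library(ggplot2)

# data: the built-in mtcars data set
# aes:  map weight to x, miles per gallon to y, cylinders to color
# geom: draw one point per car
fig_sketch <- ggplot(data = mtcars,
                     aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point()

fig_sketch
```

Swapping geom_point() for another geom changes the type of object drawn while the data and aesthetic mappings stay the same.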
data, aesthetics, geometries, and statistics I want it to plot
<labelled<double>[12]>: You're on a road trip with friends. Who controls the music?
[1] NA 3 1 2 3 2 2 2 1 2 2 NA
Labels:
 value label
     1 The driver, duh.
     2 The front seat, of course
     3 That jerk in the back who you don't even know but seems to have really strong feelings about Billy Joel's "Only the good die young"
# A tibble: 12 × 1
Playist
<fct>
1 <NA>
2 "That jerk in the back who you don't even know but seems to have really stro…
3 "The driver, duh."
4 "The front seat, of course"
5 "That jerk in the back who you don't even know but seems to have really stro…
6 "The front seat, of course"
7 "The front seat, of course"
8 "The front seat, of course"
9 "The driver, duh."
10 "The front seat, of course"
11 "The front seat, of course"
12 <NA>
df %>%
  filter(!is.na(trip)) %>%
  mutate(
    Playlist = str_wrap(forcats::as_factor(trip), 20)
  ) %>%
  ggplot(aes(x = Playlist, fill = Playlist)) +
  geom_bar(stat = "count") -> fig_roadtrip
fill aesthetic
df %>%
  filter(!is.na(trip)) %>%
  mutate(
    Playlist = str_wrap(forcats::as_factor(trip), 20)
  ) %>%
  ggplot(aes(x = Playlist, fill = Playlist)) +
  geom_bar(stat = "count") +
  scale_fill_brewer(guide = "none") +
  coord_flip() +
  labs(title = "Who controls the playlist", x = "", y = "") -> fig_roadtrip
In the remaining slides, we’ll see how to visualize some distributions and associations in the Covid data using:
summarize() and other data wrangling skills to transform data for plotting
factor() and related functions to control order of labels on axis
ggplot
What was the most common face mask policy in the data?
covid_us %>%
ungroup() %>%
mutate(
face_masks = forcats::fct_infreq(face_masks)
) %>%
ggplot(aes(x=face_masks,
fill = face_masks))+
geom_bar()+
geom_text(stat='count', aes(label=after_stat(count)),
hjust=.5,vjust=-.5)+
guides(fill = "none")+
theme_bw()+
labs(
x = "Face Mask Policy ",
title = ""
) -> fig_barplot
What does the distribution of new Covid-19 cases look like in June 2021?
covid_us %>%
filter(year_month == "2021-06") %>%
filter(new_cases > 0) %>%
ggplot(aes(x=new_cases))+
geom_histogram() +
labs(
title = "Exclude Negative Values"
) -> fig_hist2a
covid_us %>%
filter(year_month == "2021-06") %>%
filter(new_cases > 0) %>%
ggplot(aes(x=new_cases))+
geom_histogram() +
scale_x_log10()+
labs(
title = "Exclude Negative Values & Use log scale"
) -> fig_hist2b
fig_hist2 <- ggarrange(fig_hist2a, fig_hist2b)
What does the distribution of Covid-19 deaths look like?
covid_us %>%
mutate(
new_deaths = deaths - lag(deaths),
new_deaths_pc = new_deaths / population * 100000,
year_f = factor(year)
) %>%
filter(new_deaths > 0) %>%
ggplot(aes(x=new_deaths_pc,
col = year_f))+
geom_density() +
geom_rug() +
scale_x_log10() +
facet_wrap(~month)+
theme(legend.position = "bottom")->
fig_density2
How did the distribution of Covid-19 cases vary by face mask policy?
covid_us %>%
mutate(
Month = lubridate::month(date, label = T)
) %>%
filter(new_cases_pc > 0) %>%
filter(year == 2020) %>%
ggplot(aes(x= face_masks,
y=new_cases_pc,
col = face_masks))+
scale_y_log10()+
coord_flip() +
geom_boxplot() +
facet_wrap(~Month) +
theme(
legend.position = "bottom"
)-> fig_boxplot2
How did vaccination rates vary by state?
covid_us %>%
ungroup() %>%
mutate(
Label = case_when(
date == max(date) & percent_vaccinated == max(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
date == max(date) & percent_vaccinated == median(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
date == max(date) & percent_vaccinated == min(percent_vaccinated[date == max(date)], na.rm = T) ~ state,
TRUE ~ NA_character_
),
line_alpha = case_when(
state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ 1,
T ~ .3
),
line_col = case_when(
state %in% c("District of Columbia", "Nebraska", "Wyoming") ~ "black",
T ~ "grey"
)
) %>%
ggplot(
aes(x= date,
y=percent_vaccinated,
group = state
))+
geom_line(
aes(alpha = line_alpha,
col =line_col)) +
geom_text_repel(aes(label = Label),
direction = "x",
nudge_y = 2) +
guides(
alpha = "none",
col = "none"
)+
xlim(ym("2021-01"), ym("2023-01")) +
labs(
y = "Percent Vaccinated",
x = "Date"
) +
theme_bw()-> fig_line2
What’s the relationship between vaccination rates and new cases of Covid-19?
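The code for this figure did not survive on the slide. One way to approach the question is a scatter plot with a fitted line, sketched below on randomly generated data. The two variables are drawn independently, so the sketch shows only the mechanics and says nothing about the real relationship. The column names follow the slides, and the object names are hypothetical.

```r
library(ggplot2)

# Made-up values for illustration only; x and y are generated independently
set.seed(1600)
toy <- data.frame(
  percent_vaccinated = runif(50, min = 40, max = 90),
  new_cases_pc       = runif(50, min = 0, max = 60)
)

fig_scatter_sketch <- ggplot(toy, aes(x = percent_vaccinated, y = new_cases_pc)) +
  geom_point() +                             # one point per observation
  geom_smooth(method = "lm", se = FALSE) +   # straight-line summary of the trend
  labs(x = "Percent vaccinated", y = "New cases per 100,000") +
  theme_bw()

fig_scatter_sketch
```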
The grammar of graphics provides a language for translating data into figures
At a minimum figures with ggplot()
require three things:
To produce a figure:
Learning to code is hard, but the more errors you make now, the easier your life will be in the future
POLS 1600