This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.
We will spend the next couple of lectures studying R. I’ll closely follow a few great books by Hadley Wickham.
Data wrangling (import, visualization, transformation, tidy).
R for Data Science by Garrett Grolemund and Hadley Wickham.
R programming, Rcpp.
Advanced R by Hadley Wickham.
R package development.
R Packages by Hadley Wickham.
A typical data science project:
tidyverse is a collection of R packages that make data wrangling easy.
Install tidyverse from the RStudio menu Tools -> Install Packages... or by running
install.packages("tidyverse")
After installation, load tidyverse by
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
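The conflict messages above mean that, once tidyverse is loaded, a bare filter() or lag() call resolves to the dplyr version. A minimal sketch of how either version can still be reached explicitly with the :: operator; the built-in mtcars data and the short vector 1:10 are chosen here purely for illustration:

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() applies a linear filter to a series;
# here a centered 3-point moving average of 1, 2, ..., 10
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
as.numeric(moving_avg)
```

Writing the package prefix is optional for the dplyr version, but it makes the intent unambiguous in scripts that mix both.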
mpg data
The mpg data is available from the ggplot2 package:
mpg
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
hwy vs displ
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
Check available aesthetics for a geometric object by ?geom_point.
Color points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Assign different sizes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
ggplot2 uses at most 6 shapes at a time by default. Additional groups go unplotted; mpg has 7 classes, so one class is dropped here.
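One way around the shape limit is to supply the shapes by hand. The sketch below assumes that a manual shape scale with one standard R point-shape code per class is acceptable for the plot; the names n_class and p_shapes are introduced here only for illustration:

```r
library(ggplot2)

# mpg has more car classes than the default shape palette can show
n_class <- length(unique(mpg$class))
n_class

# Assign one shape code per class by hand
# (integer codes 0-25 are the standard R point shapes)
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_class))
p_shapes
```

With this many groups, color or facets are often easier to read than shapes, so this is a workaround rather than a recommendation.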
Set the color of all points to be blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
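Note that color = "blue" sits outside aes() in the call above. A short sketch of why the placement matters; the object names fixed_blue and mapped_blue are made up for this illustration:

```r
library(ggplot2)

# Outside aes(): a fixed setting applied to every point
fixed_blue <- geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a data value mapped to the color scale,
# so the points get the default palette color and a legend entry "blue"
mapped_blue <- geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

p_fixed  <- ggplot(data = mpg) + fixed_blue
p_mapped <- ggplot(data = mpg) + mapped_blue

# The layer objects record the difference
fixed_blue$aes_params
names(mapped_blue$mapping)
```

Printing p_fixed and p_mapped side by side shows the effect: only the first plot has blue points.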
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
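facet_grid() can also facet in one direction only by putting a . on the other side of the formula. A minimal sketch, reusing the same scatter plot; base_plot, p_rows and p_cols are illustrative names:

```r
library(ggplot2)

base_plot <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# Rows only: one row of panels per drive type
p_rows <- base_plot + facet_grid(drv ~ .)

# Columns only: one column of panels per number of cylinders
p_cols <- base_plot + facet_grid(. ~ cyl)

p_rows
p_cols
```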
geom_smooth(): smooth line
hwy vs displ line:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Different line types according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Lines overlaid over scatter plot:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
diamonds data
diamonds data:
diamonds
geom_bar() creates a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
Check available computed variables for a geometric object via help:
?geom_bar
Use stat_count() directly:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
stat_count() has a default geom geom_bar().
Display relative frequencies (proportions) instead of counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
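The counts and proportions that the stat computes can be reproduced by hand, which also shows why group = 1 is needed: without it, proportions are computed within each cut separately and every bar has height 1. The following is a sketch of the equivalent computation with dplyr, not the internal implementation; cut_summary is a name introduced here:

```r
library(ggplot2)
library(dplyr)

# One row per cut, with its count and its share of all diamonds
cut_summary <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_summary

# geom_col() plots precomputed heights as they are
ggplot(data = cut_summary) +
  geom_col(mapping = aes(x = cut, y = prop))
```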
Color the bar outlines:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is similar:
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
position_fill() stacks elements on top of one another and normalizes the height:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_dodge() arranges elements side by side:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position_stack() stacks elements on top of each other:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies the aspect ratio:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
A map:
install.packages("maps") # need to install this package
library("maps")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio correctly for maps:
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
Figure titles should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)"
)
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
?plotmath
Create labels
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
best_in_class
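To see what the filter on row_number(desc(hwy)) does, here is a small sketch on a made-up table (the toy data and its values are invented for illustration). Within each group, rows are ranked by decreasing hwy and only the top-ranked row is kept; row_number() breaks ties by order of appearance, so exactly one row per class survives:

```r
library(dplyr)

# Hypothetical toy data: two classes, with a tie in class "b"
toy <- tibble(
  class = c("a", "a", "b", "b"),
  model = c("m1", "m2", "m3", "m4"),
  hwy   = c(30, 35, 20, 20)
)

best_toy <- toy %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1) %>%
  ungroup()
best_toy
```

If all tied rows should be kept instead, min_rank() in place of row_number() would be one option.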
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
The ggrepel package automatically adjusts labels so that they don’t overlap:
install.packages("ggrepel")
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
automatically adds default scales; it is equivalent to
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
breaks
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
labels
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
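Besides NULL, the labels argument also accepts a function that turns break values into label text. A minimal sketch, where add_unit is a hypothetical helper written for this example:

```r
library(ggplot2)

# Hypothetical helper: append a unit to each break value
add_unit <- function(x) paste0(x, " mpg")

p_labeled <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5), labels = add_unit)
p_labeled
```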
Plot the y-axis on a log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_y_log10()
Plot x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_reverse()
Set legend position: "left", "right", "top", "bottom", "none":
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
theme(legend.position = "left")
See following link for more details on how to change title, labels, … of a legend.
Without clipping (keeps all data points; only the view is zoomed):
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes data points outside the limits):
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
xlim(5, 7) + ylim(10, 30)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 7)) +
scale_y_continuous(limits = c(10, 30))
mpg %>%
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
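The distinction matters because geom_smooth() fits its curve to whatever data remain. As a rough check, the sketch below counts how many rows of mpg survive the limits used above; the object name inside is introduced here for illustration:

```r
library(ggplot2)
library(dplyr)

# Rows that fall inside the plotting window used above
inside <- mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30)

# Setting limits on the scales fits the smoother to the second count only;
# coord_cartesian() fits it to all rows and then zooms
c(all_rows = nrow(mpg), rows_within_limits = nrow(inside))
```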
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
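By default ggsave() saves the last plot displayed, at the size of the current graphics device. A sketch of a more explicit call that names the plot object and fixes the size in inches; the temporary output path and the chosen dimensions are assumptions made for this example:

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Write to a temporary directory so the example leaves no file behind
out_file <- file.path(tempdir(), "my-plot.pdf")

# Name the plot explicitly and set width and height (inches by default)
ggsave(out_file, plot = p, width = 6, height = 4)
file.exists(out_file)
```

Fixing the size in the call keeps the saved figure the same regardless of how the plot window happened to be sized.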