This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.
We will spend the next couple of lectures studying R. I’ll closely follow a few great books by Hadley Wickham.
Data wrangling (import, visualization, transformation, tidy): R for Data Science by Garrett Grolemund and Hadley Wickham.
R programming, Rcpp: Advanced R by Hadley Wickham.
R package development: R Packages by Hadley Wickham.
A typical data science project:
tidyverse is a collection of R packages that make data wrangling easy.
Install tidyverse from the RStudio menu Tools -> Install Packages... or
install.packages("tidyverse")
After installation, load tidyverse by
library("tidyverse")
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'dplyr' was built under R version 3.5.1
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
mpg data
The mpg data is available from the ggplot2 package:
mpg
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
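Before plotting, it can help to confirm which columns the data set actually contains. A minimal sketch using only base R functions on the mpg data from ggplot2; the choice of summaries is illustrative, not part of the original notes:

```r
library(ggplot2)  # provides the mpg data set

# Dimensions and column names of mpg
dim(mpg)
names(mpg)

# Quick numeric summary of the two variables used in the plots that follow
summary(mpg[, c("displ", "hwy")])
```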
hwy vs displ
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))
Check available aesthetics for a geometric object by ?geom_point.
Color points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))
Assign different sizes to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
At most 6 shapes are used at a time. By default, additional groups go unplotted.
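One way around the shape limit is to supply the shape codes by hand with scale_shape_manual(). A minimal sketch, assuming it is acceptable to choose the shapes manually; the names n_class and p_shapes are illustrative and not from the original notes:

```r
library(ggplot2)

# mpg$class has more levels than the default shape palette provides
n_class <- length(unique(mpg$class))

# Supply one shape code per class so that no group is dropped
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_class))

p_shapes
```

With this many groups, shapes become hard to tell apart, so color or facets are often the more readable choice.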
Set the color of all points to blue by placing color outside aes():
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)
geom_smooth(): smooth line
hwy vs displ line:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
Different line types according to drv:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Smooth line overlaid on a scatter plot:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))
This is the same as:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
diamonds data
The diamonds data, also from ggplot2:
diamonds
geom_bar() creates a bar chart:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot computed variables instead of raw data.
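To make the computed variable visible, the counts can be calculated explicitly and plotted as given. A minimal sketch, assuming dplyr is loaded; cut_counts and p_counts are illustrative names, and geom_col() is used here because it applies no statistical transformation:

```r
library(ggplot2)
library(dplyr)

# Compute the counts that geom_bar() would otherwise compute internally
cut_counts <- diamonds %>%
  count(cut)

# geom_col() plots the heights as given, with no statistical transformation
p_counts <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

p_counts
```

The resulting chart should match the geom_bar() version, since both show the number of diamonds per cut.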
Check available computed variables for a geometric object via help:
?geom_bar
Use stat_count() directly:
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
The default geom of stat_count() is geom_bar().
Display proportions instead of counts (ggplot2 3.3.0 and later spell ..prop.. as after_stat(prop)):
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
Color the bar outlines:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))
position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is similar:
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy))
position_fill() stacks elements on top of one another and normalizes the height:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_dodge() arranges elements side by side:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position_stack() stacks elements on top of each other:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()
coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies the aspect ratio:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_flip()
A map:
install.packages("maps")  # need to install this package
library("maps")
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio correctly for maps:
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
A figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")
Add a subtitle and a caption:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )
Replace the axis labels:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)"
  )
Use mathematical expressions in labels:
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )
?plotmath
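quote() produces a fixed expression. When a label needs to include a computed value, base R's bquote() can substitute it into the plotmath expression. A minimal sketch; df_demo, r_value, and p_math are illustrative names, and the seed is arbitrary:

```r
library(ggplot2)
library(tibble)

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
df_demo <- tibble(x = runif(10), y = runif(10))

# Sample correlation, rounded for display
r_value <- round(cor(df_demo$x, df_demo$y), 2)

# bquote() substitutes the value wrapped in .() into the plotmath expression
p_math <- ggplot(df_demo, aes(x, y)) +
  geom_point() +
  labs(title = bquote(rho == .(r_value)))

p_math
```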
Create labels
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)
best_in_class
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)
The ggrepel package automatically adjusts labels so that they don’t overlap:
install.packages("ggrepel")
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
The plot
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))
automatically adds default scales, so it is the same as:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()
Set the axis breaks:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))
Suppress the axis labels:
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)
Plot the y-axis on a log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()
Plot the x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_reverse()
Set the legend position to "left", "right", "top", "bottom", or "none":
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  theme(legend.position = "left")
See the following link for more details on how to change the title, labels, … of a legend.
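As a small illustration of legend customization, the legend title, position, and key layout can each be set separately. A minimal sketch; the title text "Car class" and the name p_legend are illustrative choices, not from the original notes:

```r
library(ggplot2)

p_legend <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  labs(colour = "Car class") +               # legend title
  theme(legend.position = "bottom") +        # move the legend below the plot
  guides(colour = guide_legend(nrow = 1))    # lay out the keys in a single row

p_legend
```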
Without clipping (zooms in; points outside the limits still inform the smoother):
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes unseen data points):
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  xlim(5, 7) + ylim(10, 30)
Equivalently, set the limits on the scales:
ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  scale_x_continuous(limits = c(5, 7)) +
  scale_y_continuous(limits = c(10, 30))
Or filter the data before plotting:
mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
Apply a built-in theme:
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme_bw()
Save the most recent plot with ggsave():
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
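The size reported above comes from the current graphics device, so it can differ between sessions. A minimal sketch that passes the plot and the dimensions explicitly; the temporary output path and the names p_save and out_file are hypothetical:

```r
library(ggplot2)

p_save <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output path in the session's temporary directory
out_file <- file.path(tempdir(), "my-plot.pdf")

# Pass the plot and the size explicitly instead of relying on defaults
ggsave(out_file, plot = p_save, width = 5, height = 3.5, units = "in")
```

Naming the plot object also avoids saving the wrong figure when several plots were drawn in the same session.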