A typical data science project:
Visualisation (Ch. 3) is a great place to start with R programming: the payoff is immediate.
Data transformation (Ch. 5) deals with key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
In exploratory data analysis (Ch. 7), we’ll combine visualisation and transformation in order to ask and answer interesting questions about data.
In this chapter we will learn how to visualise data using ggplot2, a part of the tidyverse.
Install tidyverse, if you have not:
install.packages("tidyverse")
After installation, load tidyverse with:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
mpg data frame
The mpg data frame can be found in the ggplot2 package (aka ggplot2::mpg):
mpg
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
Scatterplot of hwy vs displ:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot() creates a coordinate system that you can add layers to.
The first argument of ggplot() is the dataset to use in the graph.
The function geom_point() adds a layer of points to your plot.
The mapping argument defines how variables in your dataset are mapped to visual properties.
aes(): the x and y arguments of aes() specify which variables to map to the x and y axes.
ggplot2 looks for the mapped variable in the data argument, in this case, mpg.
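Because ggplot() only sets up the coordinate system, a plot can be built up one step at a time. The following is a minimal sketch of that idea; the object names base and scatter are illustrative and do not appear elsewhere in these notes.

```r
library(ggplot2)

# ggplot() alone creates an empty coordinate system with no layers.
base <- ggplot(data = mpg)
length(base$layers)      # number of layers so far

# Adding geom_point() attaches one layer of points to the plot object.
scatter <- base +
  geom_point(mapping = aes(x = displ, y = hwy))
length(scatter$layers)   # one layer more than before

# Printing the object draws the plot.
scatter
```

Storing intermediate plot objects like this is optional; it mainly helps when several variants share the same starting point.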
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
How can you explain the red dots?
You can map the colors of your points to the class variable to reveal the class of each car:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
What can you say about the red dots in the previous plot?
Assign different sizes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
Set the color of all points to be blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
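Whether color goes inside or outside aes() matters. As a hedged sketch of the difference (the names p_mapped and p_set are illustrative): inside aes(), "blue" is treated as a data value and mapped through a colour scale, so the points get the scale's default colour and a legend appears; outside aes(), it sets the colour of the points directly.

```r
library(ggplot2)

# Inside aes(): "blue" is a constant data value that is *mapped* to a colour.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): the colour is *set* manually for every point in the layer.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# The mapped version carries a colour aesthetic; the set version a fixed parameter.
names(p_mapped$layers[[1]]$mapping)
p_set$layers[[1]]$aes_params
```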
Another way to add additional variables
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type, using facet_wrap():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive, using facet_grid():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
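facet_grid() can also facet along a single dimension by putting . in place of the row or column variable. A small sketch, assuming you want to split by cyl (number of cylinders, a variable in mpg) or by drv; the object names are illustrative.

```r
library(ggplot2)

# One column of panels per value of cyl, no row faceting.
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

# One row of panels per value of drv, no column faceting.
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

p_cols
p_rows
```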
geom_smooth(): smooth line
How are these two plots similar?
# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) # point geom
# right
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy)) # smooth geom
Both plots show the same x and y variables from the same data, but they use different geoms.
Recall that every geom function in ggplot2 takes a mapping argument.
Different line types according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Plot containing two geoms in the same graph!
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
Compare this with
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
(You’ll learn how filter() works in the next chapter: for now, just know that this command selects only the subcompact cars.)
Total number of diamonds in the diamonds dataset, grouped by cut:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut)) # a new geom
count is not a variable in diamonds.
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
New values are computed via statistical transformations (stats).
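To see what a stat computes, you can inspect the data ggplot2 builds for a layer, or compute the counts yourself and plot them untransformed. The following is a sketch using layer_data() from ggplot2 and count() from dplyr; the names computed and cut_counts are illustrative.

```r
library(ggplot2)
library(dplyr)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() returns the data frame computed for a layer;
# its count column holds the values produced by the stat.
computed <- layer_data(p, 1)
computed$count

# The same counts computed by hand ...
cut_counts <- diamonds %>% count(cut)
cut_counts

# ... and drawn as-is: geom_col() plots the y values it is given.
ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
```

Both routes should give the same bar heights; the second one just makes the intermediate table explicit.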
Check available computed variables for a geometric object via help:
?geom_bar
?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().
Use stat_count() directly:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
stat_count() has a default geom, geom_bar().
Display relative frequencies instead of counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
Custom stat:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
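The values drawn by stat_summary() can be cross-checked by computing them directly with dplyr. This is a sketch under the assumption that the same three functions (min, median, max) are used; depth_summary is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# One row per cut with the minimum, median and maximum depth.
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),
    y    = median(depth),
    ymax = max(depth)
  )
depth_summary

# Draw the precomputed summaries as points with vertical ranges.
ggplot(data = depth_summary) +
  geom_pointrange(mapping = aes(x = cut, y = y, ymin = ymin, ymax = ymax))
```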
Color bar:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position argument. The default behaviour is to stack bars on top of each other. The following code (you do not need to know the details now) shows the counts of clarity categories of the “Ideal” cut:
diamonds %>% filter(cut == "Ideal") %>% select(clarity) %>% table()
## .
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 146 2598 4282 5071 3589 2606 2047 1212
Note that the heights of the bars are proportional to these counts. (“I1” takes up too small a proportion to be easily seen in the chart.)
If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".
position_identity() places each object exactly where it falls (that is, each bar starts from 0 and the bars are superposed on each other):
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")
position = "identity" is a shorthand for position_identity().
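With position = "identity" the bars overlap, so they usually need to be made see-through before the overlap is visible. A sketch of two common options, a low alpha or no fill at all; the alpha value of 1/5 is an arbitrary choice and the object names are illustrative.

```r
library(ggplot2)

# Option 1: semi-transparent fill, so bars behind remain visible.
p_alpha <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 1/5, position = "identity")

# Option 2: no fill, coloured outlines only.
p_outline <- ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
  geom_bar(fill = NA, position = "identity")

p_alpha
p_outline
```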
position_dodge() arranges elements side by side:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
This is like position_identity() spread over the x-axis.
position_fill() stacks elements on top of one another, like the default behaviour, but normalizes the height:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_stack() recovers the default plot:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is a shorthand for geom_point(position = "jitter").
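A minimal sketch of the geom_jitter() shorthand. The width and height values are arbitrary choices for illustration, and set.seed() is only there because the noise is random.

```r
library(ggplot2)

set.seed(1)  # jitter is random; a seed makes the figure reproducible

# width and height control the amount of noise added in each direction.
p_jitter <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.05, height = 0.5)

p_jitter
```

Smaller values keep points closer to their true positions at the cost of more overlap.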
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
coord_cartesian() is the default Cartesian coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies the aspect ratio:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
A map:
install.packages("maps") # need to install this package
library("maps")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio correctly for maps:
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar() uses polar coordinates.
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
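As one possible way to fill in every slot of the template, here is a sketch using the diamonds data. The particular choices (a "fill" position, flipped coordinates, faceting by color) are illustrative only.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +          # <DATA>
  geom_bar(                             # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity),  # <MAPPINGS>
    stat = "count",                     # <STAT> (the default for geom_bar)
    position = "fill"                   # <POSITION>
  ) +
  coord_flip() +                        # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                   # <FACET_FUNCTION>

p
```

In practice most slots can be left out, because ggplot2 supplies defaults for the stat, position, coordinate system and faceting.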
Figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)"
)
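labs() also accepts subtitle and caption arguments. A short sketch; the subtitle and caption wording here is illustrative, not taken from these notes.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov",
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    color = "Car type"   # legend title for the color aesthetic
  )

p
```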
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
?plotmath
Create labels
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
best_in_class
## manufacturer  model              displ  year  cyl  trans       drv  cty  hwy  fl  class
## chevrolet     corvette           5.7    1999  8    manual(m6)  r    16   26   p   2seater
## dodge         caravan 2wd        2.4    1999  4    auto(l3)    f    18   24   r   minivan
## nissan        altima             2.5    2008  4    manual(m6)  f    23   32   r   midsize
## subaru        forester awd       2.5    2008  4    manual(m5)  4    20   27   r   suv
## toyota        toyota tacoma 4wd  2.7    2008  4    manual(m5)  4    17   22   r   pickup
## volkswagen    jetta              1.9    1999  4    manual(m5)  f    33   44   d   compact
## volkswagen    new beetle         1.9    1999  4    manual(m5)  f    35   44   d   subcompact
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
The ggrepel package automatically adjusts labels so that they don’t overlap:
install.packages("ggrepel")
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
ggplot2 automatically adds default scales, so the code above is equivalent to:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
breaks controls the position of the ticks:
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
labels controls the text label associated with each tick (NULL suppresses the labels):
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
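breaks and labels can be combined in one scale call, and labels may be a function that is applied to the break values. A sketch, assuming you want a unit suffix on the y-axis; the " mpg" suffix is an illustrative choice.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    breaks = seq(15, 40, by = 5),          # where the ticks go
    labels = function(x) paste(x, "mpg")   # how each tick is labelled
  )

p
```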
Plot the y-axis on a log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_y_log10()
Plot the x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_reverse()
Set legend position: "left", "right", "top", "bottom", "none":
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
theme(legend.position = "left")
See following link for more details on how to change title, labels, … of a legend.
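The layout of an individual legend can be adjusted with guides() together with guide_legend(). A sketch, assuming a single-row legend at the bottom with enlarged keys; the nrow and size values are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  theme(legend.position = "bottom") +
  guides(
    # Put all keys in one row and draw the legend points larger than in the plot.
    colour = guide_legend(nrow = 1, override.aes = list(size = 4))
  )

p
```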
Without clipping (does not remove unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
xlim(5, 7) + ylim(10, 30)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 7)) +
scale_y_continuous(limits = c(10, 30))
mpg %>%
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
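One way to see the difference between zooming and clipping is to look at the data the smoother was actually fitted to. The following is a sketch using layer_data(); the object names are illustrative, and the loess method is spelled out only to silence the default-method message.

```r
library(ggplot2)

base <- ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Zoom: all data are kept, only the visible window changes.
p_zoom <- base + coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

# Clip: observations outside the limits are dropped before the smoother is fitted.
p_clip <- base + xlim(5, 7) + ylim(10, 30)

# Layer 2 is the smoother. Warnings about removed rows are expected for p_clip.
smooth_zoom <- layer_data(p_zoom, 2)
smooth_clip <- suppressWarnings(layer_data(p_clip, 2))

range(smooth_zoom$x)  # covers the full range of displ
range(smooth_clip$x)  # restricted to the requested limits
```

Because the clipped smoother is fitted to fewer observations, its curve inside the window can differ from the zoomed one.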
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
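By default ggsave() saves the last plot at the size of the current graphics device, which is why the message above reports 5 x 3.5 in. For reproducible output it can help to pass the plot and its dimensions explicitly. A sketch; the file location and the 6 x 4 inch size are illustrative choices.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Write to a temporary directory here; use any path you like in practice.
out_file <- file.path(tempdir(), "my-plot.pdf")

# plot, width, height and units make the result independent of the current device.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")

file.exists(out_file)
```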
ggplot2 provides over 30 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://rstudio.com/resources/cheatsheets/.