Data Exploration

A typical data science project:

Visualisation (Ch. 3) is a great place to start with R programming: the payoff is immediate.
Data transformation (Ch. 5) deals with key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
In exploratory data analysis (Ch. 7), we’ll combine visualisation and transformation in order to ask and answer interesting questions about data.

Data Visualization

Tidyverse and ggplot2

In this chapter we will learn how to visualise data using ggplot2, a part of the tidyverse.
Install tidyverse, if you have not:
```
install.packages("tidyverse")
```

After installation, load tidyverse:

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The `mpg` data frame

Do cars with big engines use more fuel than cars with small engines?
What does the relationship between engine size and fuel efficiency look like?
Is it positive? Negative? Linear? Nonlinear?

The mpg data frame ships with the ggplot2 package (aka ggplot2::mpg):

mpg

displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).

Creating a ggplot

Scatterplot of hwy vs displ:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot() creates a coordinate system that you can add layers to.
First argument of ggplot() is the dataset to use in the graph.
Function geom_point() adds a layer of points to your plot.
mapping argument: defines how variables in your dataset are mapped to visual properties.
- Always paired with aes()
- x and y arguments of aes() specify which variables to map to the x and y axes.
ggplot2 looks for the mapped variable in the data argument, in this case, mpg.

Graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Aesthetic mappings

How can you explain the red dots?

Aesthetic: visual property of the objects in your plot.
- includes size, shape, or color of points.

Color of points

You can map the colors of your points to the class variable to reveal the class of each car:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

What can you say about the red dots in the previous plot?

Size of points

Assign different sizes to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

Transparency of points (“alpha”)

Assign different transparency levels to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

## Warning: Using alpha for a discrete variable is not advised.

Shape of points

Assign different shapes to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

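R has 25 built-in shapes, identified by numbers. Beware of some seeming duplicates: the difference comes from the interaction of the colour and fill aesthetics. Hollow shapes (0-14) have a border set by colour, solid shapes (15-20) are filled with colour, and filled shapes (21-24) have a border of colour and are filled with fill.

ggplot2 uses a maximum of 6 shapes at a time. By default, additional groups go unplotted.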
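
The mpg data has seven classes, so with the default palette one class (suv, the last in alphabetical order) is left without a shape. A minimal sketch of one way around this, assuming you are happy to pick the shape codes yourself (the codes 0 to 6 below are an arbitrary choice, and `p_shapes` is just an illustrative name):

```r
library(ggplot2)

# mpg has seven classes, one more than the default shape palette covers.
n_classes <- length(unique(mpg$class))

# Supply one shape code per class by hand (codes 0-6 are an arbitrary choice).
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

p_shapes
```

With many groups, shapes become hard to tell apart, so colour or facets may be the better choice.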

Manual setting of an aesthetic

Set the color of all points to be blue:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
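
A common slip is to put the colour name inside aes(). The sketch below contrasts the two placements; the object names `p_set` and `p_mapped` are illustrative only.

```r
library(ggplot2)

# Outside aes(): "blue" is a literal colour applied to every point.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a data value with a single level,
# so ggplot2 assigns a colour from its default palette and adds a legend.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Printing `p_mapped` should show points that are not blue, together with a legend whose only entry is labelled "blue".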

Facets

Facets are another way to add variables to a plot.
Facets divide a plot into subplots based on the values of one or more discrete variables.

A subplot for each car type, facet_wrap():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

A subplot for each car type and drive, facet_grid():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)
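
facet_grid() can also facet in one dimension only: a `.` in the formula stands for "no variable" on that side. A short sketch (the choice of cyl and drv is just for illustration):

```r
library(ggplot2)

# Facet on columns only: one panel per number of cylinders, in a single row.
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

# Facet on rows only: one panel per drive type, in a single column.
p_rows <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)
```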

Geometric objects

`geom_smooth()`: smooth line

How are these two plots similar?

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))   # point geom

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))  # smooth geom

Both plots show the same x and y variables from the same data. They differ only in the geom used to represent the data.

Recall that every geom function in ggplot2 takes a mapping argument.

Not every aesthetic works with every geom.

Different line types

Different line types according to drv:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Different line colors

Different line colors according to drv:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

Points and lines

Plot containing two geoms in the same graph!

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Same as

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth()

Aesthetics for each geometric object

Different aesthetics in different layers:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

Compare this with

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

(You’ll learn how filter() works in the next chapter: for now, just know that this command selects only the subcompact cars.)

Statistical transformations

`diamonds` data

Source: de Beers.

   diamonds

   nrow(diamonds)

## [1] 53940

Bar chart

Total number of diamonds in the diamonds dataset, grouped by cut:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))  # a new geom

count is not a variable in diamonds.
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
New values are computed via statistical transformations (stats).
Check available computed variables for a geometric object via help:
```
?geom_bar
```

?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().

Use stat_count() directly:

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

stat_count() has a default geom geom_bar().
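
To see what the stat is doing, the counts can also be computed by hand and plotted as they are. This is a sketch using base R's table(); the names `cut_counts`, `n` and `p_col` are illustrative, not part of the original slides.

```r
library(ggplot2)

# Count diamonds per cut by hand; table() returns one count per level.
cut_counts <- as.data.frame(table(cut = diamonds$cut))
names(cut_counts) <- c("cut", "n")

# geom_col() plots the supplied heights unchanged (it uses stat = "identity").
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

p_col
```

The resulting chart should match the geom_bar() version, because geom_bar() performs the same counting internally via stat_count().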

Display relative frequencies instead of counts:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))  # after_stat(prop) replaces the older ..prop.. notation

Custom stat:

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

Position adjustments

Color bar:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

Fill color:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Fill color according to another variable:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. The default behaviour is to stack bars on top of each other. The following code (you do not need to know the details now) shows the counts of clarity categories of the “Ideal” cut:

  diamonds %>% filter(cut == "Ideal") %>% select(clarity) %>% table()

## .
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##  146 2598 4282 5071 3589 2606 2047 1212

Note that the heights of the bar segments are proportional to these counts. (“I1” makes up too small a proportion to be easily seen in the chart.)

If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".

position_identity() places each object exactly where it falls (that is, every bar starts from 0 and the bars are drawn on top of one another, so they overlap):
```
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")
```
position="identity" is a shorthand for position_identity().
position_dodge() arranges elements side by side:
```
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
```
This is like position_identity() spread over the x-axis.
position_fill() stacks elements on top of one another, like the default behaviour, but normalizes the height of every stack to 1:
```
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```

position_stack() recovers the default plot:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")

position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
```
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```
geom_jitter() is a shorthand for geom_point(position = "jitter").
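
Because the noise is random, the plot changes on every run. A sketch of how the amount of jitter and the random seed could be controlled with position_jitter(); the particular width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# width and height give the maximum shift in each direction, in data units;
# seed fixes the random noise so the plot is reproducible.
p_jitter <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.05, height = 0.5, seed = 1)
  )

p_jitter
```

Small values keep the points close to their true positions while still separating exact overlaps.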

Coordinate systems

A boxplot:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

coord_cartesian() is the default cartesian coordinate system:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_cartesian(xlim = c(0, 5))

coord_fixed() specifies aspect ratio:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_fixed(ratio = 1/2)

coord_flip() swaps the x- and y-axes:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() + 
  coord_flip()

A map:

install.packages("maps")  # need to install this package

library("maps")

## 
## Attaching package: 'maps'

## The following object is masked from 'package:purrr':
## 
##     map

nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

coord_quickmap() sets the aspect ratio correctly for maps:

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

coord_polar() uses polar coordinates.

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

Recap: layered grammar of graphics

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

Graphics for communications (ch. 28)

Title

Figure title should be descriptive:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(title = "Fuel efficiency generally decreases with engine size")

Subtitle and caption

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) + 
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

Axis labels

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)"
  )

Math equations

df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )

?plotmath

Annotations

Create labels
```
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)
best_in_class
```
## manufacturer model             displ year cyl trans      drv cty hwy fl class
## chevrolet    corvette            5.7 1999   8 manual(m6) r    16  26 p  2seater
## dodge        caravan 2wd         2.4 1999   4 auto(l3)   f    18  24 r  minivan
## nissan       altima              2.5 2008   4 manual(m6) f    23  32 r  midsize
## subaru       forester awd        2.5 2008   4 manual(m5) 4    20  27 r  suv
## toyota       toyota tacoma 4wd   2.7 2008   4 manual(m5) 4    17  22 r  pickup
## volkswagen   jetta               1.9 1999   4 manual(m5) f    33  44 d  compact
## volkswagen   new beetle          1.9 1999   4 manual(m5) f    35  44 d  subcompact

Annotate points

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class)) +
  geom_text(aes(label = model), data = best_in_class)

The ggrepel package automatically adjusts labels so that they don’t overlap:

install.packages("ggrepel")

library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_point(size = 3, shape = 1, data = best_in_class) +
  ggrepel::geom_label_repel(aes(label = model), data = best_in_class)

Scales

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

ggplot2 automatically adds default scales behind the scenes, so the plot above is equivalent to:

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

breaks

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5))

labels

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_x_continuous(labels = NULL) +
  scale_y_continuous(labels = NULL)
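
Besides NULL, the labels argument also accepts a function that turns break values into text. A small sketch combining it with breaks; `fmt_mpg` is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: append the unit to each break value.
fmt_mpg <- function(x) paste0(x, " mpg")

p_labels <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(breaks = seq(15, 40, by = 5), labels = fmt_mpg)

p_labels
```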

Plot y-axis at log scale:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()

Plot x-axis in reverse order:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_reverse()

Legends

Set legend position: "left", "right", "top", "bottom", or "none":

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) + 
  theme(legend.position = "left")

See following link for more details on how to change title, labels, … of a legend.

http://www.sthda.com/english/wiki/ggplot2-legend-easy-steps-to-change-the-position-and-the-appearance-of-a-graph-legend-in-r-software
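
As a starting point, the legend title can be set through labs() and the layout of the keys through guides(). The sketch below is one possible combination; the title text "Car class" and the key size are arbitrary choices for illustration.

```r
library(ggplot2)

p_legend <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  labs(colour = "Car class") +              # legend title
  theme(legend.position = "bottom") +       # move the legend below the plot
  guides(
    # put all keys on one row and draw them larger than the plotted points
    colour = guide_legend(nrow = 1, override.aes = list(size = 4))
  )

p_legend
```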

Zooming

Without clipping (does not remove unseen data points)

ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))

With clipping (removes unseen data points)

ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  xlim(5, 7) + ylim(10, 30)

ggplot(mpg, mapping = aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth() +
  scale_x_continuous(limits = c(5, 7)) +
  scale_y_continuous(limits = c(10, 30))

mpg %>%
  filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
  ggplot(aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth()
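
The difference matters mostly for geom_smooth(): with clipping, the smoother is fitted only to the rows inside the window. A rough sketch that counts how many rows each approach works with, using the same window as above (the variable names are illustrative):

```r
library(ggplot2)

# Rows of mpg that fall inside the zoom window used above.
in_window <- with(mpg, displ >= 5 & displ <= 7 & hwy >= 10 & hwy <= 30)

n_total <- nrow(mpg)          # rows used when zooming with coord_cartesian()
n_kept  <- sum(in_window)     # rows left after xlim()/ylim() or filter()
n_lost  <- n_total - n_kept   # rows the smoother never sees when clipping

c(total = n_total, kept = n_kept, lost = n_lost)
```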

Themes

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  theme_bw()

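ggplot2 has eight built-in themes: theme_bw(), theme_light(), theme_classic(), theme_linedraw(), theme_dark(), theme_minimal(), theme_gray() (the default), and theme_void().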

Saving plots

ggplot(mpg, aes(displ, hwy)) + geom_point()

ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
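
By default ggsave() saves the last plot displayed, at the size of the current graphics device. For reproducible output it may be safer to pass the plot and the dimensions explicitly. A minimal sketch, assuming a 6 x 4 inch PDF is wanted; the temporary output path is a placeholder.

```r
library(ggplot2)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output path in a temporary directory.
out_file <- file.path(tempdir(), "my-plot.pdf")

# Name the plot and fix the size so the result does not depend on
# whatever happens to be on screen.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
```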

More information on ggplot2

ggplot2 provides over 30 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://rstudio.com/resources/cheatsheets/.

Lecture 2: Data Visualisation

Joong-Ho Won @ SNU
