Data Exploration

A typical data science project:

Data Visualization

Tidyverse and ggplot2

  • In this chapter we will learn how to visualise data using ggplot2, a part of the tidyverse.

  • Install tidyverse if you have not already:

    install.packages("tidyverse")
  • After installation, load tidyverse with:

    library("tidyverse")
    ## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──
    ## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
    ## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
    ## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
    ## ✓ readr   1.3.1     ✓ forcats 0.5.0
    ## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
    ## x dplyr::filter() masks stats::filter()
    ## x dplyr::lag()    masks stats::lag()

The mpg data frame

  1. Do cars with big engines use more fuel than cars with small engines?
  2. What does the relationship between engine size and fuel efficiency look like?
  3. Is it positive? Negative? Linear? Nonlinear?
  • The mpg data frame can be found in the ggplot2 package (aka ggplot2::mpg):
    mpg
  • displ: engine size, in litres.
    hwy: highway fuel efficiency, in miles per gallon (mpg).
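
Before plotting, a quick numerical look at the two variables can give a first hint about the questions above. The following is a minimal sketch; the variable name displ_hwy_cor is chosen here for illustration, and a correlation coefficient only summarises linear association, so it does not replace the plot.

```r
library(ggplot2)

# Dimensions of the data frame: rows (cars) and columns (variables).
dim(mpg)

# Summaries of the two variables used in this chapter.
summary(mpg$displ)
summary(mpg$hwy)

# Pearson correlation between engine size and highway fuel efficiency.
# A negative value indicates that larger engines tend to go with lower hwy.
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)
displ_hwy_cor
```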

Creating a ggplot

  • Scatterplot of hwy vs displ:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy))

  • ggplot() creates a coordinate system that you can add layers to.

  • First argument of ggplot() is the dataset to use in the graph.

  • Function geom_point() adds a layer of points to your plot.

  • mapping argument: defines how variables in your dataset are mapped to visual properties.

    • Always paired with aes()
    • x and y arguments of aes() specify which variables to map to the x and y axes.
  • ggplot2 looks for the mapped variable in the data argument, in this case, mpg.
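
The points above can be checked by building the plot in two steps and inspecting what ggplot2 computed for the layer. This is a sketch only; the object names base_plot, scatter and points_df are illustrative and not part of the original notes.

```r
library(ggplot2)

# Step 1: ggplot() alone sets up the dataset and coordinate system, no layers yet.
base_plot <- ggplot(data = mpg)

# Step 2: adding geom_point() attaches one layer of points.
scatter <- base_plot +
  geom_point(mapping = aes(x = displ, y = hwy))

# layer_data() returns the data frame ggplot2 built for a layer;
# here x and y come from the displ and hwy columns of mpg.
points_df <- layer_data(scatter, 1)
head(points_df[, c("x", "y")])
```

Because the mapped variables are looked up in the data argument, points_df has one row per row of mpg.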

Graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Aesthetic mappings

How can you explain the red dots?

  • Aesthetic: visual property of the objects in your plot.
    • includes size, shape, or color of points.

Color of points

You can map the colors of your points to the class variable to reveal the class of each car:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

What can you say about the red dots in the previous plot?

Size of points

  • Assign different sizes to points according to class:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, size = class))

Transparency of points (“alpha”)

  • Assign different transparency levels to points according to class:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
    ## Warning: Using alpha for a discrete variable is not advised.

Shape of points

  • Assign different shapes to points according to class:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy, shape = class))

R has 25 built-in shapes that are identified by numbers. Some look like duplicates; the difference comes from the interaction of the colour and fill aesthetics.

  • Maximum of 6 shapes at a time. By default, additional groups will go unplotted.
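
One possible way around the six-shape limit is to supply the shapes by hand with scale_shape_manual(). The sketch below assumes shape codes 0 to 6 are an acceptable choice; any seven distinct codes would do.

```r
library(ggplot2)

# class has 7 distinct values, one more than the default shape palette covers.
n_classes <- length(unique(mpg$class))
n_classes

# Supply one shape code per class by hand (codes 0 to 6 here).
all_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)

all_shapes
```

With seven shapes for seven classes, no group should be left unplotted, although plots with this many shapes can be hard to read.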

Manual setting of an aesthetic

  • Set the color of all points to be blue:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
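
A common slip is to put the color inside aes(). The sketch below contrasts the two forms; the object names p_set and p_mapped are illustrative.

```r
library(ggplot2)

# Setting: color is outside aes(), so every point is drawn in blue.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping by mistake: inside aes(), "blue" is treated as a one-level
# categorical variable, so ggplot2 picks a palette color and adds a legend.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colors actually used for the points in each plot.
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```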

Facets

  • Another way to add additional variables

  • Facets divide a plot into subplots based on the values of one or more discrete variables.

  • A subplot for each car type, facet_wrap():

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      facet_wrap(~ class, nrow = 2)


  • A subplot for each car type and drive, facet_grid():

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      facet_grid(drv ~ class)
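
As a rough check of how facets split the data, the number of subplots that contain data can be read from the built plot. The sketch also shows facet_grid() with a "." on one side, which facets in a single dimension only. Object names are illustrative.

```r
library(ggplot2)

by_class <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

# "." means "no variable" for the rows, so there is one column per drv value.
by_drive <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ drv)

# Each row of the built layer data carries a PANEL id; the number of
# distinct ids equals the number of subplots that contain data.
n_panels_class <- length(unique(layer_data(by_class)$PANEL))
n_panels_drive <- length(unique(layer_data(by_drive)$PANEL))
c(class = n_panels_class, drv = n_panels_drive)
```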

Geometric objects

geom_smooth(): smooth line

How are these two plots similar?

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))   # point geom

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))  # smooth geom

They use different geoms.

Recall that every geom function in ggplot2 takes a mapping argument.

  • Not every aesthetic works with every geom.

Different line types

  • Different line types according to drv:

    ggplot(data = mpg) + 
      geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Different line colors

  • Different line colors according to drv:

    ggplot(data = mpg) + 
      geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

Points and lines

  • Plot containing two geoms in the same graph!

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy)) + 
      geom_smooth(mapping = aes(x = displ, y = hwy))


  • Same as

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point() + geom_smooth()

Aesthetics for each geometric object

  • Different aesthetics in different layers:

    ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point(mapping = aes(color = class)) + 
      geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

  • Compare this with

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

(You’ll learn how filter() works in the next chapter: for now, just know that this command selects only the subcompact cars.)

Statistical transformations

diamonds data

Source: de Beers.

   diamonds
   nrow(diamonds)
## [1] 53940

Bar chart

  • Total number of diamonds in the diamonds dataset, grouped by cut:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut))  # a new geom


  • count is not a variable in diamonds.

  • Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.

  • New values are computed via statistical transformations (stats).

  • Check available computed variables for a geometric object via help:

    ?geom_bar

  • ?geom_bar shows that the default value for stat is “count”, which means that geom_bar() uses stat_count().

  • Use stat_count() directly:

    ggplot(data = diamonds) + 
      stat_count(mapping = aes(x = cut))

  • stat_count() has a default geom geom_bar().


  • Display relative frequencies instead of counts:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))  # after_stat() is the current spelling of ..prop..


  • Custom stat:

    ggplot(data = diamonds) + 
      stat_summary(
        mapping = aes(x = cut, y = depth),
        fun.min = min,
        fun.max = max,
        fun = median
      )
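
To make the count transformation concrete, the counts can also be computed by hand and plotted as they are. This is a sketch under the assumption that base R table() is an acceptable stand-in for what stat_count() does; geom_col() draws bars at the given y values without any statistical transformation.

```r
library(ggplot2)

# Count the diamonds in each cut by hand.
cut_counts <- as.data.frame(table(cut = diamonds$cut))
names(cut_counts) <- c("cut", "n")

# geom_col() plots the y values as given.
manual_bars <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))

# The counts computed by geom_bar(), for comparison.
auto_bars <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
computed <- layer_data(auto_bars)

cbind(cut_counts, geom_bar_count = computed$count)
```

The two count columns should agree, which shows that geom_bar() plots a computed variable rather than a column of diamonds.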

Position adjustments

  • Color bar:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, colour = cut))


  • Fill color:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = cut))


  • Fill color according to another variable:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument. The default behaviour is to stack bars on top of each other. The following code (you do not need to know the details now) shows the counts of clarity categories of the “Ideal” cut:

  diamonds %>% filter(cut == "Ideal") %>% select(clarity) %>% table()
## .
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##  146 2598 4282 5071 3589 2606 2047 1212

Note that the heights of the bar segments are proportional to these counts. (“I1” makes up too small a proportion to be easily seen in the chart.)

If you don’t want a stacked bar chart, you can use one of three other options: "identity", "dodge" or "fill".


  • position_identity() places each object exactly where it falls (that is, each bar starts from 0 and the bars are superimposed on each other):

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")

    position="identity" is a shorthand for position_identity().

  • position_dodge() arranges elements side by side:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

    This is like position_identity() spread over the x-axis.

  • position_fill() stacks elements on top of one another, like the default behaviour, but normalizes the height:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

  • position_stack() recovers the default plot:

    ggplot(data = diamonds) + 
      geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")


  • position_jitter() adds random noise to the x and y position of each element to avoid overplotting:

    ggplot(data = mpg) + 
      geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

    geom_jitter() is a shorthand for geom_point(position = "jitter").
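
The amount of noise can be controlled explicitly. The sketch below uses position_jitter() with width, height and seed arguments; the particular values (0.1, 0.5 and seed 1) are arbitrary choices for illustration.

```r
library(ggplot2)

jittered <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(
      width = 0.1,   # maximum horizontal shift, in x-axis units
      height = 0.5,  # maximum vertical shift, in y-axis units
      seed = 1       # fixes the random noise so the plot is reproducible
    )
  )

# Positions after jittering, as computed by ggplot2.
points_jittered <- layer_data(jittered)
range(points_jittered$x)

# Shorthand with the same noise amounts (but no fixed seed):
# ggplot(data = mpg) +
#   geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.1, height = 0.5)
```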

Coordinate systems


  • A boxplot:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot()


  • coord_cartesian() is the default Cartesian coordinate system; its xlim and ylim arguments zoom the plot:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_cartesian(xlim = c(0, 5))


  • coord_fixed() specifies aspect ratio:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_fixed(ratio = 1/2)


  • coord_flip() swaps the x- and y-axes:

    ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
      geom_boxplot() + 
      coord_flip()


  • A map:

    install.packages("maps")  # need to install this package
    library("maps")
    ## 
    ## Attaching package: 'maps'
    ## The following object is masked from 'package:purrr':
    ## 
    ##     map
    nz <- map_data("nz")
    
    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black")


  • coord_quickmap() sets the aspect ratio correctly for maps:

    ggplot(nz, aes(long, lat, group = group)) +
      geom_polygon(fill = "white", colour = "black") +
      coord_quickmap()


  • coord_polar() uses polar coordinates.

    bar <- ggplot(data = diamonds) + 
      geom_bar(
        mapping = aes(x = cut, fill = cut), 
        show.legend = FALSE,
        width = 1
      ) + 
      theme(aspect.ratio = 1) +
      labs(x = NULL, y = NULL)
    
    bar + coord_flip()

    bar + coord_polar()

Recap: layered grammar of graphics

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
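
As one possible instance of this template, every placeholder can be filled in with pieces already used in this chapter. The particular combination below is chosen only for illustration.

```r
library(ggplot2)

full_example <- ggplot(data = diamonds) +   # <DATA>
  geom_bar(                                 # <GEOM_FUNCTION>
    mapping = aes(x = cut, fill = clarity), # <MAPPINGS>
    stat = "count",                         # <STAT>
    position = "dodge"                      # <POSITION>
  ) +
  coord_flip() +                            # <COORDINATE_FUNCTION>
  facet_wrap(~ color)                       # <FACET_FUNCTION>

full_example
```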

Graphics for communications (ch. 28)

Title

  • Figure title should be descriptive:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      labs(title = "Fuel efficiency generally decreases with engine size")

Subtitle and caption

  • ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(color = class)) +
    geom_smooth(se = FALSE) + 
    labs(
      title = "Fuel efficiency generally decreases with engine size",
      subtitle = "Two seaters (sports cars) are an exception because of their light weight",
      caption = "Data from fueleconomy.gov"
    )

Axis labels

  • ggplot(mpg, aes(displ, hwy)) +
    geom_point(aes(colour = class)) +
    geom_smooth(se = FALSE) +
    labs(
      x = "Engine displacement (L)",
      y = "Highway fuel economy (mpg)"
    )

Math equations

  • df <- tibble(x = runif(10), y = runif(10))
    ggplot(df, aes(x, y)) + geom_point() +
      labs(
        x = quote(sum(x[i] ^ 2, i == 1, n)),
        y = quote(alpha + beta + frac(delta, theta))
      )

  • ?plotmath

Annotations

  • Create labels

    best_in_class <- mpg %>%
      group_by(class) %>%
      filter(row_number(desc(hwy)) == 1)
    best_in_class
    ##   manufacturer model             displ year cyl trans      drv cty hwy fl class
    ## 1 chevrolet    corvette          5.7   1999 8   manual(m6) r   16  26  p  2seater
    ## 2 dodge        caravan 2wd       2.4   1999 4   auto(l3)   f   18  24  r  minivan
    ## 3 nissan       altima            2.5   2008 4   manual(m6) f   23  32  r  midsize
    ## 4 subaru       forester awd      2.5   2008 4   manual(m5) 4   20  27  r  suv
    ## 5 toyota       toyota tacoma 4wd 2.7   2008 4   manual(m5) 4   17  22  r  pickup
    ## 6 volkswagen   jetta             1.9   1999 4   manual(m5) f   33  44  d  compact
    ## 7 volkswagen   new beetle        1.9   1999 4   manual(m5) f   35  44  d  subcompact

  • Annotate points

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point(aes(colour = class)) +
      geom_text(aes(label = model), data = best_in_class)


  • The ggrepel package automatically adjusts labels so that they don’t overlap:

    install.packages("ggrepel")
    library("ggrepel")
    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      geom_point(size = 3, shape = 1, data = best_in_class) +
      ggrepel::geom_label_repel(aes(label = model), data = best_in_class)

Scales

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class))

    automatically adds default scales, so it is equivalent to:

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) +
      scale_x_continuous() +
      scale_y_continuous() +
      scale_colour_discrete()

  • breaks

    ggplot(mpg, aes(displ, hwy)) +
      geom_point() +
      scale_y_continuous(breaks = seq(15, 40, by = 5))


  • labels

    ggplot(mpg, aes(displ, hwy)) +
      geom_point() +
      scale_x_continuous(labels = NULL) +
      scale_y_continuous(labels = NULL)


  • Plot y-axis at log scale:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      scale_y_log10()


  • Plot x-axis in reverse order:

    ggplot(mpg, aes(x = displ, y = hwy)) +
      geom_point() +
      scale_x_reverse()
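
The breaks and labels arguments can also be combined: labels accepts a function that turns break values into text. In the sketch below, mpg_label is a hypothetical helper written for this example, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: turn numeric break values into labels with a unit.
mpg_label <- function(breaks) paste(breaks, "mpg")

labelled <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    breaks = seq(15, 40, by = 5),  # where the ticks go
    labels = mpg_label             # how each tick is labelled
  )

labelled
```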

Legends

  • Set legend position: "left", "right", "top", "bottom" or "none":

    ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(colour = class)) + 
      theme(legend.position = "left")


Zooming

  • Without clipping (does not remove unseen data points)

    ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))


  • With clipping (removes unseen data points)

    ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      xlim(5, 7) + ylim(10, 30)


  • ggplot(mpg, mapping = aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth() +
      scale_x_continuous(limits = c(5, 7)) +
      scale_y_continuous(limits = c(10, 30))

  • mpg %>%
      filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
      ggplot(aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth()
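
The difference between the two approaches matters for geom_smooth(): with clipping, the smoother is fitted only to the points that remain. A rough sketch of how many cars fall inside the zoom window used above (variable names are illustrative):

```r
library(ggplot2)

# TRUE for cars inside the window displ in [5, 7] and hwy in [10, 30].
inside <- mpg$displ >= 5 & mpg$displ <= 7 &
  mpg$hwy >= 10 & mpg$hwy <= 30

n_all <- nrow(mpg)      # points used by the smoother with coord_cartesian()
n_inside <- sum(inside) # points left after xlim()/ylim() or filter()

c(all = n_all, inside = n_inside)
```

Since the clipped versions fit the curve to only a subset of the data, their smooth lines can look quite different from the zoomed-in view of the full fit.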

Themes

  • ggplot(mpg, aes(displ, hwy)) +
      geom_point(aes(color = class)) +
      geom_smooth(se = FALSE) +
      theme_bw()

The eight themes built-in to ggplot2.

Saving plots

ggplot(mpg, aes(displ, hwy)) + geom_point()

ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
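
By default ggsave() saves the last plot displayed, at the size of the current graphics device. A sketch of a more explicit call follows; the 6 x 4 inch size and the temporary output location are assumptions made for this example.

```r
library(ggplot2)

scatter <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output location: a temporary directory, so nothing is
# written into the working directory.
out_file <- file.path(tempdir(), "my-plot.pdf")

# Name the plot and the size explicitly instead of relying on defaults.
ggsave(out_file, plot = scatter, width = 6, height = 4, units = "in")

file.exists(out_file)
```

Passing plot, width and height makes the saved figure reproducible regardless of which plot was shown last or how large the plotting window is.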

More information on ggplot2

ggplot2 provides over 30 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://rstudio.com/resources/cheatsheets/.