Data Transformation

A typical data science project:

Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.

In this chapter we will learn how to transform data using dplyr, a part of the tidyverse, and a dataset on flights departing New York City in 2013.

Prerequisites

In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.

library(nycflights13)
library(tidyverse)

nycflights13 data

  • Available from the nycflights13 package.

  • 336,776 flights that departed from New York City in 2013:

head(flights)
dim(flights)
## [1] 336776     19

dplyr basics

  • Pick observations by their values: filter().

  • Reorder the rows: arrange().

  • Pick variables by their names: select().

  • Create new variables with functions of existing variables: mutate().

  • Collapse many values down to a single summary: summarise().

All verbs work similarly:

  1. The first argument is a data frame.

  2. The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).

  3. The result is a new data frame.

Filter rows with filter()

  • Flights on January 1:
    filter(flights, month == 1, day == 1)

  • To save the result,

    jan1 <- filter(flights, month == 1, day == 1)

Comparison operators

>, >=, <, <=, != (not equal), and == (equal).

filter(flights, month = 1)  # the easiest mistake
## Error: Problem with `filter()` input `..1`.
## x Input `..1` is named.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 1`?

Be extremely cautious when comparing floating point numbers (computer representation of real numbers):

sqrt(2) ^ 2 == 2
## [1] FALSE
1 / 49 * 49 == 1
## [1] FALSE

Computers use finite precision arithmetic. Every number you see is an approximation.

Instead of relying on ==, use near():

near(sqrt(2) ^ 2,  2)
## [1] TRUE
near(1 / 49 * 49, 1)
## [1] TRUE

Logical (boolean) operators

Complete set of boolean operations. `x` is the left-hand circle, `y` is the right-hand circle, and the shaded region show which parts each operator selects.

Complete set of boolean operations. x is the left-hand circle, y is the right-hand circle, and the shaded region show which parts each operator selects.

  • Flights in Nov or Dec:
    flightsNovDec <- filter(flights, month == 11 | month == 12)
    head(flightsNovDec)
  • More complex logical expressions:
    filter(flights, !(arr_delay > 120 | dep_delay > 120))
    filter(flights, arr_delay <= 120, dep_delay <= 120)

Whenever you start using complicated, multipart expressions in filter(), consider making them explicit variables instead. That makes it much easier to check your work.

Caution As well as & and |, R also has && and ||. Don’t use them here!

Missing values

  • NA (not available) represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknowns:
NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA

NA == NA???

NA == NA
## [1] NA

It’s easiest to understand why this is true with a bit more context:

# Let x be Mary's age. We don't know how old she is.
x <- NA

# Let y be John's age. We don't know how old he is.
y <- NA

# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!

If you want to determine if a value is missing, use is.na():

is.na(x)
## [1] TRUE

filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values.

df <- tibble(x = c(1, NA, 3))  # tibble is a data frame
filter(df, x > 1)
filter(df, is.na(x) | x > 1)

Arrange rows with arrange()

arrange() changes the order of rows. It takes a data frame and a set of column names (or more complicated expressions) to order by.

  • Sort in ascending order:
    tmp <- arrange(flights, year, month, day)
    head(tmp)

  • Sort in descending order:
    tmp <- arrange(flights, desc(arr_delay))
    head(tmp)

  • Missing values are always sorted at the end:
    df <- tibble(x = c(5, 2, NA))
    arrange(df, x)
    arrange(df, desc(x))

Select columns with select()

It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in.

  • Select columns by variable name:
    tmp <- select(flights, year, month, day)
    head(tmp)

  • Select columns between two variables:
    tmp <- select(flights, year:day)
    head(tmp)

  • Select columns except those between two variables:
    tmp <- select(flights, -(year:day))
    head(tmp)

  • Select columns by positions:
    tmp <- select(flights, seq(1, 10, by = 2))
    head(tmp)

  • Move variables to the start of data frame:
    tmp <- select(flights, time_hour, air_time, everything())  # all variables
    head(tmp)

  • Rename a variable:
    tmp <- rename(flights, tail_num = tailnum)
    head(tmp)

  • Helper functions.

    • starts_with("abc"): matches names that begin with “abc”.

    • ends_with("xyz"): matches names that end with “xyz”.

    • contains("ijk"): matches names that contain “ijk”.

    • matches("(.)\\1"): selects variables that match a regular expression (more details later).

    • num_range("x", 1:3): matches x1, x2 and x3.

Add new variables with mutate()

  • Add columns (variables) gain and speed:
    flights_sml <- 
      select(flights, year:day, ends_with("delay"), distance, air_time)
    head(flights_sml)
    tmp <- mutate(flights_sml, gain = arr_delay - dep_delay, 
           speed = distance / air_time * 60 )
    head(tmp)
  • Can refer to columns that you’ve just created:
    tmp <- mutate(flights_sml, gain = arr_delay - dep_delay,
      hours = air_time / 60, gain_per_hour = gain / hours )
    head(tmp)

  • Only keep the new variables by transmute():
    tmp <- transmute(flights, gain = arr_delay - dep_delay,
      hours = air_time / 60, gain_per_hour = gain / hours )
    head(tmp)

Useful creation functions

  • Arithmetic operators: +, -, *, /, ^: air_time / 60, hours * 60 + minute, x / sum(x), y - mean(y), etc.

  • Modular arithmetic: %/% (integer division) and %% (remainder).

    tmp <- transmute(flights,
      dep_time,
      hour = dep_time %/% 100,  # why divide by 100?
      minute = dep_time %% 100
    )
    head(tmp)
  • Logs: log(), log2(), log10().

  • Offsets: lead() and lag(). Leading or lagging values.

    (x <- 1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
    lag(x)
##  [1] NA  1  2  3  4  5  6  7  8  9
    lead(x)
##  [1]  2  3  4  5  6  7  8  9 10 NA
  • Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(), cummax(),
    and dplyr::cummean().
    x
##  [1]  1  2  3  4  5  6  7  8  9 10
    cumsum(x)
##  [1]  1  3  6 10 15 21 28 36 45 55
    cummean(x)
##  [1] 1.000000 1.000000 1.333333 1.750000 2.200000 2.666667 3.142857 3.625000
##  [9] 4.111111 4.600000
  • Logical comparisons, <, <=, >, >=, !=.

  • Ranking: start with min_rank() (there are other ranking functions)

    y <- c(1, 2, 2, NA, 3, 4)
    min_rank(y)
## [1]  1  2  2 NA  4  5
    min_rank(desc(y))
## [1]  5  3  3 NA  2  1

Grouped summaries with summarise()

  • Mean of a variable:
    summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

  • Useful if paired with group_by():
    by_day <- group_by(flights, year, month, day)
    head(by_day) # grouping doesn't change how the data looks 
    # It changes how it acts with the other dplyr verbs
    summarise(by_day, delay = mean(dep_delay, na.rm = TRUE)) 
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

Pipe

  • Consider following analysis:
    by_dest <- group_by(flights, dest)
    delay <- summarise(by_dest, 
      count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) )
    (delay <- filter(delay, count > 20, dest != "HNL"))

  • Cleaner code using pipe %>%:
    (
     delays <- flights %>% 
      group_by(dest) %>% 
      summarise(
        count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>% 
      filter(count > 20, dest != "HNL")
    )
## `summarise()` ungrouping output (override with `.groups` argument)

  • Unfortunately ggplot2 does not support pipe.
    ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
      geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Other summary functions

  • Location: mean(x), median(x).
    not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        avg_delay1 = mean(arr_delay),
        avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
      )
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

  • Spread: sd(x), IQR(x), mad(x).
    not_cancelled %>% 
      group_by(dest) %>% 
      summarise(distance_sd = sd(distance)) %>% 
      arrange(desc(distance_sd))
## `summarise()` ungrouping output (override with `.groups` argument)

  • Rank: min(x), quantile(x, 0.25), max(x).
    # When do the first and last flights leave each day?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        first = min(dep_time),
        last = max(dep_time)
      )
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

  • Position: first(x), nth(x, 2), last(x).
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(
        first_dep = first(dep_time), 
        last_dep = last(dep_time)
      )
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

  • Count: n(x), sum(!is.na(x)), n_distinct(x).
    # Which destinations have the most carriers?
    not_cancelled %>% 
      group_by(dest) %>% 
      summarise(carriers = n_distinct(carrier)) %>% 
      arrange(desc(carriers))
## `summarise()` ungrouping output (override with `.groups` argument)

    not_cancelled %>% 
      count(dest)

    not_cancelled %>% 
      count(tailnum, wt = distance) # total number of miles a plane flew:

    # How many flights left before 5am? (these usually indicate delayed
    # flights from the previous day)
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(n_early = sum(dep_time < 500))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

    # What proportion of flights are delayed by more than an hour?
    not_cancelled %>% 
      group_by(year, month, day) %>% 
      summarise(hour_perc = mean(arr_delay > 60))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)

Grouped filters

Combination of group_by() with filter() or mutate().

  • Find the worst members of each group:
    flights_sml %>% 
      group_by(year, month, day) %>%
      filter(rank(desc(arr_delay)) < 10)

  • Find all groups bigger than a threshold:
    popular_dests <- flights %>% 
      group_by(dest) %>% 
      filter(n() > 10000 )
    dim(popular_dests)
## [1] 131440     19
    head(popular_dests)