A typical data science project:
Visualisation is an important tool for insight generation, but it is rare that you get the data in exactly the right form you need. Often you’ll need to create some new variables or summaries, or maybe you just want to rename the variables or reorder the observations in order to make the data a little easier to work with.
In this chapter we will learn how to transform data using dplyr, a part of the tidyverse, and a dataset on flights departing New York City in 2013.
In this chapter we’re going to focus on how to use the dplyr package, another core member of the tidyverse. We’ll illustrate the key ideas using data from the nycflights13 package, and use ggplot2 to help us understand the data.
library(nycflights13)
library(tidyverse)
Available from the nycflights13 package.
336,776 flights that departed from New York City in 2013:
head(flights)
dim(flights)
## [1] 336776 19
Pick observations by their values: filter()
.
Reorder the rows: arrange()
.
Pick variables by their names: select()
.
Create new variables with functions of existing variables: mutate()
.
Collapse many values down to a single summary: summarise()
.
All verbs work similarly:
The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using the variable names (without quotes).
The result is a new data frame.
filter()
filter(flights, month == 1, day == 1)
To save the result,
jan1 <- filter(flights, month == 1, day == 1)
>
, >=
, <
, <=
, !=
(not equal), and ==
(equal).
filter(flights, month = 1) # the easiest mistake
## Error: Problem with `filter()` input `..1`.
## x Input `..1` is named.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `month == 1`?
Be extremely cautious when comparing floating point numbers (computer representation of real numbers):
sqrt(2) ^ 2 == 2
## [1] FALSE
1 / 49 * 49 == 1
## [1] FALSE
Computers use finite precision arithmetic. Every number you see is an approximation.
Instead of relying on ==
, use near()
:
near(sqrt(2) ^ 2, 2)
## [1] TRUE
near(1 / 49 * 49, 1)
## [1] TRUE
flightsNovDec <- filter(flights, month == 11 | month == 12)
head(flightsNovDec)
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
Whenever you start using complicated, multipart expressions in filter()
, consider making them explicit variables instead. That makes it much easier to check your work.
Caution As well as &
and |
, R also has &&
and ||
. Don’t use them here!
NA
(not available) represents an unknown value so missing values are “contagious”: almost any operation involving an unknown value will also be unknowns:NA > 5
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
NA == NA
???NA == NA
## [1] NA
It’s easiest to understand why this is true with a bit more context:
# Let x be Mary's age. We don't know how old she is.
x <- NA
# Let y be John's age. We don't know how old he is.
y <- NA
# Are John and Mary the same age?
x == y
## [1] NA
# We don't know!
If you want to determine if a value is missing, use is.na()
:
is.na(x)
## [1] TRUE
filter()
only includes rows where the condition is TRUE
; it excludes both FALSE
and NA
values.
df <- tibble(x = c(1, NA, 3)) # tibble is a data frame
filter(df, x > 1)
filter(df, is.na(x) | x > 1)
arrange()
arrange()
changes the order of rows. It takes a data frame and a set of column names (or more complicated expressions) to order by.
tmp <- arrange(flights, year, month, day)
head(tmp)
tmp <- arrange(flights, desc(arr_delay))
head(tmp)
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
select()
It’s not uncommon to get datasets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in.
tmp <- select(flights, year, month, day)
head(tmp)
tmp <- select(flights, year:day)
head(tmp)
tmp <- select(flights, -(year:day))
head(tmp)
tmp <- select(flights, seq(1, 10, by = 2))
head(tmp)
tmp <- select(flights, time_hour, air_time, everything()) # all variables
head(tmp)
tmp <- rename(flights, tail_num = tailnum)
head(tmp)
Helper functions.
starts_with("abc")
: matches names that begin with “abc”.
ends_with("xyz")
: matches names that end with “xyz”.
contains("ijk")
: matches names that contain “ijk”.
matches("(.)\\1")
: selects variables that match a regular expression (more details later).
num_range("x", 1:3)
: matches x1, x2 and x3.
mutate()
gain
and speed
: flights_sml <-
select(flights, year:day, ends_with("delay"), distance, air_time)
head(flights_sml)
tmp <- mutate(flights_sml, gain = arr_delay - dep_delay,
speed = distance / air_time * 60 )
head(tmp)
tmp <- mutate(flights_sml, gain = arr_delay - dep_delay,
hours = air_time / 60, gain_per_hour = gain / hours )
head(tmp)
transmute()
: tmp <- transmute(flights, gain = arr_delay - dep_delay,
hours = air_time / 60, gain_per_hour = gain / hours )
head(tmp)
Arithmetic operators: +
, -
, *
, /
, ^
: air_time / 60
, hours * 60 + minute
, x / sum(x)
, y - mean(y)
, etc.
Modular arithmetic: %/%
(integer division) and %%
(remainder).
tmp <- transmute(flights,
dep_time,
hour = dep_time %/% 100, # why divide by 100?
minute = dep_time %% 100
)
head(tmp)
Logs: log()
, log2()
, log10()
.
Offsets: lead()
and lag()
. Leading or lagging values.
(x <- 1:10)
## [1] 1 2 3 4 5 6 7 8 9 10
lag(x)
## [1] NA 1 2 3 4 5 6 7 8 9
lead(x)
## [1] 2 3 4 5 6 7 8 9 10 NA
cumsum()
, cumprod()
, cummin()
, cummax()
,dplyr::cummean()
. x
## [1] 1 2 3 4 5 6 7 8 9 10
cumsum(x)
## [1] 1 3 6 10 15 21 28 36 45 55
cummean(x)
## [1] 1.000000 1.000000 1.333333 1.750000 2.200000 2.666667 3.142857 3.625000
## [9] 4.111111 4.600000
Logical comparisons, <
, <=
, >
, >=
, !=
.
Ranking: start with min_rank()
(there are other ranking functions)
y <- c(1, 2, 2, NA, 3, 4)
min_rank(y)
## [1] 1 2 2 NA 4 5
min_rank(desc(y))
## [1] 5 3 3 NA 2 1
summarise()
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
by_day <- group_by(flights, year, month, day)
head(by_day) # grouping doesn't change how the data looks
# It changes how it acts with the other dplyr verbs
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) )
(delay <- filter(delay, count > 20, dest != "HNL"))
%>%
: (
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE) ) %>%
filter(count > 20, dest != "HNL")
)
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data = delays, mapping = aes(x = dist, y = delay)) +
geom_point(aes(size = count), alpha = 1/3) + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
mean(x)
, median(x)
. not_cancelled <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
sd(x)
, IQR(x)
, mad(x)
. not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
## `summarise()` ungrouping output (override with `.groups` argument)
min(x)
, quantile(x, 0.25)
, max(x)
. # When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
first(x)
, nth(x, 2)
, last(x)
. not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
n(x)
, sum(!is.na(x))
, n_distinct(x)
. # Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## `summarise()` ungrouping output (override with `.groups` argument)
not_cancelled %>%
count(dest)
not_cancelled %>%
count(tailnum, wt = distance) # total number of miles a plane flew:
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_perc = mean(arr_delay > 60))
## `summarise()` regrouping output by 'year', 'month' (override with `.groups` argument)
Combination of group_by()
with filter()
or mutate()
.
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 10000 )
dim(popular_dests)
## [1] 131440 19
head(popular_dests)