Lecture 10: importance of reducing duplication in your code by creating functions instead of copying-and-pasting.
Recall:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Another tool: iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.
Two important iteration paradigms: - Imperative programming: loops - Functional programming: maps
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
Compute the median of each column:
median(df$a)
## [1] -0.4711801
median(df$b)
## [1] 0.08767172
median(df$c)
## [1] 0.1328946
median(df$d)
## [1] -0.1350515
This breaks our rule of thumb: never copy and paste more than twice. Use a for
loop:
output <- vector("double", ncol(df)) # 1. output
for (i in seq_along(df)) { # 2. sequence
output[[i]] <- median(df[[i]]) # 3. body
}
output
## [1] -0.47118007 0.08767172 0.13289461 -0.13505154
Three components of the for
loop:
Output: output <- vector("double", length(x))
. Allocate sufficient space for the output before you start the loop.
The vector()
function creates an empty vector of given length.
Sequence: i in seq_along(df)
. Determines what to loop over: each run of the for loop will assign i
to a different value from seq_along(df)
.
Always use seq_along()
instead of 1:length(l)
:
y <- vector("double", 0)
seq_along(y)
## integer(0)
1:length(y)
## [1] 1 0
Body: output[[i]] <- median(df[[i]])
. Code that does the work. Run repeatedly, each time with a different value for i
. The first iteration will run output[[1]] <- median(df[[1]])
, the second will run output[[2]] <- median(df[[2]])
, and so on.
Recall:
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
Output: we already have the output — it’s the same as the input!
Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df)
.
Body: apply rescale01()
.
for (i in seq_along(df)) {
df[[i]] <- rescale01(df[[i]])
}
Always try to use [[
, not [
all for loops: it makes it clear that I want to work with a single element.
Looping over the numeric indices with for (i in seq_along(xs))
, and extracting the value with x[[i]]
(learned above).
Loop over the elements: for (x in xs)
. Most useful if you only care about side-effects, like plotting or saving a file.
Loop over the names: for (nm in names(xs))
, and access the value with x[[nm]]
.
Useful if you want to use the name in a plot title or a file name.
results <- vector("list", length(x))
names(results) <- names(x)
Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:
for (i in seq_along(x)) {
name <- names(x)[[i]]
value <- x[[i]]
}
Suppose you want to simulate some random vectors of random lengths.
means <- c(0, 1, 2)
output <- double()
for (i in seq_along(means)) {
n <- sample(100, 1)
# progressively growing the vector `output`
output <- c(output, rnorm(n, means[[i]]))
}
str(output)
## num [1:66] 1.659 -0.583 0.553 -0.807 -0.133 ...
Better solution:
out <- vector("list", length(means))
for (i in seq_along(means)) {
n <- sample(100, 1)
out[[i]] <- rnorm(n, means[[i]])
}
str(out)
## List of 3
## $ : num [1:39] -0.381 -0.659 1.699 -0.677 -0.527 ...
## $ : num [1:9] 2.11 2.18 1.48 2.83 1.58 ...
## $ : num [1:79] 3.764 1.51 0.107 2.468 2.772 ...
str(unlist(out))
## num [1:127] -0.381 -0.659 1.699 -0.677 -0.527 ...
unlist()
flattens a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl()
.
A common pattern:
When generating a long string. Instead of paste()
ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "")
.
When generating a big data frame. Instead of sequentially rbind()
ing in each iteration, save the output in a list, then use dplyr::bind_rows(output)
to combine the output into a single data frame.
Whenever you see it, switch to a more complex result object, and then combine in one step at the end.
while
loopSuppose you want to loop until you get three heads in a row. You can’t do that sort of iteration with the for
loop. Instead, use the while
loop:
while (condition) {
# body
}
In fact you can rewrite any for loop as a while loop, but not the other way around:
for (i in seq_along(x)) {
# body
}
# Equivalent to
i <- 1
while (i <= length(x)) {
# body
i <- i + 1
}
The coin tossing simulation:
flip <- function() sample(c("T", "H"), 1)
flips <- 0
nheads <- 0
while (nheads < 3) {
if (flip() == "H") {
nheads <- nheads + 1
} else {
nheads <- 0
}
flips <- flips + 1
}
flips
## [1] 13
df <- tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
Compute the mean of every column by writing a function:
col_mean <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- mean(df[[i]])
}
output
}
What about the median and the standard deviation?
col_median <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- median(df[[i]])
}
output
}
col_sd <- function(df) {
output <- vector("double", length(df))
for (i in seq_along(df)) {
output[i] <- sd(df[[i]])
}
output
}
You’ve copied-and-pasted this code twice!
Actually you can pass a function to a function as an argument:
col_summary <- function(df, fun) {
out <- vector("double", length(df))
for (i in seq_along(df)) {
out[i] <- fun(df[[i]])
}
out
}
col_summary(df, mean)
## [1] -0.037674120 -0.263087064 -0.391312333 0.003005369
col_summary(df, median)
## [1] -0.03374963 -0.49921281 -0.39294445 0.31108586
col_summary(df, sd)
## [1] 0.8492516 0.6535011 0.8738210 1.2504671
We used the purrr package in the tidyverse:
The pattern:
purrr::map()
makes a list.purrr::map_lgl()
makes a logical vector.purrr::map_int()
makes an integer vector.purrr::map_dbl()
makes a double vector.purrr::map_chr()
makes a character vector.Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input.
A map_*()
version of col_summary()
:
map_dbl(df, mean)
## a b c d
## -0.037674120 -0.263087064 -0.391312333 0.003005369
map_dbl(df, median)
## a b c d
## -0.03374963 -0.49921281 -0.39294445 0.31108586
map_dbl(df, sd)
## a b c d
## 0.8492516 0.6535011 0.8738210 1.2504671
Compared to using a for loop, focus is on the operation being performed (i.e. mean()
, median()
, sd()
), not the bookkeeping required to loop over every element and store the output.
With pipes:
df %>% map_dbl(mean)
## a b c d
## -0.037674120 -0.263087064 -0.391312333 0.003005369
df %>% map_dbl(median)
## a b c d
## -0.03374963 -0.49921281 -0.39294445 0.31108586
df %>% map_dbl(sd)
## a b c d
## 0.8492516 0.6535011 0.8738210 1.2504671
Splits up the mtcars
dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:
models <- mtcars %>%
split(.$cyl) %>%
map(function(df) lm(mpg ~ wt, data = df))
Shortcut:
models <- mtcars %>%
split(.$cyl) %>%
map(~lm(mpg ~ wt, data = .))
The .
refers to the current list element .
Extract a summary statistic:
models %>%
map(summary) %>%
map_dbl(~.$r.squared)
## 4 6 8
## 0.5086326 0.4645102 0.4229655
Shortcut:
models %>%
map(summary) %>%
map_dbl("r.squared")
## 4 6 8
## 0.5086326 0.4645102 0.4229655
Select elements by position:
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2) # figure out why
## [1] 2 5 8
Base R has *apply()
functions:
lapply()
is basically identical to map()
, except that map()
is consistent with all the other functions in purrr, and you can use the shortcuts for functional arguments.
Base sapply()
is a wrapper around lapply()
that automatically simplifies the output. This is useful for interactive work but is problematic in a function because you never know what sort of output you’ll get:
x1 <- list(
c(0.27, 0.37, 0.57, 0.91, 0.20),
c(0.90, 0.94, 0.66, 0.63, 0.06),
c(0.21, 0.18, 0.69, 0.38, 0.77)
)
x2 <- list(
c(0.50, 0.72, 0.99, 0.38, 0.78),
c(0.93, 0.21, 0.65, 0.13, 0.27),
c(0.39, 0.01, 0.38, 0.87, 0.34)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
x1 %>% sapply(threshold) %>% str()
## List of 3
## $ : num 0.91
## $ : num [1:2] 0.9 0.94
## $ : num(0)
x2 %>% sapply(threshold) %>% str()
## num [1:3] 0.99 0.93 0.87
vapply()
is a safe alternative to sapply()
because you supply an additional argument that defines the type: vapply(df, is.numeric, logical(1))
is equivalent to map_lgl(df, is.numeric)
. vapply()
can also produce matrices — purrr’s map functions only ever produce vectors.
When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output.
safely()
provides a convenient error handling:
safe_log <- safely(log)
str(safe_log(10))
## List of 2
## $ result: num 2.3
## $ error : NULL
str(safe_log("a"))
## List of 2
## $ result: NULL
## $ error :List of 2
## ..$ message: chr "non-numeric argument to mathematical function"
## ..$ call : language .Primitive("log")(x, base)
## ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
safely()
is designed to work with map:
x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
## List of 3
## $ :List of 2
## ..$ result: num 0
## ..$ error : NULL
## $ :List of 2
## ..$ result: num 2.3
## ..$ error : NULL
## $ :List of 2
## ..$ result: NULL
## ..$ error :List of 2
## .. ..$ message: chr "non-numeric argument to mathematical function"
## .. ..$ call : language .Primitive("log")(x, base)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Transpose:
y <- y %>% purrr::transpose()
str(y)
## List of 2
## $ result:List of 3
## ..$ : num 0
## ..$ : num 2.3
## ..$ : NULL
## $ error :List of 3
## ..$ : NULL
## ..$ : NULL
## ..$ :List of 2
## .. ..$ message: chr "non-numeric argument to mathematical function"
## .. ..$ call : language .Primitive("log")(x, base)
## .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"
Rescue the successes:
is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
## [[1]]
## [1] "a"
y$result[is_ok] %>% flatten_dbl()
## [1] 0.000000 2.302585
Other options:
possibly()
: you give it a default value to return when there is an error.
x <- list(1, 10, "a")
x %>% map_dbl(possibly(log, NA_real_))
## [1] 0.000000 2.302585 NA
quietly()
performs a similar role to safely()
, but instead of capturing errors, it captures printed output, messages, and warnings:
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()
## List of 2
## $ :List of 4
## ..$ result : num 0
## ..$ output : chr ""
## ..$ warnings: chr(0)
## ..$ messages: chr(0)
## $ :List of 4
## ..$ result : num NaN
## ..$ output : chr ""
## ..$ warnings: chr "NaNs produced"
## ..$ messages: chr(0)
Suppose you want to simulate some random normals with different means.
Fixed standard deviation: use map()
:
mu <- list(5, 10, -3)
mu %>%
map(rnorm, n = 5) %>%
str()
## List of 3
## $ : num [1:5] 5.1 3.26 4.94 4.32 5.29
## $ : num [1:5] 8.93 10.2 10.65 9.63 9.36
## $ : num [1:5] -4.699 -3.136 -3.524 -2.733 -0.685
What if you also want to vary the standard deviation? Use map2()
which iterates over two vectors in parallel:
sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 5) %>% str()
## List of 3
## $ : num [1:5] 6.02 3.96 6.21 4.41 4.6
## $ : num [1:5] 7.34 10.75 4.58 10.92 16.41
## $ : num [1:5] -23.31 5.28 -22.32 3.12 -9.5
In a for
loop:
map2 <- function(x, y, f, ...) {
out <- vector("list", length(x))
for (i in seq_along(x)) {
out[[i]] <- f(x[[i]], y[[i]], ...)
}
out
}
More than two arguments, use pmap()
, which takes a list of arguments.
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>%
pmap(rnorm) %>%
str()
## List of 3
## $ : num 4.7
## $ : num [1:3] 16.78 8.58 8.25
## $ : num [1:5] -2.72 -4.32 -18.57 -7.04 -1.13
With names:
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>%
pmap(rnorm) %>%
str()
Since the arguments are all the same length, it makes sense to store them in a data frame:
params <- tribble(
~mean, ~sd, ~n,
5, 1, 1,
10, 5, 3,
-3, 10, 5
)
params %>%
pmap(rnorm)
## [[1]]
## [1] 6.516726
##
## [[2]]
## [1] 4.042988 11.213530 10.388460
##
## [[3]]
## [1] 1.1015247 10.8843908 -0.8897356 2.9763124 -3.4160460
You might also vary the function itself:
f <- c("runif", "rnorm", "rpois")
param <- list(
list(min = -1, max = 1),
list(sd = 5),
list(lambda = 10)
)
Use invoke_map()
:
invoke_map(f, param, n = 5) %>% str()
## List of 3
## $ : num [1:5] 0.821 0.303 -0.425 -0.99 0.541
## $ : num [1:5] -6.07 -4.89 -4.23 -4.09 2.75
## $ : int [1:5] 6 7 15 10 7
Again, you can use tribble()
: to make creating these matching pairs a little easier:
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = 1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sim = invoke_map(f, params, n = 10))
If you want to call a function for its side effects, rather than for its return value, use walk()
.
Typical application: render output to the screen or save files to disk
x <- list(1, "a", 3)
x %>%
walk(print)
## [1] 1
## [1] "a"
## [1] 3
There are also walk2()
or pwalk()
.
library(ggplot2)
plots <- mtcars %>%
split(.$cyl) %>%
map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")
pwalk(list(paths, plots), ggsave, path = tempdir())
## Saving 5 x 3.5 in image
## Saving 5 x 3.5 in image
## Saving 5 x 3.5 in image
walk()
, walk2()
and pwalk()
all invisibly return .x
, the first argument. This makes them suitable for use in the middle of pipelines.
The reduce function takes a “binary” function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left:
Reduce a list of data frames to a single data frame by joining the elements together:
dfs <- list(
age = tibble(name = "John", age = 30),
sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
trt = tibble(name = "Mary", treatment = "A")
)
dfs %>% reduce(full_join)
## Joining, by = "name"
## Joining, by = "name"
Find the intersection within a list of vectors:
vs <- list(
c(1, 3, 5, 6, 10),
c(1, 2, 3, 7, 8, 10),
c(1, 2, 3, 4, 8, 9, 10)
)
vs %>% reduce(intersect)
## [1] 1 3 10
Accumulate is similar but it keeps all the interim results. You could use it to implement a cumulative sum:
x <- sample(10)
x
## [1] 9 8 6 5 1 10 3 7 4 2
x %>% accumulate(`+`)
## [1] 9 17 23 28 29 39 42 49 53 55
But there is base::cumsum()
:
cumsum(x)
## [1] 9 17 23 28 29 39 42 49 53 55