Iteration

Lecture 10: importance of reducing duplication in your code by creating functions instead of copying-and-pasting.

Recall:

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Another tool: iteration helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

Two important iteration paradigms: - Imperative programming: loops - Functional programming: maps

For loops

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

Compute the median of each column:

median(df$a)
## [1] -0.4711801
median(df$b)
## [1] 0.08767172
median(df$c)
## [1] 0.1328946
median(df$d)
## [1] -0.1350515

This breaks our rule of thumb: never copy and paste more than twice. Use a for loop:

output <- vector("double", ncol(df))  # 1. output
for (i in seq_along(df)) {            # 2. sequence
  output[[i]] <- median(df[[i]])      # 3. body
}
output
## [1] -0.47118007  0.08767172  0.13289461 -0.13505154

Three components of the for loop:

  1. Output: output <- vector("double", length(x)). Allocate sufficient space for the output before you start the loop.

    The vector() function creates an empty vector of given length.

  2. Sequence: i in seq_along(df). Determines what to loop over: each run of the for loop will assign i to a different value from seq_along(df).

    Always use seq_along() instead of 1:length(l):

    y <- vector("double", 0)
    seq_along(y)
    ## integer(0)
    1:length(y)
    ## [1] 1 0
  3. Body: output[[i]] <- median(df[[i]]). Code that does the work. Run repeatedly, each time with a different value for i. The first iteration will run output[[1]] <- median(df[[1]]), the second will run output[[2]] <- median(df[[2]]), and so on.

For loop variations

  1. Modifying an existing object, instead of creating a new object.
  2. Looping over names or values, instead of indices.
  3. Handling outputs of unknown length.
  4. Handling sequences of unknown length.

Modifying an existing object

Recall:

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
  1. Output: we already have the output — it’s the same as the input!

  2. Sequence: we can think about a data frame as a list of columns, so we can iterate over each column with seq_along(df).

  3. Body: apply rescale01().

for (i in seq_along(df)) {
  df[[i]] <- rescale01(df[[i]])
}

Always try to use [[, not [ all for loops: it makes it clear that I want to work with a single element.

Looping patterns

  1. Looping over the numeric indices with for (i in seq_along(xs)), and extracting the value with x[[i]] (learned above).

  2. Loop over the elements: for (x in xs). Most useful if you only care about side-effects, like plotting or saving a file.

  3. Loop over the names: for (nm in names(xs)), and access the value with x[[nm]].
    Useful if you want to use the name in a plot title or a file name.

    results <- vector("list", length(x))
    names(results) <- names(x)

Iteration over the numeric indices is the most general form, because given the position you can extract both the name and the value:

for (i in seq_along(x)) {
  name <- names(x)[[i]]
  value <- x[[i]]
}

Unknown output length

Suppose you want to simulate some random vectors of random lengths.

means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
  n <- sample(100, 1)
  # progressively growing the vector `output`
  output <- c(output, rnorm(n, means[[i]]))
}
str(output)
##  num [1:66] 1.659 -0.583 0.553 -0.807 -0.133 ...
  • This is not very efficient: in each iteration, R has to copy all the data from the previous iterations.
  • You get “quadratic” (\(O(n^2)\)) behaviour which means that a loop with three times as many elements would take nine (\(3^2\)) times as long to run.

Better solution:

out <- vector("list", length(means))
for (i in seq_along(means)) {
  n <- sample(100, 1)
  out[[i]] <- rnorm(n, means[[i]])
}
str(out)
## List of 3
##  $ : num [1:39] -0.381 -0.659 1.699 -0.677 -0.527 ...
##  $ : num [1:9] 2.11 2.18 1.48 2.83 1.58 ...
##  $ : num [1:79] 3.764 1.51 0.107 2.468 2.772 ...
str(unlist(out))
##  num [1:127] -0.381 -0.659 1.699 -0.677 -0.527 ...

unlist() flattens a list of vectors into a single vector. A stricter option is to use purrr::flatten_dbl().

A common pattern:

  1. When generating a long string. Instead of paste()ing together each iteration with the previous, save the output in a character vector and then combine that vector into a single string with paste(output, collapse = "").

  2. When generating a big data frame. Instead of sequentially rbind()ing in each iteration, save the output in a list, then use dplyr::bind_rows(output) to combine the output into a single data frame.

Whenever you see it, switch to a more complex result object, and then combine in one step at the end.

Unknown sequence length: while loop

Suppose you want to loop until you get three heads in a row. You can’t do that sort of iteration with the for loop. Instead, use the while loop:

while (condition) {
  # body
}

In fact you can rewrite any for loop as a while loop, but not the other way around:

for (i in seq_along(x)) {
  # body
}

# Equivalent to
i <- 1
while (i <= length(x)) {
  # body
  i <- i + 1 
}

The coin tossing simulation:

flip <- function() sample(c("T", "H"), 1)

flips <- 0
nheads <- 0

while (nheads < 3) {
  if (flip() == "H") {
    nheads <- nheads + 1
  } else {
    nheads <- 0
  }
  flips <- flips + 1
}
flips
## [1] 13

For loops vs. functionals

df <- tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

Compute the mean of every column by writing a function:

col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- mean(df[[i]])
  }
  output
}

What about the median and the standard deviation?

col_median <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- median(df[[i]])
  }
  output
}
col_sd <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[i] <- sd(df[[i]])
  }
  output
}

You’ve copied-and-pasted this code twice!

Actually you can pass a function to a function as an argument:

col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}
col_summary(df, mean)
## [1] -0.037674120 -0.263087064 -0.391312333  0.003005369
col_summary(df, median)
## [1] -0.03374963 -0.49921281 -0.39294445  0.31108586
col_summary(df, sd)
## [1] 0.8492516 0.6535011 0.8738210 1.2504671

The map functions

We used the purrr package in the tidyverse:

The pattern:

  1. Loop over a vector.
  2. Do something to each element.
  3. Save the results.

Each function takes a vector as input, applies a function to each piece, and then returns a new vector that’s the same length (and has the same names) as the input.

A map_*() version of col_summary():

map_dbl(df, mean)
##            a            b            c            d 
## -0.037674120 -0.263087064 -0.391312333  0.003005369
map_dbl(df, median)
##           a           b           c           d 
## -0.03374963 -0.49921281 -0.39294445  0.31108586
map_dbl(df, sd)
##         a         b         c         d 
## 0.8492516 0.6535011 0.8738210 1.2504671

Compared to using a for loop, focus is on the operation being performed (i.e. mean(), median(), sd()), not the bookkeeping required to loop over every element and store the output.

With pipes:

df %>% map_dbl(mean)
##            a            b            c            d 
## -0.037674120 -0.263087064 -0.391312333  0.003005369
df %>% map_dbl(median)
##           a           b           c           d 
## -0.03374963 -0.49921281 -0.39294445  0.31108586
df %>% map_dbl(sd)
##         a         b         c         d 
## 0.8492516 0.6535011 0.8738210 1.2504671

Shortcuts

Anonymous function

Splits up the mtcars dataset into three pieces (one for each value of cylinder) and fits the same linear model to each piece:

models <- mtcars %>% 
  split(.$cyl) %>% 
  map(function(df) lm(mpg ~ wt, data = df))

Shortcut:

models <- mtcars %>% 
  split(.$cyl) %>% 
  map(~lm(mpg ~ wt, data = .))

The . refers to the current list element .

Extract a summary statistic:

models %>% 
  map(summary) %>% 
  map_dbl(~.$r.squared)
##         4         6         8 
## 0.5086326 0.4645102 0.4229655

Shortcut:

models %>% 
  map(summary) %>% 
  map_dbl("r.squared")
##         4         6         8 
## 0.5086326 0.4645102 0.4229655

Select elements by position:

x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)  # figure out why
## [1] 2 5 8

Base R

Base R has *apply() functions:

  • lapply() is basically identical to map(), except that map() is consistent with all the other functions in purrr, and you can use the shortcuts for functional arguments.

  • Base sapply() is a wrapper around lapply() that automatically simplifies the output. This is useful for interactive work but is problematic in a function because you never know what sort of output you’ll get:

    x1 <- list(
      c(0.27, 0.37, 0.57, 0.91, 0.20),
      c(0.90, 0.94, 0.66, 0.63, 0.06), 
      c(0.21, 0.18, 0.69, 0.38, 0.77)
    )
    x2 <- list(
      c(0.50, 0.72, 0.99, 0.38, 0.78), 
      c(0.93, 0.21, 0.65, 0.13, 0.27), 
      c(0.39, 0.01, 0.38, 0.87, 0.34)
    )
    
    threshold <- function(x, cutoff = 0.8) x[x > cutoff]
    x1 %>% sapply(threshold) %>% str()
    ## List of 3
    ##  $ : num 0.91
    ##  $ : num [1:2] 0.9 0.94
    ##  $ : num(0)
    x2 %>% sapply(threshold) %>% str()
    ##  num [1:3] 0.99 0.93 0.87
  • vapply() is a safe alternative to sapply() because you supply an additional argument that defines the type: vapply(df, is.numeric, logical(1)) is equivalent to map_lgl(df, is.numeric). vapply() can also produce matrices — purrr’s map functions only ever produce vectors.

Dealing with failure

When you use the map functions to repeat many operations, the chances are much higher that one of those operations will fail. When this happens, you’ll get an error message, and no output.

safely() provides a convenient error handling:

safe_log <- safely(log)
str(safe_log(10))
## List of 2
##  $ result: num 2.3
##  $ error : NULL
str(safe_log("a"))
## List of 2
##  $ result: NULL
##  $ error :List of 2
##   ..$ message: chr "non-numeric argument to mathematical function"
##   ..$ call   : language .Primitive("log")(x, base)
##   ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

safely() is designed to work with map:

x <- list(1, 10, "a")
y <- x %>% map(safely(log))
str(y)
## List of 3
##  $ :List of 2
##   ..$ result: num 0
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: num 2.3
##   ..$ error : NULL
##  $ :List of 2
##   ..$ result: NULL
##   ..$ error :List of 2
##   .. ..$ message: chr "non-numeric argument to mathematical function"
##   .. ..$ call   : language .Primitive("log")(x, base)
##   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Transpose:

y <- y %>% purrr::transpose()
str(y)
## List of 2
##  $ result:List of 3
##   ..$ : num 0
##   ..$ : num 2.3
##   ..$ : NULL
##  $ error :List of 3
##   ..$ : NULL
##   ..$ : NULL
##   ..$ :List of 2
##   .. ..$ message: chr "non-numeric argument to mathematical function"
##   .. ..$ call   : language .Primitive("log")(x, base)
##   .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"

Rescue the successes:

is_ok <- y$error %>% map_lgl(is_null)
x[!is_ok]
## [[1]]
## [1] "a"
y$result[is_ok] %>% flatten_dbl()
## [1] 0.000000 2.302585

Other options:

Mapping over multiple arguments

Suppose you want to simulate some random normals with different means.

Fixed standard deviation: use map():

mu <- list(5, 10, -3)
mu %>% 
  map(rnorm, n = 5) %>% 
  str()
## List of 3
##  $ : num [1:5] 5.1 3.26 4.94 4.32 5.29
##  $ : num [1:5] 8.93 10.2 10.65 9.63 9.36
##  $ : num [1:5] -4.699 -3.136 -3.524 -2.733 -0.685

What if you also want to vary the standard deviation? Use map2() which iterates over two vectors in parallel:

sigma <- list(1, 5, 10)
map2(mu, sigma, rnorm, n = 5) %>% str()
## List of 3
##  $ : num [1:5] 6.02 3.96 6.21 4.41 4.6
##  $ : num [1:5] 7.34 10.75 4.58 10.92 16.41
##  $ : num [1:5] -23.31 5.28 -22.32 3.12 -9.5

In a for loop:

map2 <- function(x, y, f, ...) {
  out <- vector("list", length(x))
  for (i in seq_along(x)) {
    out[[i]] <- f(x[[i]], y[[i]], ...)
  }
  out
}

More than two arguments, use pmap(), which takes a list of arguments.

n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>%
  pmap(rnorm) %>% 
  str()
## List of 3
##  $ : num 4.7
##  $ : num [1:3] 16.78 8.58 8.25
##  $ : num [1:5] -2.72 -4.32 -18.57 -7.04 -1.13

With names:

args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>% 
  pmap(rnorm) %>% 
  str()

Since the arguments are all the same length, it makes sense to store them in a data frame:

params <- tribble(
  ~mean, ~sd, ~n,
    5,     1,  1,
   10,     5,  3,
   -3,    10,  5
)
params %>% 
  pmap(rnorm)
## [[1]]
## [1] 6.516726
## 
## [[2]]
## [1]  4.042988 11.213530 10.388460
## 
## [[3]]
## [1]  1.1015247 10.8843908 -0.8897356  2.9763124 -3.4160460

Invoking different functions

You might also vary the function itself:

f <- c("runif", "rnorm", "rpois")
param <- list(
  list(min = -1, max = 1), 
  list(sd = 5), 
  list(lambda = 10)
)

Use invoke_map():

invoke_map(f, param, n = 5) %>% str()
## List of 3
##  $ : num [1:5] 0.821 0.303 -0.425 -0.99 0.541
##  $ : num [1:5] -6.07 -4.89 -4.23 -4.09 2.75
##  $ : int [1:5] 6 7 15 10 7

Again, you can use tribble(): to make creating these matching pairs a little easier:

sim <- tribble(
  ~f,      ~params,
  "runif", list(min = -1, max = 1),
  "rnorm", list(sd = 5),
  "rpois", list(lambda = 10)
)
sim %>% 
  mutate(sim = invoke_map(f, params, n = 10))

Walk

If you want to call a function for its side effects, rather than for its return value, use walk().

Typical application: render output to the screen or save files to disk

x <- list(1, "a", 3)

x %>% 
  walk(print)
## [1] 1
## [1] "a"
## [1] 3

There are also walk2() or pwalk().

library(ggplot2)
plots <- mtcars %>% 
  split(.$cyl) %>% 
  map(~ggplot(., aes(mpg, wt)) + geom_point())
paths <- stringr::str_c(names(plots), ".pdf")

pwalk(list(paths, plots), ggsave, path = tempdir())
## Saving 5 x 3.5 in image
## Saving 5 x 3.5 in image
## Saving 5 x 3.5 in image

walk(), walk2() and pwalk() all invisibly return .x, the first argument. This makes them suitable for use in the middle of pipelines.

Reduce and accumulate

The reduce function takes a “binary” function (i.e. a function with two primary inputs), and applies it repeatedly to a list until there is only a single element left:

Reduce a list of data frames to a single data frame by joining the elements together:

dfs <- list(
  age = tibble(name = "John", age = 30),
  sex = tibble(name = c("John", "Mary"), sex = c("M", "F")),
  trt = tibble(name = "Mary", treatment = "A")
)

dfs %>% reduce(full_join)
## Joining, by = "name"
## Joining, by = "name"

Find the intersection within a list of vectors:

vs <- list(
  c(1, 3, 5, 6, 10),
  c(1, 2, 3, 7, 8, 10),
  c(1, 2, 3, 4, 8, 9, 10)
)

vs %>% reduce(intersect)
## [1]  1  3 10

Accumulate is similar but it keeps all the interim results. You could use it to implement a cumulative sum:

x <- sample(10)
x
##  [1]  9  8  6  5  1 10  3  7  4  2
x %>% accumulate(`+`)
##  [1]  9 17 23 28 29 39 42 49 53 55

But there is base::cumsum():

cumsum(x)
##  [1]  9 17 23 28 29 39 42 49 53 55