Acknowledgment

This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.

Session informnation for reproducibility:

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.0  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2 tools_3.5.0     htmltools_0.3.6 yaml_2.1.19    
##  [8] Rcpp_0.12.17    stringi_1.2.3   rmarkdown_1.10  knitr_1.20      stringr_1.3.1   digest_0.6.15   evaluate_0.10.1

Advanced R

To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must read. Read now to save numerous hours you might waste in future.

We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.

Style

Benchmark

Sources:
- Advanced R: http://adv-r.had.co.nz/Performance.html
- Blog: http://www.alexejgossmann.com/benchmarking_r/

In order to identify performance issue, we need to measure runtime accurately.

system.time

set.seed(280)
x <- runif(1e6)

system.time({sqrt(x)})
##    user  system elapsed 
##   0.010   0.003   0.014
system.time({x ^ 0.5})
##    user  system elapsed 
##   0.039   0.001   0.041
system.time({exp(log(x) / 2)})
##    user  system elapsed 
##   0.024   0.000   0.025

From William Dunlap:

“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.

rbenchmark

library("rbenchmark")

benchmark(
  "sqrt(x)" = {sqrt(x)},
  "x^0.5" = {x ^ 0.5},
  "exp(log(x)/2)" = {exp(log(x) / 2)},
  replications = 100,
  order = "elapsed"
)
##            test replications elapsed relative user.self sys.self user.child sys.child
## 1       sqrt(x)          100   0.490    1.000     0.355    0.134          0         0
## 3 exp(log(x)/2)          100   1.952    3.984     1.813    0.128          0         0
## 2         x^0.5          100   2.870    5.857     2.705    0.142          0         0

relative is the ratio with the fastest one.

microbenchmark

library("microbenchmark")
library("ggplot2")

mbm <- microbenchmark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2)
)
mbm
## Unit: milliseconds
##           expr       min        lq     mean    median       uq      max neval
##        sqrt(x)  2.163406  3.269818  4.06793  3.449744  3.62813 11.53134   100
##          x^0.5 23.911824 25.512446 27.57804 26.592014 29.06186 36.20914   100
##  exp(log(x)/2) 15.983428 17.290515 19.10205 18.008370 19.31630 28.99174   100

Results from microbenchmark can be nicely plotted in base R or ggplot2.

boxplot(mbm)

autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Profiling

Premature optimization is the root of all evil (or at least most of it) in programming.
-Don Knuth

Sources: - http://adv-r.had.co.nz/Profiling.html
- https://rstudio.github.io/profvis/
- https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-with-RStudio

First example

library(profvis)

profvis({
  data(diamonds, package = "ggplot2")

  plot(price ~ carat, data = diamonds)
  m <- lm(price ~ carat, data = diamonds)
  abline(m, col = "red")
})

Example: profiling time

First generate test data:

times <- 4e5
cols <- 150
data <-
  as.data.frame(x = matrix(rnorm(times * cols, mean = 5),
                           ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)

Original code for centering columns of a dataframe:

profvis({
  # Store in another variable for this run
  data1 <- data
  
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})

Profile apply vs colMeans vs lapply vs vapply:

profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})

We decide to use vapply:

profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))

  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <- data1[, names(data1) != "id"][, i] - means[i]
  }
})

Calculate mean and center in one pass:

profvis({
 data1 <- data
 
 # Given a column, normalize values and return them
 col_norm <- function(col) {
 col - mean(col)
 }
 
 # Apply the normalizer function over all columns except id
 data1[, names(data1) != "id"] <-
   lapply(data1[, names(data1) != "id"], col_norm)
})

Example: profiling memory

Original code for cumulative sums:

profvis({
  data <- data.frame(value = runif(1e5))

  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i-1] + data$value[i]
  }
})

Write a function to avoid expensive indexing by $:

profvis({
  csum <- function(x) {
    if (length(x) < 2) return(x)

    sum <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum(data$value)
})

Pre-allocate vector:

profvis({
  csum2 <- function(x) {
    if (length(x) < 2) return(x)

    sum <- numeric(length(x))  # Preallocate
    sum[1] <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i-1] + x[i]
    }
    sum
  }
  data$sum <- csum2(data$value)
})

Advice

Modularize big projects into small functions. Profile functions as early and as frequently as possible.

Debugging

Learning sources:
- Video: https://vimeo.com/99375765
- Advanced R: http://adv-r.had.co.nz/Exceptions-Debugging.html
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio

Key concepts: