This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.0 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2 tools_3.5.0 htmltools_0.3.6 yaml_2.1.19
## [8] Rcpp_0.12.17 stringi_1.2.3 rmarkdown_1.10 knitr_1.20 stringr_1.3.1 digest_0.6.15 evaluate_0.10.1
Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?
Code planning: What does the code need to do? How will the code fit together? What functions will be used? What are their inputs/outputs? etc.
Implementation:
Prototype functions, classes, partial documentation, etc.
Write unit tests
Implement code, run unit tests, debug
Broader testing, more debugging
Profile code, identify bottlenecks
Optimize code
Conduct experiments.
Full documentation.
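As a small illustration of the prototype, test, implement, and time cycle listed above, here is a minimal sketch in base R. The function name sum_squares and its test cases are hypothetical and chosen only for this example; stopifnot() stands in for a full unit-testing framework.

```r
# Hypothetical function under development (not part of the lecture code)
sum_squares <- function(x) {
  sum(x^2)
}

# Unit tests: every expression must be TRUE, otherwise stopifnot() raises an error
stopifnot(
  sum_squares(numeric(0)) == 0,
  sum_squares(c(3, 4)) == 25,
  sum_squares(-2) == 4
)

# Crude timing of one call; a profiler such as Rprof() gives a finer breakdown
x <- rnorm(1e6)
timing <- system.time(sum_squares(x))
print(timing["elapsed"])
```

Writing the tests before optimizing makes it cheap to confirm that a faster version still returns the same answers.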
After profiling, what to do to improve performance?
Ask: Are there obvious speedups? Are things being computed unnecessarily? Are you using a data.frame where you should be using a matrix, etc.?
Look up your problem (e.g., search for “lapply slow” or “speeding up lapply”).
Try the just-in-time (JIT) compiler.
Consider re-writing some or all of the code in a compiled language (e.g., C/C++).
Try parallelization.
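To make the data.frame versus matrix question concrete, the sketch below times the same row-wise loop on both containers. The helper name row_sums_loop and the 100 x 100 size are assumptions made for this illustration; actual timings depend on the machine, so none are quoted here.

```r
set.seed(1)
m  <- matrix(rnorm(1e4), nrow = 100)  # numeric matrix
df <- as.data.frame(m)                # same numbers stored as a data.frame

# Hypothetical helper: sum each row with an explicit loop
row_sums_loop <- function(obj) {
  out <- numeric(nrow(obj))
  for (i in seq_len(nrow(obj))) {
    out[i] <- sum(obj[i, ])
  }
  out
}

t_df <- system.time(r_df <- row_sums_loop(df))
t_m  <- system.time(r_m  <- row_sums_loop(m))

# Both containers give the same answer; compare the elapsed times yourself
print(all.equal(r_df, r_m))
print(rbind(data.frame = t_df["elapsed"], matrix = t_m["elapsed"]))
```

Row indexing of a data.frame builds a new one-row data.frame on every iteration, which is usually the more expensive operation.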
R typical execution:
The compiler package (by Luke Tierney) can increase the speed of some code. Using the compiler is an easy way to get improvements in speed.
Brute-force for loop for summing a vector:
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
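One caveat about the loop bounds: 1:length(x) evaluates to c(1, 0) when x is empty, so the loop body still runs. A hedged variant using seq_along() avoids this; the name sum_r_safe is introduced here only for illustration and is not part of the benchmarks in this note.

```r
# Variant of the brute-force sum that handles zero-length input
sum_r_safe <- function(x) {
  sumx <- 0.0
  for (i in seq_along(x)) {  # seq_along(numeric(0)) is empty, so no iterations
    sumx <- sumx + x[i]
  }
  sumx
}

print(sum_r_safe(c(1, 2, 3)))
print(sum_r_safe(numeric(0)))
```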
Run the code on a vector of about 1,000,000 elements (the sequence below has 1,000,001):
library(microbenchmark)
library(ggplot2)
x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 43.77128 44.52665 46.11466 45.18986 46.39202 67.15905 100
Let’s compile the function into bytecode (sum_rc) and benchmark again:
library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7f8aaf8fbdc8>
Benchmark again:
microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 43.61191 46.09215 58.03362 51.46300 57.24237 209.0887 100
## sum_rc(x) 43.69873 46.56171 59.62327 51.25124 60.90177 236.6265 100
Surprisingly, compiling into bytecode does not help at all! The following output shows that the function sum_r is already compiled into bytecode (note the final bytecode line):
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7f8aadea25e0>
Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:
enableJIT(0) # set JIT level to 0 (disabled); the previous level is returned
## [1] 3
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 324.3106 335.1662 362.6072 344.2958 373.4503 597.8557 100
Now we witness the slowness of the un-compiled sum_r.
Documentation of enableJIT:
enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.
Since R 3.4.0 (Apr 2017), the JIT bytecode compiler is enabled by default at its level 3.
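The documented behavior of enableJIT can be checked directly. The sketch below switches JIT off, queries the level with a negative argument, and then restores the previous setting; the variable names are chosen for this example only.

```r
library(compiler)

previous  <- enableJIT(0)    # switch JIT off; the previous level is returned
level_off <- enableJIT(-1)   # negative argument: only report the current level
enableJIT(previous)          # restore the level that was active before
level_restored <- enableJIT(-1)

print(c(previous = previous, off = level_off, restored = level_restored))
```

Saving and restoring the level keeps a benchmarking script from silently leaving JIT disabled for the rest of the session.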
If you create a package, you can have it byte-compiled on installation by adding ByteCompile: true to the DESCRIPTION file.
Matlab has employed JIT technology since 2002, and Julia is designed entirely around JIT. R is finally in the same boat.
Learning sources:
- Advanced R: http://adv-r.had.co.nz/Rcpp.html
The JIT compiler compiles R code into bytecode, which is then executed by R’s bytecode interpreter. Low-level languages such as C, C++, and Fortran are compiled directly into machine code, which usually yields the best efficiency.
cppFunction
The Rcpp package provides a convenient way to embed C++ code in R code.
library(Rcpp)
cppFunction('double sum_c(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
sum_c
## function (x)
## .Call(<pointer: 0x109900610>, x)
Benchmark (1) the compiled C++ function sum_c together with (2) the R function sum_r, (3) the compiled R function sum_rc, and (4) the sum function in base R:
mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: microseconds
## expr min lq mean median uq max neval
## sum_r(x) 318287.934 332699.480 350932.391 338366.125 349358.630 552131.519 100
## sum_rc(x) 43541.411 44116.060 45615.799 44813.865 46033.084 64826.778 100
## sum_c(x) 1238.881 1320.706 1351.207 1329.735 1365.646 1670.570 100
## sum(x) 949.047 1001.510 1066.391 1022.864 1121.369 1429.732 100
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Remember that we turned off JIT with enableJIT(0) earlier, which is why sum_r is much slower than sum_rc here.
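The median timings in the benchmark table can be turned into relative slowdowns with a few lines of arithmetic. The sketch below copies the four medians (in microseconds) from the table; the variable names are chosen for this example.

```r
# Median timings in microseconds, copied from the microbenchmark output above
med <- c(sum_r = 338366.125, sum_rc = 44813.865, sum_c = 1329.735, sum = 1022.864)

# How many times slower each version is than base sum()
slowdown <- med / med["sum"]
print(round(slowdown, 1))

# Gain from bytecode compilation alone (interpreted loop vs. compiled loop)
bytecode_gain <- med[["sum_r"]] / med[["sum_rc"]]
print(round(bytecode_gain, 2))
```

In these numbers the bytecode-compiled loop is several times faster than the interpreted one, while the C++ loop runs within a small factor of base sum().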
sourceCpp
In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example, consider sum.cpp:
cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
##
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## // http://www.rcpp.org/
## // http://adv-r.had.co.nz/Rcpp.html
## // http://gallery.rcpp.org/
## //
##
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
## int n = x.size();
## double total = 0;
## for(int i = 0; i < n; ++i) {
## total += x[i];
## }
## return total;
## }
##
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically
## // run after the compilation.
## //
##
## /*** R
## sum_c(as.double(1:10))
## */
Rcpp::sourceCpp() parses the specified C++ file or source code:
sourceCpp("sum.cpp")
##
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x)
## .Call(<pointer: 0x109953610>, x)
Fact: base R is single-threaded. Even if you request a fancy instance with 96 vCPUs, plain R code uses only one of them, i.e., 1/96th of its power.
To perform multi-core computation in R:
Option 1: Manually run multiple R sessions.
Option 2: Make multiple system("Rscript") calls. Typically automated by a scripting language (Python, Perl, shell script) or within R.
Option 3: Use package parallel.
parallel package in R.
Authors: Brian Ripley, Luke Tierney, Simon Urbanek.
Included in base R since 2.14.0 (2011).
Based on the snow (Luke Tierney) and multicore (Simon Urbanek) packages.
To find the number of cores:
library(parallel)
detectCores()
## [1] 4
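detectCores() counts logical cores by default and may return NA on some platforms. The sketch below also queries physical cores and derives a conservative worker count; leaving one core free is an assumption of this example, not a rule from the parallel package.

```r
library(parallel)

n_logical  <- detectCores()                 # logical cores (may be NA)
n_physical <- detectCores(logical = FALSE)  # physical cores (may be NA)

# Leave one core for the operating system, but never go below 1 worker
n_workers <- max(1L, n_logical - 1L, na.rm = TRUE)

print(c(logical = n_logical, physical = n_physical, workers = n_workers))
```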
Let’s revisit the simulation example considered in an earlier lecture and HW1:
We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.
## check if a given integer is prime
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
## estimate mean only using observation with prime indices
estMeanPrimes = function(x) {
n <- length(x)
ind <- sapply(1:n, isPrime)
return (mean(x[ind]))
}
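To see which indices this estimator actually uses, the sketch below repeats the isPrime definition from above (so the block runs on its own) and lists the selected indices up to 20. Note that the n <= 3 shortcut also accepts index 1, which is not a prime number.

```r
# Copy of the lecture's primality check
isPrime <- function(n) {
  if (n <= 3) {
    return(TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return(FALSE)
  }
  return(TRUE)
}

# Indices among 1..20 that the prime-indexed estimator would average over
prime_idx <- which(sapply(1:20, isPrime))
print(prime_idx)
# → 1 2 3 5 7 11 13 17 19
```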
We want to compare our method to the traditional sample average estimator by simulation studies.
## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
# set seed according to command argument `seed`
set.seed(seed)
# initialize running sums of squared errors for the two estimators
msePrimeAvg <- 0.0
mseSamplAvg <- 0.0
# loop over simulation replicates
for (r in 1:reps) {
# simulate data according to arguments `n` and `dist`
if (dist == "gaussian") {
x = rnorm(n)
} else if (dist == "t1") {
x = rcauchy(n)
} else if (dist == "t5") {
x = rt(n, 5)
} else {
stop(paste("unrecognized dist: ", dist))
}
# prime indexed mean estimator and classical sample average estimator
msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2
mseSamplAvg <- mseSamplAvg + mean(x)^2
}
mseSamplAvg <- mseSamplAvg / reps
msePrimeAvg <- msePrimeAvg / reps
return(c(mseSamplAvg, msePrimeAvg))
}
We need to loop over 3 generative models (distTypes) and 19 sample sizes (nVals). That makes 57 “embarrassingly parallel” tasks.
seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")
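The task list can be counted and laid out explicitly. The sketch below builds one row per (n, dist) combination with expand.grid, which varies its first argument fastest; the name tasks is introduced here for illustration only.

```r
nVals <- seq(100, 1000, by = 50)
distTypes <- c("gaussian", "t5", "t1")

# One row per task: n varies fastest, dist slowest
tasks <- expand.grid(n = nVals, dist = distTypes, stringsAsFactors = FALSE)

print(length(nVals))  # number of sample sizes
print(nrow(tasks))    # number of independent tasks
print(head(tasks, 3))
```

This ordering matches the pair of argument vectors rep(distTypes, each = length(nVals)) and rep(nVals, times = length(distTypes)).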
This is the serial code that loops over all combinations of distTypes and nVals with a double loop:
## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
for (dist in distTypes) {
for (n in nVals) {
simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
i <- i + 2
}
}
)
## user system elapsed
## 36.392 0.280 37.485
simres1
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
mcmapply
Run the same task using the mcmapply function (the parallel analog of mapply) in the parallel package:
## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
simres2 <- mcmapply(compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
mc.cores = 4)
})
## user system elapsed
## 44.130 0.645 18.524
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
We see roughly a 2x speedup in elapsed time (18.5 s versus 37.5 s) with mc.cores=4.
mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g., Linux, macOS) and cannot run in parallel on Windows.
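A script that should run on every platform can choose the core count based on the operating system. The sketch below is a minimal example under that assumption; add_pair is a hypothetical toy function, and two cores are requested only to keep the example light.

```r
library(parallel)

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Hypothetical toy task: add corresponding elements of two vectors
add_pair <- function(a, b) a + b

res <- mcmapply(add_pair, 1:4, 5:8, mc.cores = n_cores)
print(res)
# → 6 8 10 12
```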
parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions create a cluster of workers based on either socket (default) or forking. Socket is available on all platforms: Linux, macOS, and Windows.
clusterMap
The same simulation example using the clusterMap function:
# makeCluster() creates a socket (PSOCK) cluster by default, which also works on Windows
cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
simres3 <- clusterMap(cl, compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
.scheduling = "dynamic")
})
## user system elapsed
## 0.024 0.006 12.271
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
Here we see roughly a 3x speedup in elapsed time (12.3 s versus 37.5 s) by using 4 workers.
clusterExport copies the named objects (here the three functions) from the master process to the global environment of each worker.
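The effect of clusterExport can be seen on a toy problem. In the sketch below, scale_by and scaled_square are hypothetical names used only for this illustration; without the export, the workers would fail to find them. The tryCatch(..., finally = ) wrapper is one way to make sure the cluster is stopped even if a worker errors.

```r
library(parallel)

# Hypothetical objects defined on the master only
scale_by <- 10
scaled_square <- function(v) scale_by * v^2

cl <- makeCluster(2)  # small socket cluster for the example
res <- tryCatch({
  # Copy the two named objects into each worker's global environment
  clusterExport(cl, c("scale_by", "scaled_square"), envir = environment())
  parSapply(cl, 1:4, function(v) scaled_square(v))
}, finally = stopCluster(cl))

print(res)
# → 10 40 90 160
```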
It is also possible to distribute computation over a network of computers (“cluster”).
Learning resources:
- Book: _R Packages_ by Hadley Wickham
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio