Acknowledgment

This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.

Working with text files

View/peek text files

  • cat prints the contents of a file:

    cat ~/.bash_profile
    ## #!/bin/sh
    ## if [ -f ~/.profile ]; then
    ##         source ~/.profile
    ## fi
    ## if [ -f ~/.bashrc ]; then
    ##         source ~/.bashrc
    ## fi
    ## 
    ## # added by Anaconda3 4.4.0 installer
    ## export PATH="/Users/jhwon/anaconda/bin:$PATH"

  • head -l prints the first \(l\) lines of a file:

    head -20 linux2.Rmd
    ## ---
    ## title: "Unix Basics II"
    ## author: "Joong-Ho Won @ SNU"
    ## date: '`r format(Sys.time(), "%B %d, %Y")`'
    ## output: 
    ##   html_document:
    ##     toc: true
    ## bibliography: ../bib-HZ.bib
    ## ---
    ## 
    ## ## Acknowledgment
    ## 
    ## This lecture note is based on [Dr. Hua Zhou](http://hua-zhou.github.io)'s 2018 Winter Statistical Computing course notes available at <http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html>.
    ## 
    ## # Working with text files
    ## 
    ## ## View/peek text files
    ## 
    ## - `cat` prints the contents of a file:
    ##     ```{bash, size='smallsize'}

  • tail -l prints the last \(l\) lines of a file:

    tail -20 linux2.Rmd
    ##     
    ## - Option 1: manually call `runSim.R` for each setting.
    ## 
    ## - Option 2: automate calls using R and `nohup`. [autoSim.R](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/autoSim.R)
    ## 
    ## -
    ##     ```{bash}
    ##     cat autoSim.R
    ##     ```
    ## 
    ## -
    ##     ```{bash}
    ##     Rscript autoSim.R
    ##     ```
    ## 
    ##     ```{bash, echo=FALSE, eval=TRUE}
    ##     rm n*.txt *.Rout
    ##     ```
    ##     
    ## - Now we just need write a script to collect results from the output files.

less is more; more is less

  • more browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the q key.

  • less is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.

  • less doesn’t need to read the whole file before displaying it, so it opens large files faster than more.

grep

grep prints lines that match an expression:

  • Show lines that contain string CentOS:

    # quotes not necessary if not a regular expression
    grep 'CentOS' linux2.Rmd
    ## - Show lines that contain string `CentOS`:
    ##     grep 'CentOS' linux2.Rmd
    ##     grep 'CentOS' *.Rmd
    ##     grep -n 'CentOS' linux2.Rmd
    ## - Replace `CentOS` by `RHEL` in a text file:
    ##     sed 's/CentOS/RHEL/' linux2.Rmd | grep RHEL

  • Search multiple text files:

    grep 'CentOS' *.Rmd
    ## linux1.Rmd:- RHEL/CentOS is popular on servers.
    ## linux1.Rmd:- The teaching servers for this class run CentOS 7.
    ## linux1.Rmd:    CentOS Linux release 7.5.1804 (Core)
    ## linux1.Rmd:NAME="CentOS Linux"
    ## linux1.Rmd:PRETTY_NAME="CentOS Linux 7 (Core)"
    ## linux1.Rmd:CENTOS_MANTISBT_PROJECT="CentOS-7"
    ## linux1.Rmd:CentOS Linux release 7.5.1804 (Core)
    ## linux1.Rmd:CentOS Linux release 7.5.1804 (Core)
    ## linux2.Rmd:- Show lines that contain string `CentOS`:
    ## linux2.Rmd:    grep 'CentOS' linux2.Rmd
    ## linux2.Rmd:    grep 'CentOS' *.Rmd
    ## linux2.Rmd:    grep -n 'CentOS' linux2.Rmd
    ## linux2.Rmd:- Replace `CentOS` by `RHEL` in a text file:
    ## linux2.Rmd:    sed 's/CentOS/RHEL/' linux2.Rmd | grep RHEL

  • Show matching line numbers:

    grep -n 'CentOS' linux2.Rmd
    ## 50:- Show lines that contain string `CentOS`:
    ## 53:    grep 'CentOS' linux2.Rmd
    ## 60:    grep 'CentOS' *.Rmd
    ## 67:    grep -n 'CentOS' linux2.Rmd
    ## 86:- Replace `CentOS` by `RHEL` in a text file:
    ## 88:    sed 's/CentOS/RHEL/' linux2.Rmd | grep RHEL

  • Find all files in current directory with .png extension:

    ls | grep '\.png$'
    ## Richard_Stallman_2013.png
    ## key_authentication_1.png
    ## key_authentication_2.png
    ## linux_directory_structure.png
    ## linux_filepermission.png
    ## linux_filepermission_oct.png
    ## screenshot_top.png
  • Find all directories in the current directory:

    ls -al | grep '^d'
    ## drwxr-xr-x@ 23 jhwon  staff      782 Sep  2 22:17 .
    ## drwxr-xr-x@ 20 jhwon  staff      680 Aug 22 06:29 ..
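The pattern `'\.png$'` above is a regular expression: `\.` matches a literal dot and `$` anchors the match at the end of the line. A small sketch, using three made-up file names instead of a real directory listing, shows what happens when the dot is left unescaped:

```shell
# Hypothetical file names, one per line (stand-in for the output of ls).
names=$(printf '%s\n' plot.png plot.png.bak notes_png)

# Escaped dot: only names that end in a literal ".png" match.
escaped=$(printf '%s\n' "$names" | grep '\.png$')

# Unescaped dot: "." matches any character, so "notes_png" matches too.
unescaped=$(printf '%s\n' "$names" | grep '.png$')

echo "escaped:";   echo "$escaped"
echo "unescaped:"; echo "$unescaped"
```

The single quotes keep the shell from interpreting the backslash and `$` before grep sees them.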

sed

  • sed is a stream editor.

  • Replace CentOS by RHEL in a text file:

    sed 's/CentOS/RHEL/' linux2.Rmd | grep RHEL
    ## - Show lines that contain string `RHEL`:
    ##     grep 'RHEL' linux2.Rmd
    ##     grep 'RHEL' *.Rmd
    ##     grep -n 'RHEL' linux2.Rmd
    ## - Replace `RHEL` by `RHEL` in a text file:
    ##     sed 's/RHEL/RHEL/' linux2.Rmd | grep RHEL
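Two details of the command above are easy to miss: `s/CentOS/RHEL/` replaces only the first match on each line, and sed writes the result to standard output without touching the file. A minimal sketch on a made-up line contrasts the plain substitution with the `g` (global) flag:

```shell
# Illustrative input line with two occurrences of the search string.
line='CentOS 7 and CentOS 8'

# Without a flag, only the first occurrence on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/')

# With the g flag, every occurrence on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/CentOS/RHEL/g')

echo "$first_only"
echo "$every"
```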

awk

  • awk is a filter and report writer.

  • Print sorted list of login names:

    awk -F: '{ print $1 }' /etc/passwd | sort | head -5
    ## #
    ## # 
    ## # Note that this file is consulted directly only when the system is running
    ## # Open Directory.
    ## # Open Directory.

  • Print number of lines in a file, as NR stands for Number of Records (by default each line is one record):

    awk 'END { print NR }' /etc/passwd
    ## 86

    or

    wc -l /etc/passwd
    ##       86 /etc/passwd

    or

    wc -l < /etc/passwd
    ##       86

  • Print login names with UID in range 1000-1035:

    awk -F: '$3 >= 1000 && $3 <= 1035 { print $1 }' /etc/passwd
  • Print login names and log-in shells in comma-separated format:

    awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
    ## ##,
    ## # User Database,
    ## # ,
    ## # Note that this file is consulted directly only when the system is running,
    ## # in single-user mode.  At other times this information is provided by,
    ## # Open Directory.,
    ## #,
    ## # See the opendirectoryd(8) man page for additional information about,
    ## # Open Directory.,
    ## ##,
    ## nobody,/usr/bin/false
    ## root,/bin/sh
    ## daemon,/usr/bin/false
    ## _uucp,/usr/sbin/uucico
    ## _taskgated,/usr/bin/false
    ## _networkd,/usr/bin/false
    ## _installassistant,/usr/bin/false
    ## _lp,/usr/bin/false
    ## _postfix,/usr/bin/false
    ## _scsd,/usr/bin/false
    ## _ces,/usr/bin/false
    ## _mcxalr,/usr/bin/false
    ## _appleevents,/usr/bin/false
    ## _geod,/usr/bin/false
    ## _serialnumberd,/usr/bin/false
    ## _devdocs,/usr/bin/false
    ## _sandbox,/usr/bin/false
    ## _mdnsresponder,/usr/bin/false
    ## _ard,/usr/bin/false
    ## _www,/usr/bin/false
    ## _eppc,/usr/bin/false
    ## _cvs,/usr/bin/false
    ## _svn,/usr/bin/false
    ## _mysql,/usr/bin/false
    ## _sshd,/usr/bin/false
    ## _qtss,/usr/bin/false
    ## _cyrus,/usr/bin/false
    ## _mailman,/usr/bin/false
    ## _appserver,/usr/bin/false
    ## _clamav,/usr/bin/false
    ## _amavisd,/usr/bin/false
    ## _jabber,/usr/bin/false
    ## _appowner,/usr/bin/false
    ## _windowserver,/usr/bin/false
    ## _spotlight,/usr/bin/false
    ## _tokend,/usr/bin/false
    ## _securityagent,/usr/bin/false
    ## _calendar,/usr/bin/false
    ## _teamsserver,/usr/bin/false
    ## _update_sharing,/usr/bin/false
    ## _installer,/usr/bin/false
    ## _atsserver,/usr/bin/false
    ## _ftp,/usr/bin/false
    ## _unknown,/usr/bin/false
    ## _softwareupdate,/usr/bin/false
    ## _coreaudiod,/usr/bin/false
    ## _screensaver,/usr/bin/false
    ## _locationd,/usr/bin/false
    ## _trustevaluationagent,/usr/bin/false
    ## _timezone,/usr/bin/false
    ## _lda,/usr/bin/false
    ## _cvmsroot,/usr/bin/false
    ## _usbmuxd,/usr/bin/false
    ## _dovecot,/usr/bin/false
    ## _dpaudio,/usr/bin/false
    ## _postgres,/usr/bin/false
    ## _krbtgt,/usr/bin/false
    ## _kadmin_admin,/usr/bin/false
    ## _kadmin_changepw,/usr/bin/false
    ## _devicemgr,/usr/bin/false
    ## _webauthserver,/usr/bin/false
    ## _netbios,/usr/bin/false
    ## _warmd,/usr/bin/false
    ## _dovenull,/usr/bin/false
    ## _netstatistics,/usr/bin/false
    ## _avbdeviced,/usr/bin/false
    ## _krb_krbtgt,/usr/bin/false
    ## _krb_kadmin,/usr/bin/false
    ## _krb_changepw,/usr/bin/false
    ## _krb_kerberos,/usr/bin/false
    ## _krb_anonymous,/usr/bin/false
    ## _assetcache,/usr/bin/false
    ## _coremediaiod,/usr/bin/false
    ## _xcsbuildagent,/usr/bin/false
    ## _xcscredserver,/usr/bin/false
    ## _launchservicesd,/usr/bin/false

  • Print login names and indicate those with UID >= 1000 as vip:

    awk -F: -v status="" '{OFS = ","} 
    {if ($3 >= 1000) status="vip"; else status="regular"} 
    {print $1, status}' /etc/passwd
    ## ##,regular
    ## # User Database,regular
    ## # ,regular
    ## # Note that this file is consulted directly only when the system is running,regular
    ## # in single-user mode.  At other times this information is provided by,regular
    ## # Open Directory.,regular
    ## #,regular
    ## # See the opendirectoryd(8) man page for additional information about,regular
    ## # Open Directory.,regular
    ## ##,regular
    ## nobody,regular
    ## root,regular
    ## daemon,regular
    ## _uucp,regular
    ## _taskgated,regular
    ## _networkd,regular
    ## _installassistant,regular
    ## _lp,regular
    ## _postfix,regular
    ## _scsd,regular
    ## _ces,regular
    ## _mcxalr,regular
    ## _appleevents,regular
    ## _geod,regular
    ## _serialnumberd,regular
    ## _devdocs,regular
    ## _sandbox,regular
    ## _mdnsresponder,regular
    ## _ard,regular
    ## _www,regular
    ## _eppc,regular
    ## _cvs,regular
    ## _svn,regular
    ## _mysql,regular
    ## _sshd,regular
    ## _qtss,regular
    ## _cyrus,regular
    ## _mailman,regular
    ## _appserver,regular
    ## _clamav,regular
    ## _amavisd,regular
    ## _jabber,regular
    ## _appowner,regular
    ## _windowserver,regular
    ## _spotlight,regular
    ## _tokend,regular
    ## _securityagent,regular
    ## _calendar,regular
    ## _teamsserver,regular
    ## _update_sharing,regular
    ## _installer,regular
    ## _atsserver,regular
    ## _ftp,regular
    ## _unknown,regular
    ## _softwareupdate,regular
    ## _coreaudiod,regular
    ## _screensaver,regular
    ## _locationd,regular
    ## _trustevaluationagent,regular
    ## _timezone,regular
    ## _lda,regular
    ## _cvmsroot,regular
    ## _usbmuxd,regular
    ## _dovecot,regular
    ## _dpaudio,regular
    ## _postgres,regular
    ## _krbtgt,regular
    ## _kadmin_admin,regular
    ## _kadmin_changepw,regular
    ## _devicemgr,regular
    ## _webauthserver,regular
    ## _netbios,regular
    ## _warmd,regular
    ## _dovenull,regular
    ## _netstatistics,regular
    ## _avbdeviced,regular
    ## _krb_krbtgt,regular
    ## _krb_kadmin,regular
    ## _krb_changepw,regular
    ## _krb_kerberos,regular
    ## _krb_anonymous,regular
    ## _assetcache,regular
    ## _coremediaiod,regular
    ## _xcsbuildagent,regular
    ## _xcscredserver,regular
    ## _launchservicesd,regular
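The outputs above include the comment header of the macOS `/etc/passwd`, because awk processes every line. One way to skip those lines is to put a pattern in front of the action. The sketch below assumes a small made-up passwd-style file (the user `alice` is hypothetical) so that it runs the same way on any machine:

```shell
# A tiny passwd-like sample file; the entries are illustrative only.
tmpfile=$(mktemp)
cat > "$tmpfile" <<'EOF'
##
# User Database
##
root:*:0:0:System Administrator:/var/root:/bin/sh
alice:*:1001:20:Alice:/home/alice:/bin/bash
EOF

# !/^#/ selects only lines that do not start with "#".
# BEGIN runs once before any input, a natural place to set OFS.
logins=$(awk -F: 'BEGIN { OFS = "," } !/^#/ { print $1, $7 }' "$tmpfile")

# Same filter, with the vip/regular label computed by a conditional expression.
labels=$(awk -F: '!/^#/ { print $1 "," ($3 >= 1000 ? "vip" : "regular") }' "$tmpfile")

echo "$logins"
echo "$labels"
rm -f "$tmpfile"
```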

Piping and redirection

  • | sends output from one command as input of another command.

  • > directs output from one command to a file.

  • >> appends output from one command to a file.

  • < reads input from a file.

  • Combinations of shell commands (grep, sed, awk, …), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.
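The four operators can be tried together on a scratch file. The following is a minimal sketch; the directory, file name, and text are made up for illustration:

```shell
workdir=$(mktemp -d)
log="$workdir/log.txt"

echo "first line"  >  "$log"     # > creates the file, or overwrites it if it exists
echo "second line" >> "$log"     # >> appends to the end of the file

# < feeds the file to wc on standard input; tr removes the padding
# that some wc implementations put in front of the count.
n_lines=$(wc -l < "$log" | tr -d ' ')

# | chains commands: keep lines containing "line", then convert to upper case.
shouted=$(grep 'line' < "$log" | tr '[:lower:]' '[:upper:]')

echo "$n_lines"
echo "$shouted"
rm -rf "$workdir"
```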

Text editors

Source: Editor War on Wikipedia.

Emacs

  • Emacs is a powerful text editor with extensive support for many languages including R, \(\LaTeX\), Python, and C/C++; however, it’s not installed by default on many Linux distributions.

  • Basic survival commands:
    • emacs filename to open a file with emacs.
    • CTRL-x CTRL-f to open an existing or new file.
    • CTRL-x CTRL-s to save.
    • CTRL-x CTRL-w to save as.
    • CTRL-x CTRL-c to quit.

  • Google emacs cheatsheet

C-<key> means hold the control key, and press <key>.
M-<key> means press the Esc key once, and press <key>.

vi

  • vi is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.

  • Basic survival commands:
    • vi filename to start editing a file.
    • vi is a modal editor: insert mode and normal mode. Pressing i switches from the normal mode to insert mode. Pressing ESC switches from the insert mode to normal mode.
    • :x<Return> quits vi and saves changes.
    • :q!<Return> quits vi without saving latest changes.
    • :w<Return> saves changes.
    • :wq<Return> quits vi and saves changes.

  • Google vi cheatsheet

Line breaks in text files

Processes

Processes

  • OS runs processes on behalf of user.

  • Each process has a process ID (PID), an owning user ID (UID), a parent process ID (PPID), the time and date the process started (STIME), accumulated CPU time (TIME), etc.

    ps
    ##   PID TTY           TIME CMD
    ## 11633 ttys001    0:00.06 /Applications/iTerm.app/Contents/MacOS/iTerm2 --server login -fp jhwon
    ## 11635 ttys001    0:00.57 -bash
    ## 35284 ttys002    0:00.03 /Applications/iTerm.app/Contents/MacOS/iTerm2 --server login -fp jhwon
    ## 35286 ttys002    0:00.05 -bash
    ## 37982 ttys002    0:00.05 ssh wonj@ryan.snu.ac.kr
    ## 38062 ttys003    0:00.04 /Applications/iTerm.app/Contents/MacOS/iTerm2 --server login -fp jhwon
    ## 38064 ttys003    0:00.04 -bash

  • All currently running processes:

    ps -eaf | head 
    ##   UID   PID  PPID   C STIME   TTY           TIME CMD
    ##     0     1     0   0 Fri03PM ??         2:49.66 /sbin/launchd
    ##     0    44     1   0 Fri03PM ??         0:29.53 /usr/libexec/UserEventAgent (System)
    ##     0    45     1   0 Fri03PM ??         0:05.46 /usr/sbin/syslogd
    ##     0    47     1   0 Fri03PM ??         0:03.84 /System/Library/PrivateFrameworks/Uninstall.framework/Resources/uninstalld
    ##     0    48     1   0 Fri03PM ??         0:07.29 /usr/libexec/kextd
    ##     0    49     1   0 Fri03PM ??         0:38.79 /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/FSEvents.framework/Versions/A/Support/fseventsd
    ##     0    51     1   0 Fri03PM ??         0:08.22 /opt/cisco/anyconnect/bin/vpnagentd -execv_instance
    ##     0    52     1   0 Fri03PM ??         0:02.19 /System/Library/PrivateFrameworks/MediaRemote.framework/Support/mediaremoted
    ##    55    55     1   0 Fri03PM ??         0:00.77 /System/Library/CoreServices/appleeventsd --server

  • All Python processes:

    ps -eaf | grep python | tail -3
    ##   501 38380 38319   0 12:42AM ??         0:00.00 sh -c 'bash'  -c 'ps -eaf | grep python | tail -3' 2>&1
    ##   501 38381 38380   0 12:42AM ??         0:00.00 bash -c ps -eaf | grep python | tail -3
    ##   501 38383 38381   0 12:42AM ??         0:00.00 grep python
  • Process with PID=1:

    ps -fp 1
    ##   UID   PID  PPID   C STIME   TTY           TIME CMD
    ##     0     1     0   0 Fri03PM ??         2:49.66 /sbin/launchd

  • All processes owned by a user:

    ps -fu jhwon | head
    ##   UID   PID  PPID   C STIME   TTY           TIME CMD
    ##   501  1704     1   0 Fri03PM ??         0:13.52 /usr/sbin/cfprefsd agent
    ##   501  1707     1   0 Fri03PM ??         2:17.98 /usr/sbin/distnoted agent
    ##   501  8603     1   0 Fri03PM ??         0:08.29 /usr/libexec/UserEventAgent (Aqua)
    ##   501  8605     1   0 Fri03PM ??         0:08.86 /usr/sbin/universalaccessd launchd -s
    ##   501  8606     1   0 Fri03PM ??         0:10.05 /System/Library/CoreServices/ControlStrip.app/Contents/MacOS/TouchBarAgent
    ##   501  8608     1   0 Fri03PM ??         0:04.50 /usr/libexec/lsd
    ##   501  8609     1   0 Fri03PM ??         0:08.50 /System/Library/Frameworks/CoreTelephony.framework/Support/CommCenter -L
    ##   501  8610     1   0 Fri03PM ??         0:33.20 /usr/libexec/trustd --agent
    ##   501  8618     1   0 Fri03PM ??         0:39.37 /System/Library/CoreServices/ControlStrip.app/Contents/MacOS/ControlStrip

Kill processes

  • Kill process with PID=1001:

    kill 1001
  • Kill all R processes.

    killall -r R
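In practice the PID usually comes from `ps` or, for a job you started yourself, from the shell variable `$!`. The sketch below uses `sleep 300` as a stand-in for a long-running job and checks liveness with `kill -0`, which sends no signal and only tests whether the process exists:

```shell
sleep 300 &                      # stand-in for a long-running job, run in the background
pid=$!                           # $! holds the PID of the most recent background job

ps -fp "$pid" || true            # show the job as in the listings above (if ps is installed)

# kill -0 delivers no signal; it only reports whether the process exists.
if kill -0 "$pid" 2>/dev/null; then before=running; else before=gone; fi

kill "$pid"                      # plain kill sends SIGTERM, a polite request to terminate
wait "$pid" 2>/dev/null || true  # reap the job so it leaves the process table

if kill -0 "$pid" 2>/dev/null; then after=running; else after=gone; fi
echo "$before -> $after"
```

If a process ignores SIGTERM, `kill -9` sends SIGKILL, which cannot be caught; it gives the program no chance to clean up, so it is best kept as a last resort.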

top

  • top prints realtime process information (very useful).

    top

Secure shell (SSH)

SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.

  • Recall

    ssh username@your-teaching-server-name-or-ip-address

Use keys over password

  • Key authentication is more secure than password. Most passwords are weak.

  • Scripts or programs may need to systematically SSH into other machines.

  • Log into multiple machines using the same key.

  • Seamless use of many services: Git, svn, Amazon EC2 cloud service, parallel computing on multiple hosts, etc.

  • Many servers only allow key authentication and do not accept password authentication.

Key authentication


  • Public key. Put on the machine(s) you want to log in.

  • Private key. Put on your own computer. Consider this as the actual key in your pocket; never give to others.

  • Messages from the server to your computer are encrypted with your public key. They can only be decrypted using your private key.

  • Messages from your computer to the server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).

Steps for generating keys

  • On Linux or Mac, to generate a key pair:

    ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
    • [KEY_FILENAME] is the name that you want to use for your SSH key files. Without the -f option, ssh-keygen generates a private key file id_rsa and a public key file id_rsa.pub in ~/.ssh, where ~ refers to your home directory. If the filename, e.g., my-ssh-key, is provided, then
      the private key file will be ~/.ssh/my-ssh-key and the public key file will be ~/.ssh/my-ssh-key.pub. In this case, you need to explicitly reference the key in the ssh command:

      ssh user@server -i ~/.ssh/my-ssh-key

      or use a configuration file, ~/.ssh/config like

      Host your-teaching-server-name-or-ip-address
      IdentityFile ~/.ssh/my-ssh-key
    • [USERNAME] is the user for whom you will apply this SSH key. Optional.

    • Use an (optional) passphrase different from your password.

  • Set correct permissions on the .ssh folder and key files

    chmod 400 ~/.ssh/[KEY_FILENAME]

  • Append the public key to the ~/.ssh/authorized_keys file of any Linux machine we want to SSH to, e.g.,

    ssh-copy-id -i ~/.ssh/[KEY_FILENAME] username@your-teaching-server-name-or-ip-address
  • Test your new key.

    ssh -i ~/.ssh/[KEY_FILENAME] username@your-teaching-server-name-or-ip-address
  • If it still asks for the password, log on and change file permissions and try again:

    chmod 0700 ~/.ssh
    chmod 600 ~/.ssh/authorized_keys
  • Now you don’t need a password each time you connect from your machine to the teaching server.


  • If you set a passphrase when generating keys, you’ll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using ssh-agent on Linux/Mac or Pageant on Windows.

  • Same key pair can be used between any two machines. We don’t need to regenerate keys for each new connection.

  • For Windows users, the private key generated by ssh-keygen cannot be directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTY use the converted private key. Read tutorial.
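The permission settings in the steps above can be rehearsed and checked without touching a real `~/.ssh`. This sketch assumes a temporary directory as a stand-in for the home directory, with empty placeholder files; `my-ssh-key` is the example key name used earlier:

```shell
demo=$(mktemp -d)                          # stand-in for your home directory
mkdir "$demo/.ssh"
touch "$demo/.ssh/authorized_keys" "$demo/.ssh/my-ssh-key"

chmod 700 "$demo/.ssh"                     # only the owner may list or enter the folder
chmod 600 "$demo/.ssh/authorized_keys"     # owner may read and write, nobody else
chmod 400 "$demo/.ssh/my-ssh-key"          # private key: readable by the owner only

# The first 10 characters of each ls -l line show the file type and mode.
dir_mode=$(ls -ld "$demo/.ssh" | cut -c1-10)
auth_mode=$(ls -l "$demo/.ssh/authorized_keys" | cut -c1-10)
key_mode=$(ls -l "$demo/.ssh/my-ssh-key" | cut -c1-10)

echo "$dir_mode $auth_mode $key_mode"
rm -rf "$demo"
```

Running the same `ls -ld ~/.ssh` and `ls -l ~/.ssh` on the server is a quick way to see whether overly permissive modes are the reason key authentication is being refused.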

Transfer files between machines

  • scp securely transfers files between machines using SSH.

    ## copy file from local to remote
    scp localfile username@your-teaching-server-name-or-ip-address:/pathtofolder
    ## copy file from remote to local
    scp username@your-teaching-server-name-or-ip-address:/pathtofile pathtolocalfolder
  • sftp is the SSH File Transfer Protocol, a file transfer protocol that runs over SSH.

  • GUIs for Windows (WinSCP) or Mac (Cyberduck).

  • (My preferred way) Use a version control system to sync project files between different machines and systems.

Running R in Linux

Interactive mode

  • Start R in the interactive mode by typing R in shell.

  • Then run R script by

    source("script.R")

Batch mode

  • Demo script meanEst.R implements a (terrible) estimator of mean that averages only the observations with prime indices \[ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{i \text{ is prime}}}{\sum_{i=1}^n 1_{i \text{ is prime}}}. \]

    ## ## check if a given integer is prime
    ## isPrime = function(n) {
    ##   if (n <= 3) {
    ##     return (TRUE)
    ##   }
    ##   if (any((n %% 2:floor(sqrt(n))) == 0)) {
    ##     return (FALSE)
    ##   }
    ##   return (TRUE)
    ## }
    ## 
    ## ## estimate mean only using observation with prime indices
    ## estMeanPrimes = function (x) {
    ##   n = length(x)
    ##   ind = sapply(1:n, isPrime)
    ##   return (mean(x[ind]))
    ## }
    ## 
    ## print(estMeanPrimes(rnorm(100000)))

  • To run your R code non-interactively, i.e., in batch mode, we have at least two options:

    # default output to meanEst.Rout
    R CMD BATCH meanEst.R

    or

    # output to stdout
    Rscript meanEst.R
  • Batch calls are typically automated using a scripting language, e.g., Python, Perl, or a shell script.

Pass arguments to R scripts

  • Specify arguments in R CMD BATCH:

    R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
  • Specify arguments in Rscript:

    Rscript script.R mu=1 sig=2 kap=3
  • Parse command line arguments using magic formula

    for (arg in commandArgs(TRUE)) {
      eval(parse(text=arg))
    }

    in the R script. After the above code runs, every command line argument of the form name=value is available as a variable in the global environment.


  • To understand the magic formula commandArgs, run R by:

    R '--args mu=1 sig=2 kap=3'

    and then issue commands in R

    commandArgs()
    commandArgs(TRUE)

  • Understand the magic formula parse and eval:

    rm(list=ls())
    print(x)
    ## Error in print(x): object 'x' not found
    parse(text="x=3")
    ## expression(x = 3)
    eval(parse(text="x=3"))
    print(x)
    ## [1] 3
  • runSim.R has components: (1) method implementation, (2) data generator with unspecified parameter n, (3) estimation based on generated data, and (4) command argument parser.

    ## ## parsing command arguments
    ## for (arg in commandArgs(TRUE)) {
    ##   eval(parse(text=arg))
    ## }
    ## 
    ## ## check if a given integer is prime
    ## isPrime = function(n) {
    ##   if (n <= 3) {
    ##     return (TRUE)
    ##   }
    ##   if (any((n %% 2:floor(sqrt(n))) == 0)) {
    ##     return (FALSE)
    ##   }
    ##   return (TRUE)
    ## }
    ## 
    ## ## estimate mean only using observation with prime indices
    ## estMeanPrimes = function (x) {
    ##   n = length(x)
    ##   ind = sapply(1:n, isPrime)
    ##   return (mean(x[ind]))
    ## }
    ## 
    ## # simulate data
    ## x = rnorm(n)
    ## 
    ## # estimate mean
    ## estMeanPrimes(x)

  • Call runSim.R with sample size n=100:

    R CMD BATCH '--args n=100' runSim.R

    or

    Rscript runSim.R n=100
    ## [1] 0.3949935

Run long jobs

  • Many statistical computing tasks take long: simulation, MCMC, etc.

nohup

  • nohup command in Linux runs program(s) immune to hangups and writes output to nohup.out by default. Logging out will not kill the process; we can log in later to check status and results.

  • nohup is a POSIX standard and is thus available on Linux and macOS.

  • Run runSim.R in the background and write output to nohup.out:

    nohup Rscript runSim.R n=100 &
    ## [1] 0.2775906
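When several jobs run at once it is convenient to give each its own log file instead of sharing nohup.out. The sketch below shows the pattern with a stand-in command (`sleep` plus `echo`) where `Rscript runSim.R n=100` would normally go; the log file name is made up:

```shell
workdir=$(mktemp -d)
log="$workdir/job.log"

# Stand-in for "Rscript runSim.R n=100"; any long-running command fits here.
# stdout and stderr go to the log; stdin is detached from the terminal.
nohup sh -c 'sleep 1; echo "job finished"' > "$log" 2>&1 < /dev/null &
pid=$!                        # PID of the background job, useful for ps or kill later

wait "$pid"                   # in real use you would log out and check back later
result=$(grep 'job finished' "$log")
echo "$result"
rm -rf "$workdir"
```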

screen

  • screen is another popular utility, but not installed by default.

  • Typical workflow using screen.

    1. Access remote server using ssh.

    2. Start jobs in batch mode.

    3. Detach jobs.

    4. Exit from server, wait for jobs to finish.

    5. Access remote server using ssh.

    6. Re-attach jobs, check on progress, get results, etc.

Use R to call R

R in conjunction with nohup or screen can be used to orchestrate a large simulation study.

  • It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.

  • We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.

  • Python in many ways makes a better glue; we may discuss this later in the course.

  • Suppose we have
    • runSim.R which runs a simulation based on command line argument n.
    • A large collection of n values that we want to use in our simulation study.
    • Access to a server with 128 cores.
  • Option 1: manually call runSim.R for each setting.

  • Option 2: automate calls using R and nohup. autoSim.R

  • cat autoSim.R
    ## # autoSim.R
    ## 
    ## nVals = seq(100, 500, by=100)
    ## for (n in nVals) {
    ##   oFile = paste("n", n, ".txt", sep="")
    ##   arg = paste("n=", n, sep="")
    ##   sysCall = paste("nohup Rscript runSim.R ", arg, " > ", oFile)
    ##   system(sysCall)
    ##   print(paste("sysCall=", sysCall, sep=""))
    ## }
  • Rscript autoSim.R
    ## [1] "sysCall=nohup Rscript runSim.R  n=100  >  n100.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=200  >  n200.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=300  >  n300.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=400  >  n400.txt"
    ## [1] "sysCall=nohup Rscript runSim.R  n=500  >  n500.txt"
  • Now we just need to write a script to collect results from the output files.
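One possible shape for such a collection script is sketched below in shell. It assumes each output file is named `n<size>.txt` and contains a line of the form `[1] <estimate>`, as printed by `Rscript runSim.R`. The function name `collect_results` is hypothetical, and the two sample files with their values are made up so that the sketch runs without R:

```shell
workdir=$(mktemp -d)
# Stand-in output files; the numbers are made up for illustration.
printf '[1] 0.12\n'  > "$workdir/n100.txt"
printf '[1] -0.05\n' > "$workdir/n200.txt"

# collect_results DIR: print one "n,estimate" line per n*.txt file in DIR.
collect_results() {
  for f in "$1"/n*.txt; do
    n=$(basename "$f" .txt)       # e.g. n100
    n=${n#n}                      # strip the leading "n", leaving 100
    # Take the second field of the first line that starts with "[1]".
    est=$(awk '$1 == "[1]" { print $2; exit }' "$f")
    printf '%s,%s\n' "$n" "$est"
  done
}

summary=$(collect_results "$workdir")
echo "$summary"
rm -rf "$workdir"
```

Redirecting the output, e.g. `collect_results . > results.csv`, gives a file that can be read back into R with `read.csv(..., header = FALSE)`.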