Programming

Functions

Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.

Advantages over using copy-and-paste:

  1. You can give a function an evocative name that makes your code easier to understand.

  2. As requirements change, you only need to update code in one place, instead of many.

  3. You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

When should you write a function?

Whenever you’ve copied and pasted a block of code more than twice.

df <- tibble::tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

df$a <- (df$a - min(df$a, na.rm = TRUE)) / 
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / 
  (max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / 
  (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / 
  (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

Find an error.

How many inputs does this code have?

(df$a - min(df$a, na.rm = TRUE)) /
  (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))

Rewrite the code using temporary variables with general names:

x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
##  [1] 1.00000000 0.51396385 0.69103229 0.25766928 0.45816223 0.65947996
##  [7] 0.41492718 0.06503273 0.35748915 0.00000000

Reduce duplication:

rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
##  [1] 1.00000000 0.51396385 0.69103229 0.25766928 0.45816223 0.65947996
##  [7] 0.41492718 0.06503273 0.35748915 0.00000000

Turn it into a function:

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
## [1] 0.0 0.5 1.0

Three components of a function:

  1. Name: here I’ve used rescale01 because this function rescales a vector to lie between 0 and 1.

  2. Arguments: list of inputs to the function. If we had more than one argument, the call would look like function(x, y, z).

  3. Body: code chunk inside the { block that immediately follows function(...).

Start with working code and turn it into a function; it’s harder to create a function and then try to make it work.

Check your function with a few different inputs:

rescale01(c(-10, 0, 10))
## [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
## [1] 0.00 0.25 0.50   NA 1.00

Original example with a function:

df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)

Some duplication still remains; we’ll learn how to eliminate it when we cover iteration.

If our requirements change, we only need to make the change in one place:

x <- c(1:10, Inf)
rescale01(x)
##  [1]   0   0   0   0   0   0   0   0   0   0 NaN

Fix just in one place:

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000       Inf

DRY (Don’t Repeat Yourself) principle

The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.

Functions are for humans and computers

Functions are not just for the computer, but also for humans. Writing functions that humans can understand matters both for yourself (future-you) and for communicating with others.

Names

  1. Ideally, the name of your function will be short, but clearly evoke what the function does. It’s better to be clear than short.

  2. Generally, function names should be verbs, and arguments should be nouns. Exceptions:

    • Nouns are ok if the function computes a very well known noun (e.g. mean() is better than compute_mean()).
    • Accessing some property of an object (e.g. coef() is better than get_coefficients()).
    # Too short
    f()
    
    # Not a verb, or descriptive
    my_awesome_function()
    
    # Long, but clear
    impute_missing()
    collapse_years()
  3. Multiple words: if your function name is composed of multiple words, I recommend using one of:

    • snake_case()
    • camelCase()
    • Pick one or the other and stick with it.
    # Never do this!
    col_mins <- function(x, y) {}
    rowMaxes <- function(y, x) {}
  4. Family of functions

    • Make sure they have consistent names and arguments.
    • Use a common prefix to indicate that they are connected.
    # Good
    input_select()
    input_checkbox()
    input_text()
    
    # Not so good
    select_input()
    checkbox_input()
    text_input()
    • Example: stringr package (str_*())
  5. Avoid overriding existing functions and variables.

    # Don't do this!
    T <- FALSE
    c <- 10
    mean <- function(x) sum(x)

Comments

  • Lines starting with #, to explain the “why” of your code.

  • Long lines of - and = to break up your file into easily readable chunks.

    # Load data --------------------------------------
    
    # Plot data --------------------------------------

Conditional execution: if-else

if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}

A function that returns a logical vector describing whether or not each element of a vector is named:

has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}

Standard return rule: an R function returns the last value that it computed.
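
As a quick check of has_name(), the sketch below calls it on a partially named vector and on an unnamed one. The input vectors are made up for illustration; the function is repeated so that the example runs on its own.

```r
# has_name() as defined above
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}

has_name(c(a = 1, 2, b = 3))  # the middle element has an empty name
## [1]  TRUE FALSE  TRUE
has_name(1:3)                 # no names attribute at all
## [1] FALSE FALSE FALSE
```

In both calls the value of the last expression evaluated inside the if-else is what the function returns.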

Conditions

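Never do the following. The condition must evaluate to a single TRUE or FALSE; note that since R 4.2.0 a condition of length greater than one is an error rather than the warning shown here: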

if (c(TRUE, FALSE)) {}
## Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
## first element will be used
## NULL
if (NA) {}
## Error in if (NA) {: missing value where TRUE/FALSE needed

Combining multiple logical expressions

  • Use || (or) and && (and) to combine multiple logical expressions.

  • Short-circuiting

    • as soon as || sees the first TRUE it returns TRUE without computing anything else.
    • as soon as && sees the first FALSE it returns FALSE.
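
A small sketch of short-circuiting, using stop() on the right-hand side to show that it is never evaluated. The variable x and the message text are illustrative only.

```r
# The right-hand side is never evaluated, so stop() is not triggered
TRUE || stop("not evaluated")
## [1] TRUE
FALSE && stop("not evaluated")
## [1] FALSE

# A common use: guard a comparison that would not make sense for NULL
x <- NULL
if (is.null(x) || x > 0) {
  message("x is missing or positive")
}
```

Because is.null(x) is TRUE, the comparison x > 0 is skipped entirely.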

Never use | or & in an if statement!

c(TRUE, FALSE, FALSE) | c(FALSE, TRUE, FALSE)
## [1]  TRUE  TRUE FALSE
  • These are vectorised operations that apply to multiple values (that’s why you use them in filter()).

  • If you do have a logical vector, use any() or all() to collapse it to a single value.

  • == is also vectorised: either check the length is already 1, collapse with all() or any(), or use the non-vectorised identical():

    identical(0L, 0) # very strict
    ## [1] FALSE
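
The following sketch shows how any() and all() collapse a logical vector to a single value before it reaches if(). The vector x is an invented example.

```r
x <- c(3, -1, 4)

# x < 0 is a logical vector of length 3, so collapse it before using it in if()
if (any(x < 0)) {
  message("x contains at least one negative value")
}

all(x > 0)
## [1] FALSE
any(x < 0)
## [1] TRUE

# identical() always returns a single TRUE or FALSE, but it is strict about type
identical(c(1, 2), c(1, 2))
## [1] TRUE
identical(1L, 1)
## [1] FALSE
```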

Floating point numbers

x <- sqrt(2) ^ 2
x
## [1] 2
x == 2
## [1] FALSE
x - 2
## [1] 4.440892e-16

Instead use dplyr::near() for comparisons, as described in Lecture 3.
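
The idea behind a tolerance-based comparison can be sketched in base R. The helper is_close() and its default tolerance are assumptions made for this example; they are not part of dplyr.

```r
# Hypothetical helper: TRUE where x and y differ by less than tol
is_close <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

x <- sqrt(2) ^ 2
x == 2          # exact comparison fails because of rounding error
## [1] FALSE
is_close(x, 2)  # the difference (about 4.4e-16) is far below the tolerance
## [1] TRUE
```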

Multiple conditions

You can chain multiple if statements together:

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # 
}

Alternative: switch()

function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}
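
To try the switch() version, the function needs a name. The name calc below is hypothetical, chosen only for this sketch.

```r
# `calc` is a hypothetical name for the function shown above
calc <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}

calc(6, 3, "plus")
## [1] 9
calc(6, 3, "divide")
## [1] 2

# An unmatched op falls through to the unnamed last argument, which raises the error
tryCatch(calc(6, 3, "power"), error = function(e) conditionMessage(e))
## [1] "Unknown op!"
```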

Code style

# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line:

y <- 10
x <- if (y < 20) "Too low" else "Too high"

Function arguments

Two sets of arguments:

  1. one set supplies the data to compute on (should come first).
  2. the other set supplies arguments that control the details of the computation.

Examples

  • In log(), the data is x, and the detail is the base of the logarithm.
  • In mean(), the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).
  • In t.test(), the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.
  • In str_c() you can supply any number of strings to ..., and the details of the concatenation are controlled by sep and collapse.
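
The split between data and detail arguments can be seen in a couple of base R calls. The input values are invented for illustration.

```r
# log(): x is the data, base is the detail
log(8, base = 2)
## [1] 3

# mean(): x is the data; trim is a detail
x <- c(1:9, 100)
mean(x)
## [1] 14.5
mean(x, trim = 0.1)  # drops the lowest and highest 10% of values before averaging
## [1] 5.5
```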

Default values

Detail arguments should go on the end, and usually should have default values.

# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)
## [1] 0.4334768 0.5429520
mean_ci(x, conf = 0.99)
## [1] 0.4162770 0.5601518
  • Default values should almost always be the most common values.
  • Exception: na.rm = FALSE. It’s a bad idea to silently ignore missing values by default.
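
A short sketch of why na.rm = FALSE is a sensible default, using an invented vector with one missing value:

```r
x <- c(1, NA, 3)

mean(x)                # the default na.rm = FALSE keeps the missing value visible
## [1] NA
mean(x, na.rm = TRUE)  # dropping missing values is an explicit choice
## [1] 2
```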

Conventions

  1. Typically omit the names of the data arguments, because they are used so commonly.

  2. If you override the default value of a detail argument, you should use the full name:

    # Good
    mean(1:10, na.rm = TRUE)
    
    # Bad
    mean(x = 1:10, , FALSE)
    mean(, TRUE, x = c(1:10, NA))
  3. Place spaces around = in function calls, and always put a space after a comma, not before (just like in regular English).

    # Good
    average <- mean(feet / 12 + inches, na.rm = TRUE)
    
    # Bad
    average<-mean(feet/12+inches,na.rm=TRUE)

Common names

  • x, y, z: vectors.
  • w: a vector of weights.
  • df: a data frame.
  • i, j: numeric indices (typically rows and columns).
  • n: length, or number of rows.
  • p: number of columns.

Checking values

wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
wt_mean(1:6, 1:3)  # why this result?
## [1] 7.666667

It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true:

wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}
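
The sketch below exercises the length check. It repeats the function so the example is self-contained, and uses tryCatch() to capture the error message instead of halting; the inputs are illustrative.

```r
wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}

wt_mean(1:3, c(1, 1, 2))  # (1*1 + 1*2 + 2*3) / 4
## [1] 2.25

# Mismatched lengths now fail loudly instead of silently recycling `w`
tryCatch(wt_mean(1:6, 1:3), error = function(e) conditionMessage(e))
## [1] "`x` and `w` must be the same length"
```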

or stopifnot():

wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
## Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE
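
For comparison, here is a sketch of the same function on valid input, showing what the na.rm branch does. The function is repeated so the block runs on its own, and the input vectors are made up.

```r
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))

  if (na.rm) {
    # Drop every position where either the value or its weight is missing
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

wt_mean(c(1, 2, NA), c(1, 1, 1))
## [1] NA
wt_mean(c(1, 2, NA), c(1, 1, 1), na.rm = TRUE)
## [1] 1.5
```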

Dot-dot-dot (…)

Many functions in R take an arbitrary number of inputs:

sum(1, 2, 3)
## [1] 6
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 55

They rely on a special argument: ..., which captures any number of arguments that aren’t otherwise matched.

You can then send those ... on to another function:

commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
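
A further sketch of forwarding ... in base R, together with one pitfall. The wrapper my_mean() is a hypothetical name used only for this example.

```r
# Hypothetical wrapper: everything after x is forwarded to mean() unchanged
my_mean <- function(x, ...) {
  mean(x, ...)
}

my_mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2

# The price of `...`: a misspelled argument is silently absorbed.
# Here na.mr is treated as one more value to add (TRUE counts as 1).
sum(c(1, 2), na.mr = TRUE)
## [1] 4
```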

Return values

Explicit return statements

The value returned by the function is usually the last statement it evaluates.

You may also choose to return early by using return().

complicated_function <- function(x, y, z) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)  # return early
  }
    
  # Complicated code here
}
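
A runnable sketch of the same early-return pattern. The function mean_ratio() and its body are hypothetical; they stand in for the "complicated code".

```r
# Hypothetical example: mean of x / y, with an early return for empty input
mean_ratio <- function(x, y) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)  # return early
  }

  mean(x / y)
}

mean_ratio(numeric(0), 1:3)
## [1] 0
mean_ratio(c(2, 4), c(1, 2))
## [1] 2
```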

Complex if statements. When one branch is long and the other is short, instead of

f <- function() {
  if (x) {
    # Do 
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}

do this:

f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}

Writing pipeable functions

A pipeable function should return an object that the next step in the pipeline can use; for dplyr and tidyr pipelines, that object is a data frame. There are two basic types of pipeable functions:

  1. Transformations: an object is passed to the function’s first argument and a modified object is returned.

  2. Side-effects: the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file.

Side-effect functions should “invisibly” return the first argument, so that while it is not printed it can still be used in a pipeline:

show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  
  invisible(df)
}
show_missings(mtcars)
## Missing values: 0
x <- show_missings(mtcars) 
## Missing values: 0
class(x)
## [1] "data.frame"
dim(x)
## [1] 32 11

We can still use it in a pipe:

library(dplyr)  # provides %>% and mutate()

mtcars %>%
  show_missings() %>%
  mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
  show_missings()
## Missing values: 0
## Missing values: 18

Environment

The environment of a function controls how R finds the value associated with a name:

f <- function(x) {
  x + y
} 
  • In many programming languages, this would be an error.

  • This is valid R code because R uses lexical scoping to find the value associated with a name.

  • Since y is not defined inside the function, R will look in the environment where the function was defined:

    y <- 100
    f(10)
    ## [1] 110
    y <- 1000
    f(10)
    ## [1] 1010
  • You should avoid creating functions like this deliberately!

  • The advantage of this behaviour is that you can do many things that you can’t do in other programming languages:

    `+` <- function(x, y) {  # override `+`
      if (runif(1) < 0.1) {
        sum(x, y)
      } else {
        sum(x, y) * 1.1
      }
    }
    table(replicate(1000, 1 + 2))
    ## 
    ##   3 3.3 
    ##  89 911
    rm(`+`)  # remove overridden `+`
  • For more about functions and environments, read Ch. 7 of The Art of R Programming.
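
Returning to f() above, one way to avoid depending on a variable defined outside the function is to pass it as an argument. The function g() is a hypothetical alternative written for this sketch.

```r
y <- 100

# Relies on a y defined outside the function
f <- function(x) {
  x + y
}
f(10)
## [1] 110

# Hypothetical alternative: pass y explicitly, so the result
# depends only on the arguments
g <- function(x, y) {
  x + y
}
g(10, 1000)
## [1] 1010
```

With g(), changing the outer y no longer changes the result of a call.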