Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting.
Advantages over using copy-and-paste:
You can give a function an evocative name that makes your code easier to understand.
As requirements change, you only need to update code in one place, instead of many.
You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).
Consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code).
df <- tibble::tibble(
a = rnorm(10),
b = rnorm(10),
c = rnorm(10),
d = rnorm(10)
)
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
(max(df$b, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
(max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
(max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))
Find an error.
To write a function, first analyse the code. How many inputs does this expression have?
(df$a - min(df$a, na.rm = TRUE)) /
(max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
Rewrite the code using temporary variables with general names:
x <- df$a
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
## [1] 1.00000000 0.51396385 0.69103229 0.25766928 0.45816223 0.65947996
## [7] 0.41492718 0.06503273 0.35748915 0.00000000
Reduce duplication:
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
## [1] 1.00000000 0.51396385 0.69103229 0.25766928 0.45816223 0.65947996
## [7] 0.41492718 0.06503273 0.35748915 0.00000000
Turn it into a function:
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(c(0, 5, 10))
## [1] 0.0 0.5 1.0
Three components of a function:
Name: here I’ve used rescale01 because this function rescales a vector to lie between 0 and 1.
Arguments: the list of inputs to the function. If we had more than one argument, the call would look like function(x, y, z).
Body: the code chunk inside the { block that immediately follows function(...).
Start with working code and turn it into a function; it’s harder to create a function and then try to make it work.
Check your function with a few different inputs:
rescale01(c(-10, 0, 10))
## [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
## [1] 0.00 0.25 0.50 NA 1.00
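These informal checks can also be written as assertions that fail loudly if the function ever changes behaviour. A minimal sketch using base R's stopifnot() and all.equal(); the choice of test inputs is only illustrative:

```r
# rescale01() as defined above
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Every expression must be TRUE, otherwise stopifnot() raises an error
stopifnot(
  isTRUE(all.equal(rescale01(c(0, 5, 10)), c(0, 0.5, 1))),
  isTRUE(all.equal(rescale01(c(-10, 0, 10)), c(0, 0.5, 1))),
  # A missing input should stay NA without affecting the other values
  isTRUE(all.equal(rescale01(c(1, 2, 3, NA, 5)), c(0, 0.25, 0.5, NA, 1)))
)
```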
Original example with a function:
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
A bit of duplication still remains; we’ll learn how to eliminate it when we cover iteration.
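As a preview, one possible way to remove the repeated calls is base R's lapply(). This is only a sketch with a small made-up data frame; the material on iteration may use different tools:

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Small illustrative data frame (not the df from the example above)
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 40))

# Apply rescale01() to every column; assigning into df[] keeps the data frame structure
df[] <- lapply(df, rescale01)
df
```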
If our requirements change, we only need to make the change in one place:
x <- c(1:10, Inf)
rescale01(x)
## [1] 0 0 0 0 0 0 0 0 0 0 NaN
Fix just in one place:
rescale01 <- function(x) {
rng <- range(x, na.rm = TRUE, finite = TRUE)
(x - rng[1]) / (rng[2] - rng[1])
}
rescale01(x)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000 Inf
The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.
Functions are not just for the computer; they are also for humans. Writing functions that humans can understand matters both for future-you and for communicating with others.
Ideally, the name of your function will be short, but clearly evoke what the function does. It’s better to be clear than short.
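Generally, function names should be verbs, and arguments should be nouns. Exceptions:
Nouns are fine when the function computes a very well-known noun (mean() is better than compute_mean()).
Nouns are also fine when the function accesses a property of an object (coef() is better than get_coefficients()).
# Too short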
f()
# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()
Multiple words: if your function name is composed of multiple words, I recommend using snake_case. camelCase is a popular alternative; whichever you choose, be consistent.
# Never do this!
col_mins <- function(x, y) {}
rowMaxes <- function(y, x) {}
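Family of functions: if several functions do similar things, give them consistent names and arguments, with a common prefix rather than a common suffix.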
# Good
input_select()
input_checkbox()
input_text()
# Not so good
select_input()
checkbox_input()
text_input()
stringr names its functions with a common prefix in the same way (str_*()).
Avoid overriding existing functions and variables.
# Don't do this!
T <- FALSE
c <- 10
mean <- function(x) sum(x)
if-else
if (condition) {
# code executed when condition is TRUE
} else {
# code executed when condition is FALSE
}
A function that returns a logical vector describing whether or not each element of a vector is named:
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}
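A few calls show how the two branches behave. The inputs are made up for illustration:

```r
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    rep(FALSE, length(x))
  } else {
    !is.na(nms) & nms != ""
  }
}

has_name(c(a = 1, b = 2))  # TRUE TRUE: both elements are named
has_name(c(a = 1, 2))      # TRUE FALSE: the second element has the empty name ""
has_name(1:3)              # FALSE FALSE FALSE: names(x) is NULL, so the first branch runs
```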
Standard return rule: an R function returns the last value that it computed.
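Never do the following. The condition must evaluate to a single TRUE or FALSE; since R 4.2.0, a condition of length greater than one is an error rather than the warning shown here: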
if (c(TRUE, FALSE)) {}
## Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
## first element will be used
## NULL
if (NA) {}
## Error in if (NA) {: missing value where TRUE/FALSE needed
Use || (or) and && (and) to combine multiple logical expressions.
Short-circuiting: as soon as || sees the first TRUE, it returns TRUE without computing anything else. As soon as && sees the first FALSE, it returns FALSE.
Never use | or & in an if statement!
c(TRUE, FALSE, FALSE) | c(FALSE, TRUE, FALSE)
## [1] TRUE TRUE FALSE
These are vectorised operations that apply to multiple values (that’s why you use them in filter()).
If you do have a logical vector, use any() or all() to collapse it to a single value.
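A short sketch of collapsing a logical vector before handing it to if. The vector x is an arbitrary example:

```r
x <- c(3, -1, 7)

# x < 0 is a logical vector of length 3, so collapse it with any() before using it in if
if (any(x < 0)) {
  message("at least one value is negative")
}

# all() is TRUE only when every element is TRUE
if (all(x > 0)) {
  message("all values are positive")
} else {
  message("not all values are positive")
}
```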
== is also vectorised: either check the length is already 1, collapse with all() or any(), or use the non-vectorised identical():
identical(0L, 0) # very strict
## [1] FALSE
x <- sqrt(2) ^ 2
x
## [1] 2
x == 2
## [1] FALSE
x - 2
## [1] 4.440892e-16
Because of floating-point rounding, use dplyr::near() instead of == for such comparisons, as described in Lecture 3.
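If dplyr is not loaded, the same idea can be sketched in base R. The tolerance of 1e-8 is an arbitrary assumption chosen for illustration:

```r
x <- sqrt(2) ^ 2

x == 2                    # FALSE: exact comparison fails
abs(x - 2) < 1e-8         # TRUE: equal within a small, explicitly chosen tolerance
isTRUE(all.equal(x, 2))   # TRUE: all.equal() compares with a default tolerance
```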
You can chain multiple if statements together:
if (this) {
# do that
} else if (that) {
# do something else
} else {
#
}
Alternative: switch()
function(x, y, op) {
switch(op,
plus = x + y,
minus = x - y,
times = x * y,
divide = x / y,
stop("Unknown op!")
)
}
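To try this out, the function needs a name. In the sketch below, calc is a hypothetical name chosen for illustration; the body is unchanged:

```r
# `calc` is a hypothetical name; the function above is anonymous
calc <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}

calc(6, 3, "plus")    # 9
calc(6, 3, "divide")  # 2

# An unmatched op falls through to the unnamed last argument, which raises the error
tryCatch(calc(6, 3, "power"), error = function(e) conditionMessage(e))
```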
# Good
if (y < 0 && debug) {
  message("Y is negative")
}
if (y == 0) {
  log(x)
} else {
  y ^ x
}
# Bad
if (y < 0 && debug)
message("Y is negative")
if (y == 0) {
  log(x)
}
else {
  y ^ x
}
It’s ok to drop the curly braces if you have a very short if statement that can fit on one line:
y <- 10
x <- if (y < 20) "Too low" else "Too high"
Two sets of arguments: the data to compute on, and the details that control the computation.
Examples:
In log(), the data is x, and the detail is the base of the logarithm.
In mean(), the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).
In t.test(), the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.
In str_c() you can supply any number of strings to ..., and the details of the concatenation are controlled by sep and collapse.
Detail arguments should go on the end, and usually should have default values.
# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
se <- sd(x) / sqrt(length(x))
alpha <- 1 - conf
mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
x <- runif(100)
mean_ci(x)
## [1] 0.4334768 0.5429520
mean_ci(x, conf = 0.99)
## [1] 0.4162770 0.5601518
For safety, na.rm should default to na.rm = FALSE. It’s a bad idea to silently ignore missing values by default.
When you call a function, you typically omit the names of the data arguments.
If you override the default value of a detail argument, you should use the full name:
# Good
mean(1:10, na.rm = TRUE)
# Bad
mean(x = 1:10, , FALSE)
mean(, TRUE, x = c(1:10, NA))
Place spaces around = in function calls, and always put a space after a comma, not before (just like in regular English).
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
Common argument names:
x, y, z: vectors.
w: a vector of weights.
df: a data frame.
i, j: numeric indices (typically rows and columns).
n: length, or number of rows.
p: number of columns.
wt_mean <- function(x, w) {
sum(x * w) / sum(w)
}
wt_mean(1:6, 1:3) # why this result?
## [1] 7.666667
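The result comes from R's vector recycling: the shorter vector is silently repeated to match the longer one. The sketch below makes that recycling explicit (w_recycled is a name introduced here for illustration):

```r
x <- 1:6
w <- 1:3

# In x * w, R reuses w as 1, 2, 3, 1, 2, 3 without any warning
w_recycled <- rep(w, length.out = length(x))

# The numerator uses the recycled weights, but sum(w) still sums only 1, 2, 3
sum(x * w_recycled) / sum(w)   # 46 / 6, the same 7.666667 as above
```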
It’s good practice to check important preconditions, and throw an error (with stop()) if they are not true:
wt_mean <- function(x, w) {
if (length(x) != length(w)) {
stop("`x` and `w` must be the same length", call. = FALSE)
}
sum(w * x) / sum(w)
}
Or use stopifnot():
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
## Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE
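A few more calls exercise the other paths through this version. The inputs are made up for illustration:

```r
wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

wt_mean(c(1, 2, NA), c(1, 1, 1))                # NA: missing values propagate by default
wt_mean(c(1, 2, NA), c(1, 1, 1), na.rm = TRUE)  # 1.5: the incomplete pair is dropped

# Mismatched lengths are now rejected instead of being silently recycled
tryCatch(wt_mean(1:6, 1:3), error = function(e) "lengths differ")
```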
Many functions in R take an arbitrary number of inputs:
sum(1, 2, 3)
## [1] 6
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 55
They rely on a special argument: ..., which captures any number of arguments that aren’t otherwise matched.
You can then send those ... on to another function:
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
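The same forwarding pattern works without stringr, and it has one pitfall worth knowing: a misspelled argument name is silently absorbed by `...`. In this sketch, commas_base is a hypothetical base R stand-in that uses paste() instead of str_c():

```r
# Base R stand-in for the stringr version above (hypothetical name)
commas_base <- function(...) paste(..., collapse = ", ")
commas_base(letters[1:3])   # "a, b, c"

# Pitfall: na.mr is a typo for na.rm, so it is captured by ... as one more value
x <- c(1, 2)
sum(x, na.mr = TRUE)        # 4: TRUE is added as 1 and no error is raised
```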
The value returned by the function is usually the last statement it evaluates.
You may also choose to return early by using return().
complicated_function <- function(x, y, z) {
if (length(x) == 0 || length(y) == 0) {
return(0) # return early
}
# Complicated code here
}
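A runnable version of the same pattern, with a deliberately simple body standing in for the complicated code. The function mean_ratio is hypothetical and not part of the lecture:

```r
# `mean_ratio` is a hypothetical example
mean_ratio <- function(x, y) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)  # return early: nothing to compute
  }
  mean(x) / mean(y)
}

mean_ratio(numeric(0), 1:3)   # 0, via the early return
mean_ratio(c(2, 4), c(1, 2))  # 2, the value of the last evaluated expression
```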
Another use for an early return is an if statement with one complex block and one simple block. Instead of
f <- function() {
if (x) {
# Do
# something
# that
# takes
# many
# lines
# to
# express
} else {
# return something short
}
}
do this:
f <- function() {
if (!x) {
return(something_short)
}
# Do
# something
# that
# takes
# many
# lines
# to
# express
}
A pipeable function should return a data frame. There are two basic types:
Transformations: an object is passed to the function’s first argument and a modified object is returned.
Side-effects: the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file.
Side-effect functions should “invisibly” return the first argument, so that while it is not printed it can still be used in a pipeline:
show_missings <- function(df) {
n <- sum(is.na(df))
cat("Missing values: ", n, "\n", sep = "")
invisible(df)
}
show_missings(mtcars)
## Missing values: 0
x <- show_missings(mtcars)
## Missing values: 0
class(x)
## [1] "data.frame"
dim(x)
## [1] 32 11
We can still use it in a pipe:
mtcars %>%
show_missings() %>%
mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
show_missings()
## Missing values: 0
## Missing values: 18
The environment of a function controls how R finds the value associated with a name:
f <- function(x) {
x + y
}
In many programming languages, this would be an error.
This is valid R code because R uses lexical scoping to find the value associated with a name.
Since y is not defined inside the function, R will look in the environment where the function was defined:
y <- 100
f(10)
## [1] 110
y <- 1000
f(10)
## [1] 1010
You should avoid creating functions like this deliberately!
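One way to avoid the hidden dependency is to pass the value in as an argument. A minimal sketch, where f2 is a hypothetical name:

```r
# Pass y explicitly so the result depends only on the arguments
f2 <- function(x, y) {
  x + y
}

y <- 1000      # a global y no longer affects the result
f2(10, 100)    # 110
```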
The advantage of this behaviour is that you can do many things that you can’t do in other programming languages:
`+` <- function(x, y) { # override `+`
if (runif(1) < 0.1) {
sum(x, y)
} else {
sum(x, y) * 1.1
}
}
table(replicate(1000, 1 + 2))
##
## 3 3.3
## 89 911
rm(`+`) # remove overridden `+`
For more about functions and environments, read Ch. 7 of The Art of R Programming.
Comments
Use lines starting with # to explain the “why” of your code.
Use long lines of - and = to break up your file into easily readable chunks.
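A small sketch of both conventions in a script. The section titles and the analysis itself are made up for illustration:

```r
# Prepare data -----------------------------------

# Drop incomplete rows up front, because lm() would otherwise
# discard them silently and change the sample size.
cars_complete <- mtcars[complete.cases(mtcars), ]

# Fit model ======================================

fit <- lm(mpg ~ wt, data = cars_complete)
```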