Vectors are the objects that underlie tibbles and data frames.
Atomic vectors have 6 types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors.
Lists are sometimes called recursive vectors, because lists can contain other lists.
Atomic vectors are homogeneous, while lists can be heterogeneous.
`NULL` is often used to represent the absence of a vector.
`NA` is used to represent the absence of a value in a vector.
`NULL` typically behaves like a vector of length 0.
Type:
typeof(letters)
## [1] "character"
typeof(1:10)
## [1] "integer"
Length:
x <- list("a", "b", 1:10)
length(x)
## [1] 3
Logical vectors take only three possible values: `FALSE`, `TRUE`, and `NA`.
Usually constructed with comparison operators (see Lecture 3):
1:10 %% 3 == 0
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
Manual creation with `c()`:
c(TRUE, TRUE, FALSE, NA)
## [1] TRUE TRUE FALSE NA
In R, numbers are doubles by default.
typeof(1)
## [1] "double"
To make an integer, place an `L` after the number:
typeof(1L)
## [1] "integer"
1.5L # no effect: not a whole number, so R warns and keeps a double
## [1] 1.5
```r
x <- sqrt(2) ^ 2 # recurring example
x
```
```
## [1] 2
```
```r
x - 2
```
```
## [1] 4.440892e-16
```
Doubles are floating point approximations, so `x` is not exactly 2. Instead of comparing floating point numbers with `==`, use `dplyr::near()`, which allows for some numerical tolerance.
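The tolerance check can be sketched in base R as follows. The helper `approx_equal()` is a hypothetical name introduced here, and its default tolerance (the square root of machine epsilon) is an assumption chosen to resemble what `dplyr::near()` does; check the dplyr documentation before relying on the exact value.

```r
# Hypothetical helper: TRUE where x and y differ by less than `tol`.
approx_equal <- function(x, y, tol = sqrt(.Machine$double.eps)) {
  abs(x - y) < tol
}

x <- sqrt(2) ^ 2
exact  <- x == 2              # FALSE: the two doubles differ in the last bits
approx <- approx_equal(x, 2)  # TRUE: the difference is far below the tolerance
```

The tolerance is an argument so that it can be loosened for quantities that accumulate more rounding error.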
Special values: `NA`, `NaN`, `Inf`, and `-Inf`.
c(-1, 0, 1) / 0
## [1] -Inf NaN Inf
Again, avoid using `==` to check for these special values. Instead use `is.finite()`, `is.infinite()`, and `is.nan()`:
| | 0 | Inf | NA | NaN |
|---|---|---|---|---|
| `is.finite()` | O | | | |
| `is.infinite()` | | O | | |
| `is.na()` | | | O | O |
| `is.nan()` | | | | O |
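The table can be reproduced by applying each test to a vector of the four values. This is a small sketch; the names `special` and `checks` are introduced here for illustration.

```r
special <- c(0, Inf, NA, NaN)

# One row per test function, one column per special value
checks <- rbind(
  is.finite   = is.finite(special),
  is.infinite = is.infinite(special),
  is.na       = is.na(special),
  is.nan      = is.nan(special)
)
colnames(checks) <- c("0", "Inf", "NA", "NaN")
checks

# `==` cannot detect a missing value: the comparison itself is NA
NA == NA
```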
Each element of a character vector is a string, and a string can contain an arbitrary amount of data.
Each unique string is stored in memory only _once_.
Every use of the string points to that representation.
This reduces the amount of memory needed by duplicated strings.
x <- "This is a reasonably long string."
pryr::object_size(x)
## Registered S3 method overwritten by 'pryr':
## method from
## print.bytes Rcpp
## 152 B
y <- rep(x, 1000)
pryr::object_size(y)
## 8.14 kB
`y` doesn’t take up 1,000x as much memory as `x`, because each element of `y` is just a pointer to the same string!
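If pryr is not installed, a rough version of the same comparison can be made with `utils::object.size()`. This is a sketch under the assumption that base R's size accounting is close enough for the point being made; the byte counts it reports may differ from those of `pryr::object_size()`.

```r
x <- "This is a reasonably long string."
y <- rep(x, 1000)

size_x <- as.numeric(object.size(x))  # bytes for a vector holding one string
size_y <- as.numeric(object.size(y))  # bytes for 1,000 references to that string

# The ratio is far below 1000 because the string itself is counted once
size_y / size_x
```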
There are two ways to convert, or coerce, one type of vector to another:
Explicit coercion: call `as.logical()`, `as.integer()`, `as.double()`, `as.character()`, etc.
Implicit coercion: happens when you use a vector in a specific context that expects a certain type of vector. Examples:
From a logical vector to a numeric vector: `TRUE` is converted to 1 and `FALSE` is converted to 0:
x <- sample(20, 100, replace = TRUE)
y <- x > 10
sum(y) # how many are greater than 10?
## [1] 50
mean(y) # what proportion are greater than 10?
## [1] 0.5
Implicit coercion from integer to logical (0 is converted to `FALSE`, everything else to `TRUE`):
if (length(x)) {
# do something
}
Be explicit: use `length(x) > 0`.
When you create a vector containing multiple types with `c()`, the most complex type always wins:
typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 1.5))
## [1] "double"
typeof(c(1.5, "a"))
## [1] "character"
An atomic vector can**not** have a mix of different types!
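The two kinds of coercion can be put side by side in a short sketch. The variable names (`explicit`, `mixed`, `kept`) are introduced here for illustration only.

```r
# Explicit coercion: you ask for the conversion
explicit <- as.integer(c(TRUE, FALSE, TRUE))   # 1 0 1

# Implicit coercion inside c(): logical < integer < double < character
mixed <- c(TRUE, 2L, 3.5, "a")
typeof(mixed)   # "character": every element was converted to a string

# A list keeps each element's own type instead of coercing
kept <- list(TRUE, 2L, 3.5, "a")
vapply(kept, typeof, character(1))
```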
| | lgl | int | dbl | chr | list |
|---|---|---|---|---|---|
| `is_logical()` | O | | | | |
| `is_integer()` | | O | | | |
| `is_double()` | | | O | | |
| `is_numeric()` | | O | O | | |
| `is_character()` | | | | O | |
| `is_atomic()` | O | O | O | O | |
| `is_list()` | | | | | O |
| `is_vector()` | O | O | O | O | O |
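A similar grid can be computed rather than memorised. The sketch below uses the base R predicates (`is.logical()` and friends) as stand-ins for the underscore versions in the table; that substitution is an assumption made so the code runs without extra packages, and the base functions are not guaranteed to agree with the underscore versions in every edge case.

```r
examples <- list(lgl = TRUE, int = 1L, dbl = 1.5, chr = "a", list = list(1))

tests <- c("is.logical", "is.integer", "is.double", "is.numeric",
           "is.character", "is.atomic", "is.list", "is.vector")

# Rows are test functions, columns are the example values
result <- sapply(examples, function(v) {
  sapply(tests, function(f) do.call(f, list(v)))
})
result
```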
Vector recycling: implicit coercion of the length of vectors
Most intuitive: mixing a vector with a single number (a vector of length 1):
sample(10) + 100
## [1] 102 107 110 103 108 109 105 106 104 101
runif(10) > 0.5
## [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE
In R, basic mathematical operations work with vectors.
Unintuitive:
1:10 + 1:2
## [1] 2 4 4 6 6 8 8 10 10 12
Here, R expands the shorter vector to the same length as the longer one; this is called recycling.
When the length of the longer vector is not an integer multiple of the length of the shorter one, R still recycles but issues a warning:
1:10 + 1:3
## Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
## length
## [1] 2 4 6 5 7 9 8 10 12 11
The tidyverse way: if you do want to recycle, you’ll need to do it yourself with `rep()`:
tibble(x = 1:4, y = 1:2)
## Error: Tibble columns must have compatible sizes.
## * Size 4: Existing data.
## * Size 2: Column `y`.
## ℹ Only values of size one are recycled.
tibble(x = 1:4, y = rep(1:2, 2))
tibble(x = 1:4, y = rep(1:2, each = 2))
tibble(x = 1:4, y = 1) # this is allowed: a single value is recycled
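The two `rep()` calls above recycle in different ways, which a base R sketch makes visible. The names `short`, `implicit`, and `explicit` are introduced here for illustration.

```r
short <- 1:2

rep(short, times = 2)  # 1 2 1 2: the whole vector repeated
rep(short, each = 2)   # 1 1 2 2: each element repeated

# Spelling out the recycling gives the same result as the implicit version
implicit <- 1:10 + short
explicit <- 1:10 + rep(short, length.out = 10)
identical(implicit, explicit)
```

Writing the `rep()` call yourself documents which of the two patterns you intended.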
All types of vectors can be named:
c(x = 1, y = 2, z = 4)
## x y z
## 1 2 4
Or with `purrr::set_names()`:
set_names(1:3, c("a", "b", "c"))
## a b c
## 1 2 3
`[` is the subsetting function for vectors, e.g., `x[a]` (cf. `dplyr::filter()` for tibbles).
Subsetting with a numeric vector containing only integers: the integers must be all positive, all negative, or zero.
x <- c("one", "two", "three", "four", "five")
x[c(3, 2, 5)]
## [1] "three" "two" "five"
By repeating a position, you can actually make a longer output than input:
x[c(1, 1, 5, 5, 5, 2)]
## [1] "one" "one" "five" "five" "five" "two"
Negative values drop the elements at the specified positions:
x[c(-1, -3, -5)]
## [1] "two" "four"
It’s an error to mix positive and negative values:
x[c(1, -1)]
## Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts
Subsetting with zero:
x[0]
## character(0)
Subsetting with a logical vector keeps all values corresponding to `TRUE`:
x <- c(10, 3, NA, 5, 8, 1, NA)
# All non-missing values of x
x[!is.na(x)]
## [1] 10 3 5 8 1
# All even (or missing!) values of x
x[x %% 2 == 0]
## [1] 10 NA 8 NA
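If the missing values are not wanted, one option is to convert the logical index to positions with `which()`, which drops the `NA`s. This is a sketch of one approach, not the only one; the name `even` is introduced here.

```r
x <- c(10, 3, NA, 5, 8, 1, NA)

# An NA in the logical index yields an NA in the result
x[x %% 2 == 0]

# which() keeps only the positions that are TRUE
even <- x[which(x %% 2 == 0)]
even
```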
Subsetting a named vector:
x <- c(abc = 1, def = 2, xyz = 5)
x[c("xyz", "def")]
## xyz def
## 5 2
x[c("xyz", "def", "xyz")]
## xyz def xyz
## 5 2 5
Subsetting nothing: `x[]` returns the complete `x`. Useful when subsetting matrices:
x <- c(1, 2, 3)
x[]
## [1] 1 2 3
y <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)
y[1, ]
## [1] 1 3 5
y[, -1]
## [,1] [,2]
## [1,] 3 5
## [2,] 4 6
Lists (recursive vectors) are created with `list()`:
x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
str(x) # `str` for structure
## List of 3
## $ : num 1
## $ : num 2
## $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
## $ a: num 1
## $ b: num 2
## $ c: num 3
Unlike atomic vectors, a list can contain a mix of objects:
y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
## $ : chr "a"
## $ : int 1
## $ : num 1.5
## $ : logi TRUE
Lists can even contain other lists!
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
Visualisation:
Lists have rounded corners. Atomic vectors have square corners.
Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.
The orientation of the children (i.e. rows or columns) isn’t important, so I’ll pick a row or column orientation to either save space or illustrate an important property in the example.
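The same hierarchy can be inspected in text form with `str()`. The sketch below also defines `list_depth()`, a hypothetical helper written for this example, to count how many levels of list nesting each object has.

```r
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

str(x1)  # two atomic vectors inside one list
str(x2)  # two lists, each holding two length-1 vectors
str(x3)  # lists nested three levels deep

# Hypothetical helper: number of list levels, 0 for an atomic vector
list_depth <- function(x) {
  if (!is.list(x)) return(0L)
  if (length(x) == 0) return(1L)
  1L + max(vapply(x, list_depth, integer(1)))
}

c(list_depth(x1), list_depth(x2), list_depth(x3))
```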
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
`[` extracts a sub-list. The result will always be a list.
```r
str(a[1:2])
```
```
## List of 2
## $ a: int [1:3] 1 2 3
## $ b: chr "a string"
```
```r
str(a[4])
```
```
## List of 1
## $ d:List of 2
## ..$ : num -1
## ..$ : num -5
```
As with vectors, you can subset with a logical, integer, or character vector.
`[[` extracts a single component from a list. It removes a level of hierarchy from the list.
```r
str(a[[1]])
```
```
## int [1:3] 1 2 3
```
```r
str(a[[4]])
```
```
## List of 2
## $ : num -1
## $ : num -5
```
`$` is a shorthand for extracting named elements of a list. It works similarly to `[[` except that you don’t need to use quotes.
```r
a$a
```
```
## [1] 1 2 3
```
```r
a[["a"]]
```
```
## [1] 1 2 3
```
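`[` vs `[[`: `[` returns a smaller list, while `[[` returns the component itself, removing one level of hierarchy.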
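The difference between the two operators can be checked directly by looking at the type of what each returns. This is a minimal sketch that reuses the list `a` defined earlier.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

typeof(a["a"])    # "list": still a list, now of length 1
typeof(a[["a"]])  # "integer": the component itself

# Each [[ removes one level, so nested components need two steps
a[["d"]][[1]]
```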
Any vector can contain arbitrary additional metadata through its attributes. Attributes are like a named list of vectors that can be attached to any object.
x <- 1:10
attr(x, "greeting") # get an individual attribute
## NULL
attr(x, "greeting") <- "Hi!" # set an individual attribute
attr(x, "farewell") <- "Bye!" # set an individual attribute
attributes(x) # get all at once
## $greeting
## [1] "Hi!"
##
## $farewell
## [1] "Bye!"
Fundamental attributes:
Class controls how generic functions work
as.Date
## function (x, ...)
## UseMethod("as.Date")
## <bytecode: 0x7ff94bb22068>
## <environment: namespace:base>
The call to “UseMethod” means that this is a generic function, and it will call a specific method, a function, based on the class of the first argument.
All methods are functions; not all functions are methods.
List all the methods for a generic with `methods()`:
methods("as.Date")
## [1] as.Date.character as.Date.default as.Date.factor
## [4] as.Date.numeric as.Date.POSIXct as.Date.POSIXlt
## [7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
## see '?methods' for accessing help and source code
If `x` is a character vector, `as.Date()` will call `as.Date.character()`; if it’s a factor, it’ll call `as.Date.factor()`.
See the specific implementation of a method with `getS3method()`:
getS3method("as.Date", "default")
## function (x, ...)
## {
## if (inherits(x, "Date"))
## x
## else if (is.null(x))
## .Date(numeric())
## else if (is.logical(x) && all(is.na(x)))
## .Date(as.numeric(x))
## else stop(gettextf("do not know how to convert '%s' to class %s",
## deparse1(substitute(x)), dQuote("Date")), domain = NA)
## }
## <bytecode: 0x7ff94f00f7e8>
## <environment: namespace:base>
getS3method("as.Date", "numeric")
## function (x, origin, ...)
## {
## if (missing(origin)) {
## if (!length(x))
## return(.Date(numeric()))
## if (!any(is.finite(x)))
## return(.Date(x))
## stop("'origin' must be supplied")
## }
## as.Date(origin, ...) + x
## }
## <bytecode: 0x7ff94d04e6e0>
## <environment: namespace:base>
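The dispatch mechanism can be seen in miniature by defining a generic of your own. The generic `describe()` and its methods are hypothetical names made up for this sketch; only `UseMethod()` is part of base R.

```r
# A generic: it only decides which method to call
describe <- function(x, ...) UseMethod("describe")

# Methods are named generic.class
describe.character <- function(x, ...) "a character vector"
describe.factor    <- function(x, ...) "a factor"
describe.default   <- function(x, ...) "something else"  # fallback

describe("a")          # dispatches to describe.character
describe(factor("a"))  # dispatches to describe.factor
describe(1:3)          # no integer method, so describe.default is used
```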
The most important S3 generic is `print()`: it controls how the object is printed when you type its name at the console.
print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x7ff94ca45dc0>
## <environment: namespace:base>
methods("print") %>% head(50)
## [1] "print.acf"
## [2] "print.AES"
## [3] "print.all_vars"
## [4] "print.anova"
## [5] "print.ansi_string"
## [6] "print.ansi_style"
## [7] "print.any_vars"
## [8] "print.aov"
## [9] "print.aovlist"
## [10] "print.ar"
## [11] "print.Arima"
## [12] "print.arima0"
## [13] "print.AsIs"
## [14] "print.aspell"
## [15] "print.aspell_inspect_context"
## [16] "print.bibentry"
## [17] "print.Bibtex"
## [18] "print.boxx"
## [19] "print.browseVignettes"
## [20] "print.by"
## [21] "print.bytes"
## [22] "print.cache_info"
## [23] "print.cell_addr"
## [24] "print.cell_limits"
## [25] "print.changedFiles"
## [26] "print.check_code_usage_in_package"
## [27] "print.check_compiled_code"
## [28] "print.check_demo_index"
## [29] "print.check_depdef"
## [30] "print.check_details"
## [31] "print.check_details_changes"
## [32] "print.check_doi_db"
## [33] "print.check_dotInternal"
## [34] "print.check_make_vars"
## [35] "print.check_nonAPI_calls"
## [36] "print.check_package_code_assign_to_globalenv"
## [37] "print.check_package_code_attach"
## [38] "print.check_package_code_data_into_globalenv"
## [39] "print.check_package_code_startup_functions"
## [40] "print.check_package_code_syntax"
## [41] "print.check_package_code_unload_functions"
## [42] "print.check_package_compact_datasets"
## [43] "print.check_package_CRAN_incoming"
## [44] "print.check_package_datalist"
## [45] "print.check_package_datasets"
## [46] "print.check_package_depends"
## [47] "print.check_package_description"
## [48] "print.check_package_description_encoding"
## [49] "print.check_package_license"
## [50] "print.check_packages_in_dir"
Vectors with additional attributes:
Factors are built on top of integers, and have a levels attribute:
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
## [1] "integer"
attributes(x)
## $levels
## [1] "ab" "cd" "ef"
##
## $class
## [1] "factor"
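Because a factor is an integer vector plus a `levels` attribute, the codes and the labels can be separated and recombined. A short sketch, with `codes` and `labels` as names introduced here:

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

codes  <- as.integer(x)     # positions within the levels attribute
labels <- levels(x)[codes]  # the codes mapped back to their labels

# Levels that never occur are still part of the factor
table(x)
```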
Dates in R are numeric vectors that represent the number of days since 1 January 1970:
x <- as.Date("1971-01-01")
unclass(x)
## [1] 365
typeof(x)
## [1] "double"
attributes(x)
## $class
## [1] "Date"
Date-times are numeric vectors with class `POSIXct` that represent the number of seconds since 1 January 1970. (“POSIXct” stands for Portable Operating System Interface, calendar time.)
x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
## [1] 3600
## attr(,"tzone")
## [1] "UTC"
typeof(x)
## [1] "double"
attributes(x)
## $class
## [1] "POSIXct" "POSIXt"
##
## $tzone
## [1] "UTC"
The `tzone` attribute controls how the time is printed, not which instant it refers to:
attr(x, "tzone") <- "Asia/Seoul"
x
## [1] "1970-01-01 10:00:00 KST"
attr(x, "tzone") <- "Asia/Shanghai"
x
## [1] "1970-01-01 09:00:00 CST"
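That changing `tzone` affects only the display can be confirmed by comparing the underlying number before and after. The sketch uses base R's `as.POSIXct()` in place of lubridate so that it runs without extra packages, and it assumes the time zone database on your system includes "Asia/Seoul".

```r
x <- as.POSIXct("1970-01-01 01:00", tz = "UTC")
before <- as.numeric(x)   # seconds since the epoch

attr(x, "tzone") <- "Asia/Seoul"
after <- as.numeric(x)    # unchanged: still the same instant

format(x, "%H:%M")        # the printed clock time follows the new time zone
```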
Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and `names` (column) and `row.names` attributes:
tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
## [1] "list"
attributes(tb)
## $names
## [1] "x" "y"
##
## $row.names
## [1] 1 2 3 4 5
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
The difference between a tibble and a list is that all the elements of a tibble must be vectors with the same length. All functions that work with tibbles enforce this constraint.
Traditional data.frames have a very similar structure:
df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
## [1] "list"
attributes(df)
## $names
## [1] "x" "y"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5
The main difference is the class. The class of a tibble includes “data.frame”, which means tibbles inherit the regular data frame behaviour by default.
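Inheritance through the class vector can be imitated with a plain data frame. In this sketch, `my_tbl` is a hypothetical class invented for illustration; it is placed in front of "data.frame" in the same way that "tbl_df" and "tbl" are in a tibble.

```r
df <- data.frame(x = 1:5, y = 5:1)

# Hypothetical subclass, prepended to the existing class vector
class(df) <- c("my_tbl", class(df))

# A method for the subclass takes priority when printing
print.my_tbl <- function(x, ...) {
  cat("<my_tbl> with", nrow(x), "rows\n")
  invisible(x)
}

inherits(df, "data.frame")  # TRUE: data frame methods still apply
n <- nrow(df)               # falls back to the data frame behaviour
df$x                        # list-style extraction still works
```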