Vector basics

Vectors are the objects that underlie tibbles and data frames.

The hierarchy of R's vector types

  1. Atomic vectors have 6 types: logical, integer, double, character, complex, and raw. Integer and double vectors are collectively known as numeric vectors.

  2. Lists are sometimes called recursive vectors, because lists can contain other lists.

Atomic vectors are homogeneous, while lists can be heterogeneous.

NULL

  • NULL is often used to represent the absence of a vector.
  • In contrast, NA represents the absence of a value within a vector.
  • NULL typically behaves like a vector of length 0.
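A minimal sketch of the NULL / NA distinction above, using base R only (the variable name x is just for illustration):

```r
# NULL is the absence of a vector; NA is a missing value inside one.
length(NULL)         # 0: behaves like a zero-length vector
length(NA)           # 1: a vector holding one missing value

x <- c(1, NULL, NA)  # NULL vanishes when combined, NA stays
length(x)            # 2
is.na(x)             # FALSE TRUE
is.null(NULL)        # TRUE
```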

Two key properties of the vector

  1. Type:

    typeof(letters)
    ## [1] "character"
    typeof(1:10)
    ## [1] "integer"
  2. Length:

    x <- list("a", "b", 1:10)
    length(x)
    ## [1] 3

Augmented vectors

  • Factors are built on top of integer vectors.
  • Dates and date-times are built on top of numeric vectors.
  • Data frames and tibbles are built on top of lists.

Important types of atomic vector

Logical

  • Take only three possible values: FALSE, TRUE, and NA.

  • Usually constructed with comparison operators (see Lecture 3).

  • Examples, first with a comparison and then with manual creation using c():

    1:10 %% 3 == 0
    ##  [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
    c(TRUE, TRUE, FALSE, NA)
    ## [1]  TRUE  TRUE FALSE    NA

Numeric

In R, numbers are doubles by default.

typeof(1)
## [1] "double"

To make an integer, place an L after the number:

typeof(1L)
## [1] "integer"
1.5L  # not a whole number: R warns and keeps a double
## [1] 1.5

Caution

  1. Doubles are approximations due to finite precision of the computer.
```r
x <- sqrt(2) ^ 2  # recurring example
x
```

```
## [1] 2
```

```r
x - 2
```

```
## [1] 4.440892e-16
```
Instead of comparing floating point numbers using `==`, you should use `dplyr::near()` for some numerical tolerance.
  2. Special values

    • Integers: NA
    • Doubles: NA, NaN, Inf and -Inf.
    c(-1, 0, 1) / 0
    ## [1] -Inf  NaN  Inf

    Again avoid using == to check for these other special values. Instead use is.finite(), is.infinite(), and is.nan():

                     0     Inf   NA    NaN
    is.finite()      O
    is.infinite()          O
    is.na()                      O     O
    is.nan()                           O
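Both cautions can be checked in a few lines. This is a sketch using base R only, so it runs without dplyr; the 1e-8 tolerance is an assumed value, not one taken from the lecture:

```r
x <- sqrt(2) ^ 2

x == 2                    # FALSE: exact comparison fails
abs(x - 2) < 1e-8         # TRUE: comparison with an (assumed) tolerance
isTRUE(all.equal(x, 2))   # TRUE: base R's tolerant comparison

v <- c(0, Inf, NA, NaN)
is.finite(v)              # TRUE FALSE FALSE FALSE
is.infinite(v)            # FALSE TRUE FALSE FALSE
is.na(v)                  # FALSE FALSE TRUE TRUE
is.nan(v)                 # FALSE FALSE FALSE TRUE
```

Note that is.na() is TRUE for both NA and NaN, while is.nan() is TRUE for NaN only.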

Character

Each element of a character vector is a string, and a string can contain an arbitrary amount of data.

Global string pool

  • Each unique string is stored in memory only once.

  • Every use of the string points to that representation.

  • This reduces the amount of memory needed by duplicated strings.

    x <- "This is a reasonably long string."
    pryr::object_size(x)
    ## Registered S3 method overwritten by 'pryr':
    ##   method      from
    ##   print.bytes Rcpp
    ## 152 B
    y <- rep(x, 1000)
    pryr::object_size(y)
    ## 8.14 kB

    y doesn’t take up 1,000x as much memory as x!

Using atomic vectors

Coercion: how to convert from one type to another, and when that happens automatically

Two ways:

  1. Explicit coercion: by calling as.logical(), as.integer(), as.double(), or as.character(), etc.

  2. Implicit coercion: happens when you use a vector in a specific context that expects a certain type of vector. Examples:

    • when you use a logical vector with a numeric summary function
    • when you use a double vector where an integer vector is expected.

    From a logical vector to a numeric vector: TRUE is converted to 1 and FALSE is converted to 0:

    x <- sample(20, 100, replace = TRUE)
    y <- x > 10
    sum(y)  # how many are greater than 10?
    ## [1] 50
    mean(y) # what proportion are greater than 10?
    ## [1] 0.5

    Implicit coercion from integer to logical:

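    if (length(x)) {
      # do something
    }

    This works because a length of 0 is coerced to FALSE and any other length to TRUE, but the intent is easy to misread. Be explicit: use length(x) > 0.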
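A short sketch of explicit coercion with the as.*() functions listed above. The input values are arbitrary illustrations:

```r
as.integer("42")         # 42
as.double("3.14")        # 3.14
as.character(1:3)        # "1" "2" "3"
as.logical(c(0, 1, 2))   # FALSE TRUE TRUE: zero is FALSE, non-zero is TRUE
as.integer(3.9)          # 3: truncates toward zero, does not round
as.logical("yes")        # NA: not a spelling of TRUE/FALSE that R recognises
```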

When c() combines multiple types, the most complex type always wins.

typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 1.5))
## [1] "double"
typeof(c(1.5, "a"))
## [1] "character"

An atomic vector cannot have a mix of different types!

Test functions: how to tell if an object is a specific type of vector

                   lgl   int   dbl   chr   list
is_logical()       O
is_integer()             O
is_double()                    O
is_numeric()             O     O
is_character()                       O
is_atomic()        O     O     O     O
is_list()                                  O
is_vector()        O     O     O     O     O
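The same table can be reproduced programmatically. The sketch below swaps in the base R is.*() functions rather than the is_*() helpers named above, so that it runs without extra packages; the names vals, tests and result are made up for this example:

```r
# One example value per column of the table
vals <- list(lgl = TRUE, int = 1L, dbl = 1.5, chr = "a", list = list(1))

# Base R counterparts of the test functions (an assumption of this sketch)
tests <- list(
  is.logical = is.logical, is.integer = is.integer, is.double = is.double,
  is.numeric = is.numeric, is.character = is.character,
  is.atomic = is.atomic, is.list = is.list, is.vector = is.vector
)

# Rows are test functions, columns are value types
result <- sapply(vals, function(v) sapply(tests, function(f) f(v)))
result
```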

Scalars and recycling rules: what happens when you work with vectors of different lengths

Vector recycling: implicit coercion of the length of vectors

Most intuitive:

sample(10) + 100
##  [1] 102 107 110 103 108 109 105 106 104 101
runif(10) > 0.5
##  [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE

In R, basic mathematical operations work with vectors.

Unintuitive:

1:10 + 1:2
##  [1]  2  4  4  6  6  8  8 10 10 12

Here, R expands the shorter vector to the length of the longer one. This is called recycling.

When the length of the longer is not an integer multiple of the length of the shorter:

1:10 + 1:3
## Warning in 1:10 + 1:3: longer object length is not a multiple of shorter object
## length
##  [1]  2  4  6  5  7  9  8 10 12 11

The tidyverse way: vectorised tidyverse functions throw an error when you recycle anything other than a scalar. If you do want to recycle, you’ll need to do it yourself with rep():

tibble(x = 1:4, y = 1:2)
## Error: Tibble columns must have compatible sizes.
## * Size 4: Existing data.
## * Size 2: Column `y`.
## ℹ Only values of size one are recycled.
tibble(x = 1:4, y = rep(1:2, 2))
tibble(x = 1:4, y = rep(1:2, each = 2))
tibble(x = 1:4, y = 1)  # this is allowed: length-1 values are recycled
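To make explicit what the two rep() calls above produce, here is a small base R sketch; the length.out variant is an additional illustration not used in the lecture code:

```r
rep(1:2, times = 2)         # 1 2 1 2: repeat the whole vector
rep(1:2, each = 2)          # 1 1 2 2: repeat each element in turn
rep(1:3, length.out = 10)   # 1 2 3 1 2 3 1 2 3 1: recycle to a given length

# Spelling out the recycling that 1:10 + 1:2 does implicitly
x <- 1:10
x + rep(1:2, length.out = length(x))
```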

Naming vectors: how to name the elements of a vector

All types of vectors can be named:

c(x = 1, y = 2, z = 4)
## x y z 
## 1 2 4

Or with purrr::set_names():

set_names(1:3, c("a", "b", "c"))
## a b c 
## 1 2 3
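Names can also be set after creation. A sketch with base R's names<- and setNames(), shown here as an alternative that needs no packages:

```r
x <- c(1, 2, 4)
names(x) <- c("x", "y", "z")   # attach names to an existing vector
x

y <- setNames(1:3, c("a", "b", "c"))   # base R counterpart of set_names()
names(y)    # "a" "b" "c"
unname(y)   # 1 2 3: drop the names again
```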

Subsetting: how to pull out elements of interest

[ is the subsetting function for vectors and is called like x[a] (cf. dplyr::filter() for tibbles). You can subset a vector with four kinds of things:

  1. A numeric vector containing only integers. The integers must either be all positive, all negative, or zero.

    x <- c("one", "two", "three", "four", "five")
    x[c(3, 2, 5)]
    ## [1] "three" "two"   "five"

    By repeating a position, you can actually make a longer output than input:

    x[c(1, 1, 5, 5, 5, 2)]
    ## [1] "one"  "one"  "five" "five" "five" "two"

    Negative values drop the elements at the specified positions:

    x[c(-1, -3, -5)]
    ## [1] "two"  "four"

    It’s an error to mix positive and negative values:

    x[c(1, -1)]
    ## Error in x[c(1, -1)]: only 0's may be mixed with negative subscripts

    Subsetting with zero:

    x[0]
    ## character(0)
  2. Subsetting with a logical vector:

    x <- c(10, 3, NA, 5, 8, 1, NA)
    
    # All non-missing values of x
    x[!is.na(x)]
    ## [1] 10  3  5  8  1
    # All even (or missing!) values of x
    x[x %% 2 == 0]
    ## [1] 10 NA  8 NA
  3. Subsetting a named vector:

    x <- c(abc = 1, def = 2, xyz = 5)
    x[c("xyz", "def")]
    ## xyz def 
    ##   5   2
    x[c("xyz", "def", "xyz")]
    ## xyz def xyz 
    ##   5   2   5
  4. Subsetting nothing: x[] returns the complete x. Useful when subsetting matrices:

    x <- c(1, 2, 3)
    x[]
    ## [1] 1 2 3
    y <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2 )
    y[1,]
    ## [1] 1 3 5
    y[,-1]
    ##      [,1] [,2]
    ## [1,]    3    5
    ## [2,]    4    6
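In the logical subsetting example above, missing values in the index produce missing values in the output. If that is not wanted, one possible way to drop them is sketched below, in base R:

```r
x <- c(10, 3, NA, 5, 8, 1, NA)

x[x %% 2 == 0]               # 10 NA 8 NA: an NA index yields an NA element
x[which(x %% 2 == 0)]        # 10 8: which() keeps only the TRUE positions
x[!is.na(x) & x %% 2 == 0]   # 10 8: same result with an explicit NA check
```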

Lists

x <- list(1, 2, 3)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
str(x)  # `str` for structure
## List of 3
##  $ : num 1
##  $ : num 2
##  $ : num 3
x_named <- list(a = 1, b = 2, c = 3)
str(x_named)
## List of 3
##  $ a: num 1
##  $ b: num 2
##  $ c: num 3

Unlike atomic vectors, list() can contain a mix of objects:

y <- list("a", 1L, 1.5, TRUE)
str(y)
## List of 4
##  $ : chr "a"
##  $ : int 1
##  $ : num 1.5
##  $ : logi TRUE

Lists can even contain other lists!

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

Visualisation:

  1. Lists have rounded corners. Atomic vectors have square corners.

  2. Children are drawn inside their parent, and have a slightly darker background to make it easier to see the hierarchy.

  3. The orientation of the children (i.e. rows or columns) isn’t important, so I’ll pick a row or column orientation to either save space or illustrate an important property in the example.
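The same hierarchy can be inspected as text with str(). The helper list_depth() below is a hypothetical function written for this sketch (it is not part of base R or of the lecture); it counts how many levels of list nesting an object has:

```r
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

str(x3)   # prints the nesting as an indented tree

# Hypothetical helper: nesting depth of a list (0 for an atomic vector)
list_depth <- function(x) {
  if (!is.list(x)) return(0L)
  1L + max(0L, vapply(x, list_depth, integer(1)))
}

list_depth(x1)   # 1
list_depth(x2)   # 2
list_depth(x3)   # 3
```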

Subsetting

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
  1. [ extracts a sub-list. The result will always be a list.
```r
str(a[1:2])
```

```
## List of 2
##  $ a: int [1:3] 1 2 3
##  $ b: chr "a string"
```

```r
str(a[4])
```

```
## List of 1
##  $ d:List of 2
##   ..$ : num -1
##   ..$ : num -5
```

Like with vectors, you can subset with a logical, integer, or character vector.
  2. [[ extracts a single component from a list. It removes a level of hierarchy from the list.
```r
str(a[[1]])
```

```
##  int [1:3] 1 2 3
```

```r
str(a[[4]])
```

```
## List of 2
##  $ : num -1
##  $ : num -5
```
  3. $ is a shorthand for extracting named elements of a list. It works similarly to [[ except that you don’t need to use quotes.
```r
a$a
```

```
## [1] 1 2 3
```

```r
a[["a"]]
```

```
## [1] 1 2 3
```

[ vs [[

Subsetting a list, visually.
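The difference between [ and [[ can also be seen without a diagram. A small sketch reusing the list a from above:

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

typeof(a[1])      # "list": [ keeps the list wrapper
typeof(a[[1]])    # "integer": [[ returns the element itself

length(a[4])      # 1: a list holding the sub-list d
length(a[[4]])    # 2: the sub-list d itself

a[[4]][[1]]       # -1: chain [[ to drill down
a[["d"]][[2]]     # -5: names work too
```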

Attributes

Any vector can contain arbitrary additional metadata through its attributes. Attributes are like a named list of vectors that can be attached to any object.

x <- 1:10
attr(x, "greeting")  # get an individual attribute
## NULL
attr(x, "greeting") <- "Hi!"  # set an individual attribute
attr(x, "farewell") <- "Bye!" # set an individual attribute
attributes(x)  # get all at once
## $greeting
## [1] "Hi!"
## 
## $farewell
## [1] "Bye!"

Fundamental attributes:

  1. Names: used to name the elements of a vector.
  2. Dimensions (dims, for short): make a vector behave like a matrix or array.
  3. Class: used to implement the S3 object oriented system.
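A brief sketch of the first two fundamental attributes in action (class is covered by the sections that follow). The values are arbitrary:

```r
x <- 1:6
attributes(x)        # NULL: a bare vector has no attributes
dim(x) <- c(2, 3)    # setting dim makes it behave like a 2 x 3 matrix
is.matrix(x)         # TRUE
x[2, 3]              # 6: values fill column by column

y <- c(1, 2, 3)
names(y) <- c("a", "b", "c")
attr(y, "names")     # "a" "b" "c": names are stored as an attribute
y[["b"]]             # 2
```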

Generic functions

Class controls how generic functions work

as.Date
## function (x, ...) 
## UseMethod("as.Date")
## <bytecode: 0x7ff94bb22068>
## <environment: namespace:base>
  • The call to “UseMethod” means that this is a generic function, and it will call a specific method, a function, based on the class of the first argument.

  • All methods are functions; not all functions are methods.

  • List all the methods for a generic with methods():

    methods("as.Date")
    ## [1] as.Date.character   as.Date.default     as.Date.factor     
    ## [4] as.Date.numeric     as.Date.POSIXct     as.Date.POSIXlt    
    ## [7] as.Date.vctrs_sclr* as.Date.vctrs_vctr*
    ## see '?methods' for accessing help and source code

    If x is a character vector, as.Date() will call as.Date.character(); if it’s a factor, it’ll call as.Date.factor().
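The dispatch mechanism can be tried out with a toy generic. Everything named describe below is hypothetical and exists only for this sketch:

```r
# A hypothetical generic: UseMethod() picks a method from the class of x
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class
describe.character <- function(x, ...) "a character vector"
describe.factor    <- function(x, ...) "a factor"
describe.default   <- function(x, ...) "something else"  # fallback

describe("a")           # "a character vector"
describe(factor("a"))   # "a factor"
describe(1:3)           # "something else": no integer method, so default
```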

Specific implementation of a method:

getS3method("as.Date", "default")
## function (x, ...) 
## {
##     if (inherits(x, "Date")) 
##         x
##     else if (is.null(x)) 
##         .Date(numeric())
##     else if (is.logical(x) && all(is.na(x))) 
##         .Date(as.numeric(x))
##     else stop(gettextf("do not know how to convert '%s' to class %s", 
##         deparse1(substitute(x)), dQuote("Date")), domain = NA)
## }
## <bytecode: 0x7ff94f00f7e8>
## <environment: namespace:base>
getS3method("as.Date", "numeric")
## function (x, origin, ...) 
## {
##     if (missing(origin)) {
##         if (!length(x)) 
##             return(.Date(numeric()))
##         if (!any(is.finite(x))) 
##             return(.Date(x))
##         stop("'origin' must be supplied")
##     }
##     as.Date(origin, ...) + x
## }
## <bytecode: 0x7ff94d04e6e0>
## <environment: namespace:base>

The most important S3 generic is print(): it controls how the object is printed when you type its name at the console.

print
## function (x, ...) 
## UseMethod("print")
## <bytecode: 0x7ff94ca45dc0>
## <environment: namespace:base>
methods("print") %>% head(50)
##  [1] "print.acf"                                   
##  [2] "print.AES"                                   
##  [3] "print.all_vars"                              
##  [4] "print.anova"                                 
##  [5] "print.ansi_string"                           
##  [6] "print.ansi_style"                            
##  [7] "print.any_vars"                              
##  [8] "print.aov"                                   
##  [9] "print.aovlist"                               
## [10] "print.ar"                                    
## [11] "print.Arima"                                 
## [12] "print.arima0"                                
## [13] "print.AsIs"                                  
## [14] "print.aspell"                                
## [15] "print.aspell_inspect_context"                
## [16] "print.bibentry"                              
## [17] "print.Bibtex"                                
## [18] "print.boxx"                                  
## [19] "print.browseVignettes"                       
## [20] "print.by"                                    
## [21] "print.bytes"                                 
## [22] "print.cache_info"                            
## [23] "print.cell_addr"                             
## [24] "print.cell_limits"                           
## [25] "print.changedFiles"                          
## [26] "print.check_code_usage_in_package"           
## [27] "print.check_compiled_code"                   
## [28] "print.check_demo_index"                      
## [29] "print.check_depdef"                          
## [30] "print.check_details"                         
## [31] "print.check_details_changes"                 
## [32] "print.check_doi_db"                          
## [33] "print.check_dotInternal"                     
## [34] "print.check_make_vars"                       
## [35] "print.check_nonAPI_calls"                    
## [36] "print.check_package_code_assign_to_globalenv"
## [37] "print.check_package_code_attach"             
## [38] "print.check_package_code_data_into_globalenv"
## [39] "print.check_package_code_startup_functions"  
## [40] "print.check_package_code_syntax"             
## [41] "print.check_package_code_unload_functions"   
## [42] "print.check_package_compact_datasets"        
## [43] "print.check_package_CRAN_incoming"           
## [44] "print.check_package_datalist"                
## [45] "print.check_package_datasets"                
## [46] "print.check_package_depends"                 
## [47] "print.check_package_description"             
## [48] "print.check_package_description_encoding"    
## [49] "print.check_package_license"                 
## [50] "print.check_packages_in_dir"
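Adding a print method for your own class works the same way as the methods listed above. The class name celsius and its output format are assumptions made up for this sketch:

```r
# A value tagged with a hypothetical class
temp <- structure(21.5, class = "celsius")

# print() dispatches to print.celsius() for objects of that class
print.celsius <- function(x, ...) {
  cat(format(unclass(x)), "degrees C\n")
  invisible(x)   # print methods conventionally return their input invisibly
}

temp                                 # typing the name calls print(temp)
out <- capture.output(print(temp))   # capture the printed line for inspection
```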

Augmented vectors

Vectors with additional attributes:

Factors

Factors are built on top of integers, and have a levels attribute:

x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))
typeof(x)
## [1] "integer"
attributes(x)
## $levels
## [1] "ab" "cd" "ef"
## 
## $class
## [1] "factor"
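To see how the integer codes and the levels attribute fit together, a short base R sketch using the same factor:

```r
x <- factor(c("ab", "cd", "ab"), levels = c("ab", "cd", "ef"))

as.integer(x)              # 1 2 1: positions in the levels vector
levels(x)                  # "ab" "cd" "ef"
levels(x)[as.integer(x)]   # "ab" "cd" "ab": how the labels are recovered
as.character(x)            # "ab" "cd" "ab": the same thing in one call
table(x)                   # the unused level "ef" is still counted, as 0
```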

Dates and date-times

Dates in R are numeric vectors that represent the number of days since 1 January 1970:

x <- as.Date("1971-01-01")
unclass(x)
## [1] 365
typeof(x)
## [1] "double"
attributes(x)
## $class
## [1] "Date"

Date-times are numeric vectors with class POSIXct that represent the number of seconds since 1 January 1970. (“POSIXct” stands for Portable Operating System Interface, calendar time.)

x <- lubridate::ymd_hm("1970-01-01 01:00")
unclass(x)
## [1] 3600
## attr(,"tzone")
## [1] "UTC"
typeof(x)
## [1] "double"
attributes(x)
## $class
## [1] "POSIXct" "POSIXt" 
## 
## $tzone
## [1] "UTC"

tzone controls how the time is printed:

attr(x, "tzone") <- "Asia/Seoul"
x
## [1] "1970-01-01 10:00:00 KST"
attr(x, "tzone") <- "Asia/Shanghai"
x
## [1] "1970-01-01 09:00:00 CST"
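Because dates and date-times are just numbers underneath, arithmetic works directly on them. The sketch below uses base R only (as.POSIXct() instead of lubridate, to avoid a package dependency); the names d, dt and dt_seoul are illustrative:

```r
d <- as.Date("1971-01-01")
as.numeric(d)                           # 365 days since 1970-01-01
d + 30                                  # adding a number adds days
as.Date(365, origin = "1970-01-01")     # back from a number to a Date

dt <- as.POSIXct(3600, origin = "1970-01-01", tz = "UTC")
as.numeric(dt)                          # 3600 seconds since the epoch

# Changing tzone changes only how the instant is printed, not the instant
dt_seoul <- dt
attr(dt_seoul, "tzone") <- "Asia/Seoul"
as.numeric(dt_seoul) == as.numeric(dt)  # TRUE
```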

Tibbles

Tibbles are augmented lists: they have class “tbl_df” + “tbl” + “data.frame”, and names (column) and row.names attributes:

tb <- tibble::tibble(x = 1:5, y = 5:1)
typeof(tb)
## [1] "list"
attributes(tb)
## $names
## [1] "x" "y"
## 
## $row.names
## [1] 1 2 3 4 5
## 
## $class
## [1] "tbl_df"     "tbl"        "data.frame"

The difference between a tibble and a plain list is that all the elements of a tibble must be vectors of the same length. All functions that work with tibbles enforce this constraint.

Traditional data.frames have a very similar structure:

df <- data.frame(x = 1:5, y = 5:1)
typeof(df)
## [1] "list"
attributes(df)
## $names
## [1] "x" "y"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5

The main difference is the class. The class of tibble includes “data.frame” which means tibbles inherit the regular data frame behaviour by default.
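Since a data frame is a list underneath, the list tools from earlier apply to it. A sketch with a base data.frame only, so no packages are needed:

```r
df <- data.frame(x = 1:5, y = 5:1)

is.list(df)                  # TRUE: a data frame is a list of columns
length(df)                   # 2: the number of columns, not rows
lengths(df)                  # x 5, y 5: every column has the same length
df[["y"]]                    # 5 4 3 2 1: [[ extracts one column
df$x                         # 1 2 3 4 5: $ works as on any named list
inherits(df, "data.frame")   # TRUE: this check is what S3 dispatch relies on
```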