Introduction

Make sure loading tidyverse:

    library("tidyverse")

Tibbles

Creating tibbles

  • Convert a regular data frame to tibble:

    # a regular data frame
    head(iris)

  • as_tibble(iris)
  • Convert a tibble to data frame:

    as.data.frame(tb)

  • Create a new tibble from individual vectors:

    tibble(
      x = 1:5, 
      y = 1, 
      z = x ^ 2 + y
    )

  • Transposed tibbles:

    tribble(
      ~x, ~y, ~z,
      #--|--|----
      "a", 2, 3.6,
      "b", 1, 8.5
    )

Printing a tibble

  • By default, tibble prints the first 10 rows and all columns that fit on screen.

    tibble(
      a = lubridate::now() + runif(1e3) * 86400,
      b = lubridate::today() + runif(1e3) * 30,
      c = 1:1e3,
      d = runif(1e3),
      e = sample(letters, 1e3, replace = TRUE)   # letters() == 'a' -- 'z'
    )

  • To change number of rows and columns to display:

    nycflights13::flights %>% 
      print(n = 10, width = Inf)
    ## # A tibble: 336,776 x 19
    ##     year month   day dep_time sched_dep_time dep_delay arr_time
    ##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
    ##  1  2013     1     1      517            515         2      830
    ##  2  2013     1     1      533            529         4      850
    ##  3  2013     1     1      542            540         2      923
    ##  4  2013     1     1      544            545        -1     1004
    ##  5  2013     1     1      554            600        -6      812
    ##  6  2013     1     1      554            558        -4      740
    ##  7  2013     1     1      555            600        -5      913
    ##  8  2013     1     1      557            600        -3      709
    ##  9  2013     1     1      557            600        -3      838
    ## 10  2013     1     1      558            600        -2      753
    ##    sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
    ##             <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
    ##  1            819        11 UA        1545 N14228  EWR    IAH        227
    ##  2            830        20 UA        1714 N24211  LGA    IAH        227
    ##  3            850        33 AA        1141 N619AA  JFK    MIA        160
    ##  4           1022       -18 B6         725 N804JB  JFK    BQN        183
    ##  5            837       -25 DL         461 N668DN  LGA    ATL        116
    ##  6            728        12 UA        1696 N39463  EWR    ORD        150
    ##  7            854        19 B6         507 N516JB  EWR    FLL        158
    ##  8            723       -14 EV        5708 N829AS  LGA    IAD         53
    ##  9            846        -8 B6          79 N593JB  JFK    MCO        140
    ## 10            745         8 AA         301 N3ALAA  LGA    ORD        138
    ##    distance  hour minute time_hour          
    ##       <dbl> <dbl>  <dbl> <dttm>             
    ##  1     1400     5     15 2013-01-01 05:00:00
    ##  2     1416     5     29 2013-01-01 05:00:00
    ##  3     1089     5     40 2013-01-01 05:00:00
    ##  4     1576     5     45 2013-01-01 05:00:00
    ##  5      762     6      0 2013-01-01 06:00:00
    ##  6      719     5     58 2013-01-01 05:00:00
    ##  7     1065     6      0 2013-01-01 06:00:00
    ##  8      229     6      0 2013-01-01 06:00:00
    ##  9      944     6      0 2013-01-01 06:00:00
    ## 10      733     6      0 2013-01-01 06:00:00
    ## # ... with 3.368e+05 more rows

  • To change the default print behaviour:
    • options(tibble.print_max = n, tibble.print_min = m): if more than m rows, print only n rows.
    • options(dplyr.print_min = Inf): print all row.
    • options(tibble.width = Inf): print all columns.

Subsetting

  • df <- tibble(
      x = runif(5),
      y = rnorm(5)
    )
  • Extract by name:

    df$x
    ## [1] 0.04646816 0.82877511 0.62831307 0.55478180 0.31797838
    df[["x"]]
    ## [1] 0.04646816 0.82877511 0.62831307 0.55478180 0.31797838

  • Extract by position:

    df[[1]]
    ## [1] 0.04646816 0.82877511 0.62831307 0.55478180 0.31797838
  • Pipe. Use the special placeholder .:

    df %>% .$x
    ## [1] 0.04646816 0.82877511 0.62831307 0.55478180 0.31797838
    df %>% .[["x"]]
    ## [1] 0.04646816 0.82877511 0.62831307 0.55478180 0.31797838

Data import

Let’s start working with our own data. We’ll learn how to read plain-text rectangular files into R.

Package readr

  • readr package implements functions that turn flat files into tibbles.

    • read_csv() (comma delimited files), read_csv2() (semicolon seperated files), read_tsv() (tab delimited files), read_delim() (files with any delimiter).

    • read_fwf() (fixed width files), read_table() (fixed width files where columns are separated by white space).

    • read_log() (Apache style log files).

  • An example file heights.csv is in this link:

    head ./data/heights.csv
    ## "earn","height","sex","ed","age","race"
    ## 50000,74.4244387818035,"male",16,45,"white"
    ## 60000,65.5375428255647,"female",16,58,"white"
    ## 30000,63.6291977374349,"female",16,29,"white"
    ## 50000,63.1085616752971,"female",16,91,"other"
    ## 51000,63.4024835710879,"female",17,39,"white"
    ## 9000,64.3995075440034,"female",15,26,"white"
    ## 29000,61.6563258264214,"female",12,49,"white"
    ## 32000,72.6985437364783,"male",17,46,"white"
    ## 2000,72.0394668497611,"male",15,21,"hispanic"

  • Read from a local file:

    (heights <- read_csv("data/heights.csv"))
    ## Parsed with column specification:
    ## cols(
    ##   earn = col_double(),
    ##   height = col_double(),
    ##   sex = col_character(),
    ##   ed = col_integer(),
    ##   age = col_integer(),
    ##   race = col_character()
    ## )

  • Read from inline csv file. Useful for experimenting with readr:

    read_csv("a,b,c
    1,2,3
    4,5,6")
  • Skip first n lines:

    read_csv("The first line of metadata
      The second line of metadata
      x,y,z
      1,2,3", skip = 2)

  • Skip comment lines:

    read_csv("# A comment I want to skip
      x,y,z
      1,2,3", comment = "#")
  • No header line:

    read_csv("1,2,3\n4,5,6", col_names = FALSE)

    (\n is a convenient shortcut for adding a new line.)


  • No header line and specify col_names:

    read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
  • Specify the symbol representing missing values:

    read_csv("a,b,c\n1,2,.", na = ".")

Parsing a vector

  • parse_*() functions take a character vector and return a more specialised vector like a logical, integer, or date:

    str(parse_logical(c("TRUE", "FALSE", "NA")))
    ##  logi [1:3] TRUE FALSE NA
    str(parse_integer(c("1", "2", "3")))
    ##  int [1:3] 1 2 3
    str(parse_date(c("2010-01-01", "1979-10-14")))
    ##  Date[1:2], format: "2010-01-01" "1979-10-14"
  • Missing values:

    parse_integer(c("1", "231", ".", "456"), na = ".")
    ## [1]   1 231  NA 456
  • If parsing fails, you’ll get a warning and the failures will be missing in the output:

    x <- parse_integer(c("123", "345", "abc", "123.45"))
    ## Warning in rbind(names(probs), probs_f): number of columns of result is not
    ## a multiple of vector length (arg 1)
    ## Warning: 2 parsing failures.
    ## row # A tibble: 2 x 4 col     row   col expected               actual expected   <int> <int> <chr>                  <chr>  actual 1     3    NA an integer             abc    row 2     4    NA no trailing characters .45
    x
    ## [1] 123 345  NA  NA
    ## attr(,"problems")
    ## # A tibble: 2 x 4
    ##     row   col expected               actual
    ##   <int> <int> <chr>                  <chr> 
    ## 1     3    NA an integer             abc   
    ## 2     4    NA no trailing characters .45

Numbers

parse_double() and parse_number().

  • # some countries use `.` in between the integer and fractional 
    # parts of a real number, while others use `,`.
    parse_double("1.23")
    ## [1] 1.23
    parse_double("1,23", locale = locale(decimal_mark = ","))
    ## [1] 1.23
  • # Numbers are often surrounded by other characters that provide some context
    parse_number("$100")
    ## [1] 100
    parse_number("20%")
    ## [1] 20
    parse_number("It cost $123.45")
    ## [1] 123.45
  • # “Grouping” characters to make numbers easier to read, like “1,000,000”, 
    # vary around the world.
    parse_number("$123,456,789") # Used in America
    ## [1] 123456789
    parse_number("123.456.789", locale = locale(grouping_mark = ".")) # Used in many parts of Europe
    ## [1] 123456789
    parse_number("123'456'789", locale = locale(grouping_mark = "'")) # Used in Switzerland
    ## [1] 123456789

Strings

Complications: there are multiple ways to represent the same string.

  • charToRaw("Hadley")
    ## [1] 48 61 64 6c 65 79
    The mapping from hexadecimal number to character is called the encoding.

Modern strings are almost always encoded in UTF-8, but there still are data produced by older systems that don’t understand UTF-8:

  • x1 <- "El Ni\xf1o was particularly bad this year"   # Latin1 (aka ISO-8859-1)
    x2 <- "\xbe\xc8\xb3\xe7\xc7\xcf\xbc\xbc\xbf\xe4"    # EUC-KR (Korean)
    
    x1
    ## [1] "El Ni\xf1o was particularly bad this year"
    x2
    ## [1] "\xbeȳ\xe7\xc7\u03fc\xbc\xbf\xe4"

To fix the problem you need to specify the encoding in parse_character():

  • parse_character(x1, locale = locale(encoding = "Latin1"))
    ## [1] "El Niño was particularly bad this year"
    parse_character(x2, locale = locale(encoding = "EUC-KR"))
    ## [1] "안녕하세요"

Factors

R uses factors to represent categorical variables that have a known set of possible values:

  • fruit <- c("apple", "banana")
    parse_factor(c("apple", "banana", "bananana"), levels = fruit)
    ## Warning: 1 parsing failure.
    ## row # A tibble: 1 x 4 col     row   col expected           actual   expected   <int> <int> <chr>              <chr>    actual 1     3    NA value in level set bananana
    ## [1] apple  banana <NA>  
    ## attr(,"problems")
    ## # A tibble: 1 x 4
    ##     row   col expected           actual  
    ##   <int> <int> <chr>              <chr>   
    ## 1     3    NA value in level set bananana
    ## Levels: apple banana

Dates, date-times, and times

Computer representation of time: 1. date (the number of days since 1970-01-01) 2. date-time (the number of seconds since midnight 1970-01-01) 3. time (the number of seconds since midnight 1970-01-01)

  • parse_datetime() expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second.

    parse_datetime("2010-10-01T2010")
    ## [1] "2010-10-01 20:10:00 UTC"
    # If time is omitted, it will be set to midnight
    parse_datetime("20101010")
    ## [1] "2010-10-10 UTC"
  • parse_date() expects a four digit year, a - or /, the month, a - or /, then the day:

    parse_date("2010-10-01")
    ## [1] "2010-10-01"
  • parse_time() expects the hour, :, minutes, optionally : and seconds, and an optional am/pm specifier:

    library(hms)
    parse_time("01:10 am")
    ## 01:10:00
    parse_time("20:10:01")
    ## 20:10:01

Parsing a file

Challenges

  1. How readr automatically guesses the type of each column.
  2. How to override the default specification.

Strategy

Heuristic used by readr: it reads the first 1000 rows and tries each of the following types, stopping when it finds a match:

  • logical: contains only “F”, “T”, “FALSE”, or “TRUE”.
  • integer: contains only numeric characters (and -).
  • double: contains only valid doubles (including numbers like 4.5e-5).
  • number: contains valid doubles with the grouping mark inside.
  • time: matches the default time_format.
  • date: matches the default date_format.
  • date-time: any ISO8601 date.

If none of these rules apply, then the column will stay as a vector of strings.

Problems

  1. First 1000 rows are not enough.

  2. Missing values.

  • A challenging case:

    challenge <- read_csv(readr_example("challenge.csv"))
    ## Parsed with column specification:
    ## cols(
    ##   x = col_integer(),
    ##   y = col_character()
    ## )
    ## Warning in rbind(names(probs), probs_f): number of columns of result is not
    ## a multiple of vector length (arg 1)
    ## Warning: 1000 parsing failures.
    ## row # A tibble: 5 x 5 col     row col   expected       actual      file                              expected   <int> <chr> <chr>          <chr>       <chr>                             actual 1  1001 x     no trailing c… .238379750… '/Library/Frameworks/R.framework… file 2  1002 x     no trailing c… .411679971… '/Library/Frameworks/R.framework… row 3  1003 x     no trailing c… .746071676… '/Library/Frameworks/R.framework… col 4  1004 x     no trailing c… .723450553… '/Library/Frameworks/R.framework… expected 5  1005 x     no trailing c… .614524137… '/Library/Frameworks/R.framework…
    ## ... ................. ... .......................................................................... ........ .......................................................................... ...... .......................................................................... .... .......................................................................... ... .......................................................................... ... .......................................................................... ........ ..........................................................................
    ## See problems(...) for more details.
  • Let’s identify the problem:

    problems(challenge)

A good strategy is to work column by column until there are no problems remaining.

  • Fix the first problem:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
    x = col_double(),
    y = col_character()
      )
    )
  • But the y appears to be dates stored in a character vector:

    tail(challenge)
  • Fix this:

    challenge <- read_csv(
      readr_example("challenge.csv"), 
      col_types = cols(
    x = col_double(),
    y = col_date()
      )
    )
    tail(challenge)

Other strategies

  1. Look at a few more rows:

    challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
    ## Parsed with column specification:
    ## cols(
    ##   x = col_double(),
    ##   y = col_date(format = "")
    ## )
    challenge2
  2. Read in all the columns as character vectors:

    challenge2 <- read_csv(readr_example("challenge.csv"), 
      col_types = cols(.default = col_character())
    )

    and apply type_convert():

    df <- tribble(
      ~x,  ~y,
      "1", "1.21",
      "2", "2.32",
      "3", "4.56"
    )
    df
    # Note the column types
    type_convert(df)
    ## Parsed with column specification:
    ## cols(
    ##   x = col_integer(),
    ##   y = col_double()
    ## )

Writing to a file

  • Write to csv:

    write_csv(challenge, "challenge.csv")
  • Write (and read) an RDS file, a custom binary format that preserves data types:

    write_rds(challenge, "challenge.rds")
    read_rds("challenge.rds")

Other types of data

  • haven reads SPSS, Stata, and SAS files.

  • readxl reads excel files (both .xls and .xlsx).

  • DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.

  • jsonlite reads JSON files; JSON is a standard data exchange format on the web.

  • xml2 reads XML files.

Tidy data

Tidy data

There are three interrelated rules which make a dataset tidy:

  • Each variable must have its own column.

  • Each observation must have its own row.

  • Each value must have its own cell.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.

Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.


  • Example table1

    table1

    is tidy.


  • Example table2

    table2

    is not tidy.


  • Example table3

    table3

    is not tidy.


  • Example table4a

    table4a

    is not tidy.

  • Example table4b

    table4b

    is not tidy.

Why tidy data?

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)

# Compute cases per year
table1 %>% 
  count(year, wt = cases)

# Visualise changes over time
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

Spreading and gathering

Most data that you will encounter will be untidy.

  1. One variable might be spread across multiple columns.

  2. One observation might be scattered across multiple rows.

Gathering

One variable might be spread across multiple columns.

Gathering `table4` into a tidy form.

Gathering table4 into a tidy form.

  • gather() columns into a new pair of variables.

    table4a %>%
      gather(`1999`, `2000`, key = "year", value = "cases")
    • Key = name of the variable whose values form the column names

    • Value = name of the variable whose values are spread over the cells


  • We can gather table4b too and then join them

    tidy4a <- table4a %>% 
      gather(`1999`, `2000`, key = "year", value = "cases")
    tidy4b <- table4b %>% 
      gather(`1999`, `2000`, key = "year", value = "population")
    left_join(tidy4a, tidy4b)  # we'll learn about this in Ch. 13
    ## Joining, by = c("country", "year")

Spreading

One observation might be scattered across multiple rows.

Spreading `table2` makes it tidy

Spreading table2 makes it tidy

  • Spreading is the inverse of gathering.

    spread(table2, key = type, value = count)

Separating and uniting

table3

Separating

Separating `table3` makes it tidy

Separating table3 makes it tidy

  • table3 %>% 
      separate(rate, into = c("cases", "population"))

  • Seperate into numeric values:

    table3 %>% 
      separate(rate, into = c("cases", "population"), convert = TRUE)

  • Separate at a fixed position:

    table3 %>% 
      separate(year, into = c("century", "year"), sep = 2)

Unite

Uniting `table5` makes it tidy

Uniting table5 makes it tidy

  • table5

  • unite() is the inverse of separate().

    table5 %>% 
      unite(new, century, year, sep = "")