Make sure the tidyverse is loaded:
# Attach the tidyverse meta-package (tibble, readr, dplyr, tidyr, ggplot2, ...)
library(tidyverse)
# a regular data frame
head(iris)
# coerce the data frame to a tibble, keeping a handle so we can convert back
tb <- as_tibble(iris)
# and convert the tibble back to a plain data frame
as.data.frame(tb)
# build a tibble column by column; length-1 inputs are recycled (y),
# and later columns may refer to earlier ones (z uses x and y)
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
# tribble(): transposed tibble, laid out row by row for small examples
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
# tibbles accept any column type, including date-times and dates
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE) # letters is the built-in constant "a"-"z"
)
# Print the first 10 rows of the flights tibble, showing every column
# (width = Inf disables column truncation); the transcript follows below.
nycflights13::flights %>%
print(n = 10, width = Inf)
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## arr_delay carrier flight tailnum origin dest air_time distance hour minute
## <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
## 2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
## 3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
## 4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
## 5 -25 DL 461 N668DN LGA ATL 116 762 6 0
## 6 12 UA 1696 N39463 EWR ORD 150 719 5 58
## 7 19 B6 507 N516JB EWR FLL 158 1065 6 0
## 8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
## 9 -8 B6 79 N593JB JFK MCO 140 944 6 0
## 10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
## time_hour
## <dttm>
## 1 2013-01-01 05:00:00
## 2 2013-01-01 05:00:00
## 3 2013-01-01 05:00:00
## 4 2013-01-01 05:00:00
## 5 2013-01-01 06:00:00
## 6 2013-01-01 05:00:00
## 7 2013-01-01 06:00:00
## 8 2013-01-01 06:00:00
## 9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # … with 336,766 more rows
options(tibble.print_max = n, tibble.print_min = m): if there are more than m rows, print only n rows.
options(dplyr.print_min = Inf): print all rows.
options(tibble.width = Inf): print all columns.
df <- tibble(
x = runif(5),
y = rnorm(5)
)
# Extract a single column by name with $
df$x
## [1] 0.5289031 0.8118254 0.4098623 0.5726840 0.2423287
# ... equivalently with [[ and a name
df[["x"]]
## [1] 0.5289031 0.8118254 0.4098623 0.5726840 0.2423287
# ... or with [[ and a position
df[[1]]
## [1] 0.5289031 0.8118254 0.4098623 0.5726840 0.2423287
To extract a column in a pipe, use the special placeholder `.`:
df %>% .$x
## [1] 0.5289031 0.8118254 0.4098623 0.5726840 0.2423287
df %>% .[["x"]]
## [1] 0.5289031 0.8118254 0.4098623 0.5726840 0.2423287
Let’s start working with our own data. We’ll learn how to read plain-text rectangular files into R.
readr package implements functions that turn flat files into tibbles.
read_csv()
(comma delimited files), read_csv2()
(semicolon separated files), read_tsv()
(tab delimited files), read_delim()
(files with any delimiter).
read_fwf()
(fixed width files), read_table()
(fixed width files where columns are separated by white space).
read_log()
(Apache style log files).
An example file heights.csv
is in this link:
head ./data/heights.csv
## "earn","height","sex","ed","age","race"
## 50000,74.4244387818035,"male",16,45,"white"
## 60000,65.5375428255647,"female",16,58,"white"
## 30000,63.6291977374349,"female",16,29,"white"
## 50000,63.1085616752971,"female",16,91,"other"
## 51000,63.4024835710879,"female",17,39,"white"
## 9000,64.3995075440034,"female",15,26,"white"
## 29000,61.6563258264214,"female",12,49,"white"
## 32000,72.6985437364783,"male",17,46,"white"
## 2000,72.0394668497611,"male",15,21,"hispanic"
# Read the file and print the result (outer parentheses force printing)
(heights <- read_csv("data/heights.csv"))
## Parsed with column specification:
## cols(
## earn = col_double(),
## height = col_double(),
## sex = col_character(),
## ed = col_double(),
## age = col_double(),
## race = col_character()
## )
read_csv("a,b,c
1,2,3
4,5,6")
Skip the first n lines with skip = n:
read_csv("The first line of metadata
The second line of metadata
x,y,z
1,2,3", skip = 2)
read_csv("# A comment I want to skip
x,y,z
1,2,3", comment = "#")
read_csv("1,2,3\n4,5,6", col_names = FALSE)
(`\n` is a convenient shortcut for adding a new line.)
Pass col_names a character vector to supply the column names yourself:
read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))
read_csv("a,b,c\n1,2,.", na = ".")
parse_*()
functions take a character vector and return a more specialised vector like a logical, integer, or date: str(parse_logical(c("TRUE", "FALSE", "NA")))
## logi [1:3] TRUE FALSE NA
str(parse_integer(c("1", "2", "3")))
## int [1:3] 1 2 3
str(parse_date(c("2010-01-01", "1979-10-14")))
## Date[1:2], format: "2010-01-01" "1979-10-14"
parse_integer(c("1", "231", ".", "456"), na = ".")
## [1] 1 231 NA 456
x <- parse_integer(c("123", "345", "abc", "123.45"))
## Warning: 2 parsing failures.
## row col expected actual
## 3 -- an integer abc
## 4 -- no trailing characters .45
x
## [1] 123 345 NA NA
## attr(,"problems")
## # A tibble: 2 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA an integer abc
## 2 4 NA no trailing characters .45
parse_double()
and parse_number()
.
# some countries use `.` in between the integer and fractional
# parts of a real number, while others use `,`.
parse_double("1.23")
## [1] 1.23
# locale() overrides the default decimal mark for such countries
parse_double("1,23", locale = locale(decimal_mark = ","))
## [1] 1.23
# Numbers are often surrounded by other characters that provide some context
parse_number("$100")
## [1] 100
parse_number("20%")
## [1] 20
parse_number("It cost $123.45")
## [1] 123.45
# “Grouping” characters to make numbers easier to read, like “1,000,000”,
# vary around the world.
# locale(grouping_mark = ...) tells parse_number() which character to ignore
parse_number("$123,456,789") # Used in America
## [1] 123456789
parse_number("123.456.789", locale = locale(grouping_mark = ".")) # Used in many parts of Europe
## [1] 123456789
parse_number("123'456'789", locale = locale(grouping_mark = "'")) # Used in Switzerland
## [1] 123456789
Complications: there are multiple ways to represent the same string.
charToRaw("Hadley")
## [1] 48 61 64 6c 65 79
The mapping from hexadecimal number to character is called the encoding.
Modern strings are almost always encoded in UTF-8, but there still are data produced by older systems that don’t understand UTF-8:
x1 <- "El Ni\xf1o was particularly bad this year" # Latin1 (aka ISO-8859-1)
x2 <- "\xbe\xc8\xb3\xe7\xc7\xcf\xbc\xbc\xbf\xe4" # EUC-KR (Korean)
x1
## [1] "El Ni\xf1o was particularly bad this year"
x2
## [1] "\xbeȳ\xe7\xc7\u03fc\xbc\xbf\xe4"
To fix the problem you need to specify the encoding in parse_character()
:
parse_character(x1, locale = locale(encoding = "Latin1"))
## [1] "El Niño was particularly bad this year"
parse_character(x2, locale = locale(encoding = "EUC-KR"))
## [1] "안녕하세요"
R uses factors to represent categorical variables that have a known set of possible values:
# The known set of levels for the factor
fruit <- c("apple", "banana")
# Values outside `levels` produce a parsing failure and become NA
parse_factor(c("apple", "banana", "bananana"), levels = fruit)
## Warning: 1 parsing failure.
## row col expected actual
## 3 -- value in level set bananana
## [1] apple banana <NA>
## attr(,"problems")
## # A tibble: 1 x 4
## row col expected actual
## <int> <int> <chr> <chr>
## 1 3 NA value in level set bananana
## Levels: apple banana
Computer representation of time: 1. date (the number of days since 1970-01-01) 2. date-time (the number of seconds since midnight 1970-01-01) 3. time (the number of seconds since midnight)
parse_datetime()
expects an ISO8601 date-time. ISO8601 is an international standard in which the components of a date are organised from biggest to smallest: year, month, day, hour, minute, second. parse_datetime("2010-10-01T2010")
## [1] "2010-10-01 20:10:00 UTC"
# If time is omitted, it will be set to midnight
parse_datetime("20101010")
## [1] "2010-10-10 UTC"
parse_date()
expects a four digit year, a -
or /
, the month, a -
or /
, then the day: parse_date("2010-10-01")
## [1] "2010-10-01"
parse_time()
expects the hour, :
, minutes, optionally :
and seconds, and an optional am/pm specifier: library(hms)
parse_time("01:10 am")
## 01:10:00
parse_time("20:10:01")
## 20:10:01
Heuristic used by readr: it reads the first 1000 rows and tries each of the following types, stopping when it finds a match:
logical: contains only "F", "T", "FALSE", or "TRUE".
integer: contains only numeric characters (and -).
double: contains only valid doubles (such as 4.5e-5).
number: contains valid doubles with the grouping mark inside.
time: matches the default time_format.
date: matches the default date_format.
date-time: any ISO8601 date.
If none of these rules apply, then the column will stay as a vector of strings.
First 1000 rows are not enough.
Missing values.
challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_logical()
## )
## Warning: 1000 parsing failures.
## row col expected actual file
## 1001 y 1/0/T/F/TRUE/FALSE 2015-01-16 '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readr/extdata/challenge.csv'
## 1002 y 1/0/T/F/TRUE/FALSE 2018-05-18 '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readr/extdata/challenge.csv'
## 1003 y 1/0/T/F/TRUE/FALSE 2015-09-05 '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readr/extdata/challenge.csv'
## 1004 y 1/0/T/F/TRUE/FALSE 2012-11-28 '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readr/extdata/challenge.csv'
## 1005 y 1/0/T/F/TRUE/FALSE 2020-01-13 '/Library/Frameworks/R.framework/Versions/4.0/Resources/library/readr/extdata/challenge.csv'
## .... ... .................. .......... ............................................................................................
## See problems(...) for more details.
problems(challenge)
A good strategy is to work column by column until there are no problems remaining.
# Re-read, overriding the guessed types: keep x numeric, but read y as
# character so no values are lost while we figure out what it holds
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_character()
)
)
y appears to contain dates stored in a character vector; inspect the end of the data: tail(challenge)
# y really holds ISO dates, so parse it directly with col_date()
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
# Check the last rows, where the earlier parsing failures occurred
tail(challenge)
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_date(format = "")
## )
challenge2
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
and apply `type_convert()`:
# All columns entered as character strings; type_convert() re-runs the
# readr parsing heuristic on each character column
df <- tribble(
~x, ~y,
"1", "1.21",
"2", "2.32",
"3", "4.56"
)
df
# Note the column types
type_convert(df)
## Parsed with column specification:
## cols(
## x = col_double(),
## y = col_double()
## )
write_csv(challenge, "challenge.csv")
write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")
haven reads SPSS, Stata, and SAS files.
readxl reads excel files (both .xls and .xlsx).
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.
jsonlite reads JSON files; JSON is a standard data exchange format on the web.
xml2 reads XML files.
There are three interrelated rules which make a dataset tidy:
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.
table1
is tidy.
table2
is not tidy.
table3
is not tidy.
table4a
is not tidy.
table4b
is not tidy.
dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
# Compute cases per year
table1 %>%
count(year, wt = cases)
# Visualise changes over time
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
Most data that you will encounter will be untidy.
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
One variable might be spread across multiple columns.
gather()
columns into a new pair of variables. table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
+ __Key__ = name of the variable whose values form the column names
+ __Value__ = name of the variable whose values are spread over the cells
# NOTE(review): gather() is superseded by tidyr::pivot_longer(); kept here
# so the code matches the transcript output below.
tidy4a <- table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
tidy4b <- table4b %>%
gather(`1999`, `2000`, key = "year", value = "population")
left_join(tidy4a, tidy4b) # we'll learn about this in Ch. 13
## Joining, by = c("country", "year")
One observation might be scattered across multiple rows.
spread(table2, key = type, value = count)
table3
table3 %>%
separate(rate, into = c("cases", "population"))
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
table5
unite()
is the inverse of separate()
. table5 %>%
unite(new, century, year, sep = "")