Factors

In R, factors are used to work with categorical variables: variables that have a fixed and known set of possible values.

Package forcats

  • Provides tools for dealing with categorical variables

    library(forcats)

Creating factors

Suppose you have a variable that records month:

x1 <- c("Dec", "Apr", "Jan", "Mar")

Two problems:

  1. There are only twelve possible months, and there’s nothing saving you from typos:

    x2 <- c("Dec", "Apr", "Jam", "Mar")
  2. It doesn’t sort in a useful way:

    sort(x1)
    ## [1] "Apr" "Dec" "Jan" "Mar"

To fix these, use factors:

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Any values not in the set will be silently converted to NA:

y2 <- factor(x2, levels = month_levels)
y2
## [1] Dec  Apr  <NA> Mar 
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
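Because the conversion is silent, it can help to look for unexpected values before building the factor. A minimal base R sketch (the name bad is only for illustration):

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values present in the data that are not valid levels
bad <- setdiff(x2, month_levels)
bad
## [1] "Jam"

# Positions that would become NA in factor(x2, levels = month_levels)
bad_pos <- which(!x2 %in% month_levels)
bad_pos
## [1] 3
```

If you would rather get a warning during conversion, readr::parse_factor() is another option.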

If you omit the levels, they are taken from the data in alphabetical order:

factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
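Sometimes you would prefer the levels to follow the order of first appearance in the data. One way to sketch this in base R is to pass unique(x1) as the levels (forcats also provides fct_inorder() for the same purpose):

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Levels follow the order in which values first appear
f1 <- factor(x1, levels = unique(x1))
levels(f1)
## [1] "Dec" "Apr" "Jan" "Mar"
```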

Retrieve the set of valid levels directly:

levels(y1)
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

General Social Survey

Long-running US survey conducted by the independent research organization NORC at the University of Chicago.

forcats::gss_cat

How many levels?

library(dplyr)    # provides %>% and count()
library(ggplot2)  # provides ggplot()
gss_cat %>%
  count(race)
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)

Two most common operations with factors

  1. Changing the order of the levels
  2. Changing the values of the levels

Modifying factor order

Average number of hours spent watching TV per day across religions:

relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()

Compare with the same plot after reordering relig by tvhours:

ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()

fct_reorder() takes three arguments (named .f, .x, and .fun in forcats):

  1. f, the factor whose levels you want to modify.
  2. x, a numeric vector that you want to use to reorder the levels.
  3. fun (optional), a function used if there are multiple values of x for each value of f. The default is median.
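A small toy example can make the role of the summary function concrete. The vectors f and x below are made up for illustration; the only forcats call used is fct_reorder() with its .fun argument:

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(10, 12, 1, 3, 5, 100)

# Default summary is the median: a = 11, b = 2, c = 52.5
by_median <- levels(fct_reorder(f, x))
by_median
## [1] "b" "a" "c"

# Using the minimum instead: a = 10, b = 1, c = 5
by_min <- levels(fct_reorder(f, x, .fun = min))
by_min
## [1] "b" "c" "a"
```

The summary function matters whenever a level has more than one x value, as level c does here with one large outlier.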

Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.

Rewrite using mutate():

relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()

Reordering isn’t always useful. Income already has a principled order, so reordering it by average age scrambles that order:

rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()


It does make sense to pull “Not applicable” to the front with the other special levels:

ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

Why do you think the average age for “Not applicable” is so high?

Other useful reordering functions: fct_reorder2(), fct_infreq(), fct_rev().
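As a rough illustration of two of these, fct_infreq() orders levels by decreasing frequency and fct_rev() reverses the level order. The factor f below is a made-up example:

```r
library(forcats)

f <- factor(c("b", "a", "c", "c", "b", "c"))  # counts: a = 1, b = 2, c = 3

# Most frequent level first
by_freq <- levels(fct_infreq(f))
by_freq
## [1] "c" "b" "a"

# Reverse that order, so the most frequent level comes last
by_freq_rev <- levels(fct_rev(fct_infreq(f)))
by_freq_rev
## [1] "a" "b" "c"
```

Combining the two is handy for bar charts where you want bars in increasing frequency.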

Modifying factor levels

gss_cat %>% count(partyid)

The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction.

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)

Combine groups:

gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)

Collapse a lot of levels:

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)

Lump small groups together to make a plot or table simpler:

gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
## # A tibble: 10 x 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Other                     458
##  6 Jewish                    388
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95
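To see what lumping does on a small scale, here is a sketch with a made-up factor. With n = 2, the two most common levels are kept and everything else is merged into "Other":

```r
library(forcats)

f <- factor(c("a", "a", "a", "b", "b", "c", "d"))  # counts: a = 3, b = 2, c = 1, d = 1

lumped <- fct_lump(f, n = 2)
levels(lumped)
## [1] "a"     "b"     "Other"
as.character(lumped)
## [1] "a"     "a"     "a"     "b"     "b"     "Other" "Other"
```

Newer forcats releases also offer fct_lump_n(), which covers the n = case on its own.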

Dates and times

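Dates and times are much more complicated than they seem: not every year has 365 days, not every day has 24 hours, and not every minute has 60 seconds.

Geopolitics matters too: time zones and daylight saving rules are set by governments and change over time.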

Package lubridate

  • Makes it easier to work with dates and times in R.
  • Not part of the core tidyverse in versions before tidyverse 2.0.0, so load it explicitly.
library(lubridate)
library(hms)

Creating date/times

There are three types of date/time data that refer to an instant in time:

  • Date (<date>)

  • Time within a day (<time>)

  • Date-time is a date plus a time (<dttm>): uniquely identifies an instant in time (typically to the nearest second).

today()  # current date
## [1] "2020-09-04"
now()    # current date-time
## [1] "2020-09-04 17:13:24 KST"

Date/time from strings

  1. readr::parse_date() from Lecture 5

  2. lubridate:

    ymd("2017-01-31")
    ## [1] "2017-01-31"
    mdy("January 31st, 2017")
    ## [1] "2017-01-31"
    dmy("31-Jan-2017")
    ## [1] "2017-01-31"
    ymd(20170131)
    ## [1] "2017-01-31"
    ymd_hms("2017-01-31 20:11:59")
    ## [1] "2017-01-31 20:11:59 UTC"
    mdy_hm("01/31/2017 08:01")
    ## [1] "2017-01-31 08:01:00 UTC"
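Note that ymd() and friends return dates, while the _hms and _hm variants return date-times. As a small sketch, supplying a time zone to ymd() should also give a date-time:

```r
library(lubridate)

d  <- ymd("2017-01-31")              # a Date
dt <- ymd("2017-01-31", tz = "UTC")  # a date-time (POSIXct) at midnight UTC

class(d)
## [1] "Date"
inherits(dt, "POSIXct")
## [1] TRUE
```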

From individual components

Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:

library(nycflights13)
flights %>% 
  select(year, month, day, hour, minute) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))

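The other time columns store clock times as integers in HHMM form (for example, 517 means 5:17), so use integer division and modulo to pull out the hour and minute: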

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
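The arithmetic inside make_datetime_100() can be checked on a few values. This sketch assumes times are stored as HHMM integers, as in the flights data; the vector time is made up:

```r
time <- c(517, 1545, 5)  # 5:17, 15:45, 0:05

# Integer division by 100 drops the last two digits: the hour
hours <- time %/% 100
hours
## [1]  5 15  0

# The remainder after dividing by 100 keeps the last two digits: the minute
minutes <- time %% 100
minutes
## [1] 17 45  5
```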

Distribution of departure times across the year:

flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

Or within a single day:

flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes

From other types

Switch between a date-time and a date:

as_datetime(today())
## [1] "2020-09-04 UTC"
as_date(now())
## [1] "2020-09-04"
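Date/times sometimes arrive as numeric offsets from the Unix epoch (1970-01-01). A brief sketch: as_datetime() treats a number as seconds since the epoch, and as_date() treats it as days.

```r
library(lubridate)

# 10 hours' worth of seconds after the epoch
dt <- as_datetime(60 * 60 * 10)
dt
## [1] "1970-01-01 10:00:00 UTC"

# 10 years' worth of days (including 2 leap days) after the epoch
d <- as_date(365 * 10 + 2)
d
## [1] "1980-01-01"
```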

Date-time components

Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.

Getting components

Accessor functions: year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().
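For a quick look at what these accessors return, here is a sketch on a single fixed date-time. week_start = 7 is passed explicitly so that Sunday counts as day 1 regardless of local options:

```r
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")  # a Friday

year(datetime)
## [1] 2016
month(datetime)
## [1] 7
mday(datetime)
## [1] 8
yday(datetime)
## [1] 190
wday(datetime, week_start = 7)
## [1] 6
```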

Do more flights depart during the week than on the weekend?

flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()

Average departure delay by minute within the hour in actual departure time:

flights_dt %>% 
  mutate(minute = minute(dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n()) %>% 
  ggplot(aes(minute, avg_delay)) +
    geom_line()
## `summarise()` ungrouping output (override with `.groups` argument)

It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!

Average departure delay by minute within the hour in scheduled departure time:

sched_dep <- flights_dt %>% 
  mutate(minute = minute(sched_dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE),
    n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()

We don’t see such a strong pattern.

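So why do we see that pattern with the actual departure times? Scheduled departures are strongly biased toward “nice” minutes, as the number of flights per scheduled minute shows: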

ggplot(sched_dep, aes(minute, n)) +
  geom_line()

Rounding

Number of flights per week:

flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

Computing the difference between a rounded and unrounded date can be particularly useful.
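For instance, subtracting the floor of a date-time at the day level gives the time elapsed since midnight. A minimal sketch on one made-up departure time:

```r
library(lubridate)

dep <- ymd_hms("2013-01-01 05:17:00")

# Time elapsed since the start of the day
since_midnight <- dep - floor_date(dep, "day")
mins <- as.numeric(since_midnight, units = "mins")
mins
## [1] 317
```

round_date() and ceiling_date() work the same way if you want to round to the nearest unit or round up.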

Setting components

The same accessor functions can be used on the left-hand side of an assignment to set a component:

(datetime <- ymd_hms("2016-07-08 12:34:56"))
## [1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"

Alternatively, use update():

update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"
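One thing to keep in mind with update(): values that are too big roll over into the next unit. A short sketch:

```r
library(lubridate)

# February 2015 has 28 days, so day 30 rolls over into March
rolled_day <- update(ymd("2015-02-01"), mday = 30)
rolled_day
## [1] "2015-03-02"

# 400 hours is 16 days and 16 hours
rolled_hour <- update(ymd("2015-02-01"), hour = 400)
rolled_hour
## [1] "2015-02-17 16:00:00 UTC"
```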

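Setting larger components to a constant lets you explore patterns in the smaller ones. Here yday = 1 moves every departure to the same date, showing the distribution of flights across the course of the day for every day of the year: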

flights_dt %>% 
  mutate(dep_hour = update(dep_time, yday = 1)) %>% 
  ggplot(aes(dep_hour)) +
    geom_freqpoly(binwidth = 300)

Time arithmetic

# How old is Hadley?
h_age <- today() - ymd(19791014)
h_age
## Time difference of 14936 days

Durations

as.duration(h_age)
## [1] "1290470400s (~40.89 years)"

Constructing durations: dseconds(), dminutes(), dhours(), etc.

Arithmetic:

2 * dyears(1)
## [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38869200s (~1.23 years)"
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)

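Sometimes a day is not 24 hours. Durations count exact seconds, so adding one across a daylight saving time change shifts the clock time: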

one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"

Periods

To resolve this problem, use periods:

one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
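Leap years show the same contrast between durations and periods. In this sketch, a duration of 365 days falls one day short of a full leap year, while a period of one year lands on the same calendar date:

```r
library(lubridate)

start <- ymd_hms("2016-01-01 00:00:00")  # 2016 is a leap year

# Duration: exactly 365 * 86400 seconds later
by_duration <- start + ddays(365)
by_duration
## [1] "2016-12-31 UTC"

# Period: same calendar date, one year later
by_period <- start + years(1)
by_period
## [1] "2017-01-01 UTC"
```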

Time travel?

flights_dt %>% 
  filter(arr_time < dep_time) %>%
    select(origin, dest, dep_time, arr_time)

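These are overnight flights. We used the same date for both the departure and the arrival times, but these flights arrived on the following day, so add one day to the arrival time of each overnight flight: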

flights_dt <- flights_dt %>% 
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1),  # why * 1?
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )

Now all of our flights obey the laws of physics.

flights_dt %>% 
  filter(overnight, arr_time < dep_time) 

Intervals

dyears(1) / ddays(365) # this is precisely defined
## [1] 1.000685
years(1) / days(1)
## [1] 365.25

An interval is a duration with a starting point:

next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365
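Because an interval has fixed endpoints, its length in days is exact. A small sketch comparing a leap year with a common year (the names leap and common are only for illustration):

```r
library(lubridate)

leap   <- ymd("2016-01-01") %--% ymd("2017-01-01")
common <- ymd("2015-01-01") %--% ymd("2016-01-01")

days_leap <- leap / ddays(1)
days_leap
## [1] 366

days_common <- common / ddays(1)
days_common
## [1] 365
```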

Time zones

Find your current time zone with Sys.timezone():

Sys.timezone()
## [1] "Asia/Seoul"
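Time zone names follow the IANA "continent/city" convention. Base R's OlsonNames() lists the names known to your system, which gives a quick way to check a name before using it. The exact list depends on the installed time zone database, so no count is shown here:

```r
# All time zone names known to this R installation
tz_names <- OlsonNames()
head(tz_names)

# Check that a name is valid before passing it as tz =
seoul_ok <- "Asia/Seoul" %in% tz_names
seoul_ok
## [1] TRUE
```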

In R, the time zone is an attribute of the date-time that only controls printing.

(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
## [1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
## [1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
## [1] "2015-06-02 04:00:00 NZST"

These three date-times represent the same instant in time:

x1 - x2
## Time difference of 0 secs
x1 - x3
## Time difference of 0 secs

UTC

Unless otherwise specified, lubridate always uses UTC (Coordinated Universal Time), the standard time zone used by the scientific community:

(x4 <- ymd_hms(now()))
## [1] "2020-09-04 17:13:30 UTC"

Changing the time zone

  1. Keep the instant in time the same, and change how it’s displayed.

    x4a <- with_tz(x4, tzone = "Asia/Shanghai")
    x4a
    ## [1] "2020-09-05 01:13:30 CST"
    x4a - x4
    ## Time difference of 0 secs
  2. Change the underlying instant in time (when you have an instant that has been labelled with the incorrect time zone, and you need to fix it).

    x4b <- force_tz(x4, tzone = "Asia/Shanghai")
    x4b
    ## [1] "2020-09-04 17:13:30 CST"
    x4b - x4
    ## Time difference of -8 hours