In R, factors are used to work with categorical variables, variables that have a fixed and known set of possible values.
The forcats package provides tools for dealing with categorical variables:
library(forcats)
Suppose you have a variable that records month:
x1 <- c("Dec", "Apr", "Jan", "Mar")
Two problems:
There are only twelve possible months, and there’s nothing saving you from typos:
x2 <- c("Dec", "Apr", "Jam", "Mar")
It doesn’t sort in a useful way:
sort(x1)
## [1] "Apr" "Dec" "Jan" "Mar"
To fix these, use factors:
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
## [1] Dec Apr Jan Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
## [1] Jan Mar Apr Dec
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Any values not in the set will be silently converted to NA:
y2 <- factor(x2, levels = month_levels)
y2
## [1] Dec Apr <NA> Mar
## Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
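Because the conversion is silent, it can help to check explicitly which input values were dropped. The following is a minimal base-R sketch that reuses x2 and month_levels from above; the name bad_values is only illustrative.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)

# Values that were present in the input but became NA in the factor
bad_values <- x2[is.na(y2) & !is.na(x2)]
bad_values
## [1] "Jam"
```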
If you omit the levels, they are taken from the data in alphabetical order:
factor(x1)
## [1] Dec Apr Jan Mar
## Levels: Apr Dec Jan Mar
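If the order of first appearance is wanted instead of alphabetical order, one option is to build the levels with unique(), and another is forcats::fct_inorder(). A small sketch, assuming the same x1 as above (f1 and f2 are illustrative names):

```r
library(forcats)

x1 <- c("Dec", "Apr", "Jan", "Mar")

# Base R: take the levels in the order in which they first appear
f1 <- factor(x1, levels = unique(x1))
levels(f1)
## [1] "Dec" "Apr" "Jan" "Mar"

# forcats: reorder the levels of an existing factor by first appearance
f2 <- fct_inorder(factor(x1))
levels(f2)
## [1] "Dec" "Apr" "Jan" "Mar"
```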
Retrieve the set of valid levels directly:
levels(y1)
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
Average number of hours spent watching TV per day across religions:
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(relig_summary, aes(tvhours, relig)) + geom_point()
vs
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
  geom_point()
fct_reorder(f, x, fun) takes three arguments:
1. f, the factor whose levels you want to modify.
2. x, a numeric vector that you want to use to reorder the levels.
3. fun (optional), a function used if there are multiple values of x for each value of f.
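To see what the optional function does, here is a toy sketch with two values of x per level. The data are made up for illustration. Note that in forcats the arguments are actually named .f, .x and .fun, and the default summary function is the median.

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 100, 5, 6, 2, 3)

# Default: levels ordered by the median of x within each level
# (medians: a = 50.5, b = 5.5, c = 2.5)
by_median <- fct_reorder(f, x)
levels(by_median)
## [1] "c" "b" "a"

# Ordered by the minimum of x within each level instead
# (minima: a = 1, b = 5, c = 2)
by_min <- fct_reorder(f, x, .fun = min)
levels(by_min)
## [1] "a" "c" "b"
```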
Reordering religion makes it much easier to see that people in the “Don’t know” category watch much more TV, and Hinduism & Other Eastern religions watch much less.
Rewrite using mutate():
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
  geom_point()
Reordering isn’t always useful:
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
Pull “Not applicable” up front:
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()
Why do you think the average age for “Not applicable” is so high?
Other useful reordering functions: fct_reorder(), fct_reorder2(), fct_infreq(), fct_rev().
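As a quick illustration of two of these, the sketch below orders the levels of a small made-up factor by frequency with fct_infreq() and then reverses that order with fct_rev().

```r
library(forcats)

f <- factor(c("b", "a", "c", "c", "c", "a"))  # counts: a = 2, b = 1, c = 3

# Most frequent level first
by_freq <- fct_infreq(f)
levels(by_freq)
## [1] "c" "a" "b"

# Reverse the level order, so the least frequent level comes first
by_freq_rev <- fct_rev(by_freq)
levels(by_freq_rev)
## [1] "b" "a" "c"
```

In a bar chart, the combination fct_rev(fct_infreq(f)) gives bars in increasing order of frequency.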
gss_cat %>% count(partyid)
The levels are terse and inconsistent. Let’s tweak them to be longer and use a parallel construction.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
To combine groups, assign several old levels to the same new level:
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
To collapse many levels at once, use fct_collapse():
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
Lump small groups together to make a plot or table simpler:
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
## # A tibble: 10 x 2
##    relig                       n
##    <fct>                   <int>
##  1 Protestant              10846
##  2 Catholic                 5124
##  3 None                     3523
##  4 Christian                 689
##  5 Other                     458
##  6 Jewish                    388
##  7 Buddhism                  147
##  8 Inter-nondenominational   109
##  9 Moslem/islam              104
## 10 Orthodox-christian         95
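The effect of the n argument is easier to see on a tiny example. The sketch below uses a made-up factor with four levels and keeps only the two most frequent ones; everything else is gathered into "Other".

```r
library(forcats)

# Counts: a = 5, b = 3, c = 1, d = 1
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep the 2 most frequent levels and lump the rest together
lumped <- fct_lump(f, n = 2)
levels(lumped)
## [1] "a"     "b"     "Other"

# Number of observations per level after lumping
as.vector(table(lumped))
## [1] 5 3 2
```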
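Dates and times are much more complicated than they seem: not every year has 365 days, and not every day has 24 hours.
Geopolitics also plays a role, since time zones and daylight saving time are set by governments.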
library(lubridate)
library(hms)
Three data types:
Date (<date>)
Time within a day (<time>)
Date-time, a date plus a time (<dttm>): uniquely identifies an instant in time (typically to the nearest second).
today() # current date
## [1] "2020-09-04"
now() # current date-time
## [1] "2020-09-04 17:13:24 KST"
To parse strings, use readr::parse_date() from Lecture 5, or the lubridate helpers:
ymd("2017-01-31")
## [1] "2017-01-31"
mdy("January 31st, 2017")
## [1] "2017-01-31"
dmy("31-Jan-2017")
## [1] "2017-01-31"
ymd(20170131)
## [1] "2017-01-31"
ymd_hms("2017-01-31 20:11:59")
## [1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
## [1] "2017-01-31 08:01:00 UTC"
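The date helpers return a Date by default. Supplying a time zone makes them return a date-time instead, which can be handy when a column has to be combined with other date-times. A brief sketch (d and dt are illustrative names):

```r
library(lubridate)

d <- ymd("2017-01-31")               # a Date
dt <- ymd("2017-01-31", tz = "UTC")  # a date-time at midnight UTC

class(d)
## [1] "Date"
class(dt)
## [1] "POSIXct" "POSIXt"
```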
Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns. This is what we have in the flights data:
library(nycflights13)
flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
Convert other times:
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
  filter(!is.na(dep_time), !is.na(arr_time)) %>%
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>%
  select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
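The helper make_datetime_100() works because the flights data store times as numbers of the form HHMM. Integer division by 100 recovers the hour and the remainder recovers the minute. A minimal sketch with a single made-up value:

```r
library(lubridate)

time <- 517    # 5:17 am stored as the number 517

time %/% 100   # integer division gives the hour
## [1] 5
time %% 100    # the remainder gives the minute
## [1] 17

dep <- make_datetime(2013, 1, 1, time %/% 100, time %% 100)
dep
## [1] "2013-01-01 05:17:00 UTC"
```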
Distribution of departure times across the year:
flights_dt %>%
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
Or within a single day:
flights_dt %>%
  filter(dep_time < ymd(20130102)) %>%
  ggplot(aes(dep_time)) +
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
Switch between a date-time and a date:
as_datetime(today())
## [1] "2020-09-04 UTC"
as_date(now())
## [1] "2020-09-04"
Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.
Accessor functions: year(), month(), mday() (day of the month), yday() (day of the year), wday() (day of the week), hour(), minute(), and second().
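Applied to a single fixed date-time, the accessors look like this. The value is arbitrary, and the wday() result assumes lubridate's default setting in which the week starts on Sunday.

```r
library(lubridate)

dt <- ymd_hms("2016-07-08 12:34:56")

year(dt)
## [1] 2016
month(dt)
## [1] 7
mday(dt)   # day of the month
## [1] 8
yday(dt)   # day of the year (2016 is a leap year)
## [1] 190
wday(dt)   # day of the week; 1 = Sunday by default, so 6 = Friday
## [1] 6

# month() and wday() can also return labelled factors
wday(dt, label = TRUE, abbr = FALSE)
```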
Do more flights depart during the week than on the weekend?
flights_dt %>%
  mutate(wday = wday(dep_time, label = TRUE)) %>%
  ggplot(aes(x = wday)) +
  geom_bar()
Average departure delay by minute within the hour in actual departure time:
flights_dt %>%
  mutate(minute = minute(dep_time)) %>%
  group_by(minute) %>%
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE), # departure delay, as in the caption
    n = n()
  ) %>%
  ggplot(aes(minute, avg_delay)) +
  geom_line()
## `summarise()` ungrouping output (override with `.groups` argument)
It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
Average departure delay by minute within the hour in scheduled departure time:
sched_dep <- flights_dt %>%
  mutate(minute = minute(sched_dep_time)) %>%
  group_by(minute) %>%
  summarise(
    avg_delay = mean(dep_delay, na.rm = TRUE), # departure delay, as in the caption
    n = n()
  )
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()
We don’t see such a strong pattern.
So why do we see that pattern with the actual departure times?
ggplot(sched_dep, aes(minute, n)) +
  geom_line()
Number of flights per week:
flights_dt %>%
  count(week = floor_date(dep_time, "week")) %>%
  ggplot(aes(week, n)) +
  geom_line()
Computing the difference between a rounded and unrounded date can be particularly useful.
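For example, subtracting a date-time rounded down to the day from the original gives the time elapsed since midnight. A minimal sketch on one arbitrary value (midnight and secs_since_midnight are illustrative names):

```r
library(lubridate)

dt <- ymd_hms("2016-07-08 12:34:56")

# Round down to the start of the day
midnight <- floor_date(dt, "day")
midnight
## [1] "2016-07-08 UTC"

# Seconds elapsed since midnight: 12 * 3600 + 34 * 60 + 56
secs_since_midnight <- as.numeric(difftime(dt, midnight, units = "secs"))
secs_since_midnight
## [1] 45296
```

The same idea with a different unit, such as "week" or "month", gives the position of each observation within that period.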
Each accessor function can also be used to set a component of a date-time:
(datetime <- ymd_hms("2016-07-08 12:34:56"))
## [1] "2016-07-08 12:34:56 UTC"
year(datetime) <- 2020
datetime
## [1] "2020-07-08 12:34:56 UTC"
Alternatively, use update():
update(datetime, year = 2020, month = 2, mday = 2, hour = 2)
## [1] "2020-02-02 02:34:56 UTC"
Distribution of flights across the course of the day for every day of the year:
flights_dt %>%
  mutate(dep_hour = update(dep_time, yday = 1)) %>%
  ggplot(aes(dep_hour)) +
  geom_freqpoly(binwidth = 300)
# How old is Hadley?
h_age <- today() - ymd(19791014)
h_age
## Time difference of 14936 days
as.duration(h_age)
## [1] "1290470400s (~40.89 years)"
Constructing durations: dseconds(), dminutes(), dhours(), etc.
Arithmetic:
2 * dyears(1)
## [1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(15)
## [1] "38869200s (~1.23 years)"
tomorrow <- today() + ddays(1)
last_year <- today() - dyears(1)
Sometimes a day is not 24 hours:
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
## [1] "2016-03-13 14:00:00 EDT"
To resolve this problem, use periods:
one_pm
## [1] "2016-03-12 13:00:00 EST"
one_pm + days(1)
## [1] "2016-03-13 13:00:00 EDT"
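The same distinction shows up with leap years: a duration is a fixed number of seconds, while a period moves the calendar by "human" units. A small sketch using an arbitrary start date in a leap year:

```r
library(lubridate)

start <- ymd("2016-01-01")   # 2016 has 366 days

# Duration: exactly 365 * 86400 seconds later, which is still in 2016
plus_duration <- start + ddays(365)
as_date(plus_duration)
## [1] "2016-12-31"

# Period: same calendar date, one year later
plus_period <- start + years(1)
plus_period
## [1] "2017-01-01"
```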
Time travel?
flights_dt %>%
  filter(arr_time < dep_time) %>%
  select(origin, dest, dep_time, arr_time)
These are overnight flights.
flights_dt <- flights_dt %>%
  mutate(
    overnight = arr_time < dep_time,
    arr_time = arr_time + days(overnight * 1), # why * 1?
    sched_arr_time = sched_arr_time + days(overnight * 1)
  )
Now all of our flights obey the laws of physics.
flights_dt %>%
  filter(overnight, arr_time < dep_time)
dyears(1) / ddays(365) # this is precisely defined
## [1] 1.000685
years(1) / days(1) # only an estimate: a year can have 365 or 366 days
## [1] 365.25
An interval is a duration with a starting point:
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
## [1] 365
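Because an interval has fixed endpoints, its length is exact, so it can distinguish a leap year from a common year. A short sketch with fixed dates (leap and common are illustrative names):

```r
library(lubridate)

# One calendar year that contains 29 February
leap <- ymd("2016-01-01") %--% ymd("2017-01-01")
leap / ddays(1)
## [1] 366

# One calendar year that does not
common <- ymd("2017-01-01") %--% ymd("2018-01-01")
common / ddays(1)
## [1] 365
```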
Find your current time zone with Sys.timezone():
Sys.timezone()
## [1] "Asia/Seoul"
In R, the time zone is an attribute of the date-time that only controls printing.
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
## [1] "2015-06-01 12:00:00 EDT"
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
## [1] "2015-06-01 18:00:00 CEST"
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
## [1] "2015-06-02 04:00:00 NZST"
All three represent the same instant in time:
x1 - x2
## Time difference of 0 secs
x1 - x3
## Time difference of 0 secs
Unless otherwise specified, lubridate always uses UTC (Coordinated Universal Time), the standard time zone used by the scientific community:
(x4 <- ymd_hms(now()))
## [1] "2020-09-04 17:13:30 UTC"
Keep the instant in time the same, and change how it’s displayed.
x4a <- with_tz(x4, tzone = "Asia/Shanghai")
x4a
## [1] "2020-09-05 01:13:30 CST"
x4a - x4
## Time difference of 0 secs
Change the underlying instant in time (when you have an instant that has been labelled with the incorrect time zone, and you need to fix it).
x4b <- force_tz(x4, tzone = "Asia/Shanghai")
x4b
## [1] "2020-09-04 17:13:30 CST"
x4b - x4
## Time difference of -8 hours