We only cover predictive models, and use models as a tool for exploration.
Each observation can either be used for exploration or confirmation, not both.
You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration.
To confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis.
You should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.
60% of your data goes into a training (or exploration) set. You’re allowed to do anything you like with this data: visualise it and fit tons of models to it.
20% goes into a query (or validation) set. You can use this data to compare models or visualisations by hand, but you’re not allowed to use it as part of an automated process.
20% is held back for a test set. You can only use this data ONCE, to test your final model.
All models are wrong, but some are useful.
- George E. P. Box
Two parts to a model:
Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.
For such a model there is no need to ask the question “Is the model true?”. If “truth” is to be the “whole truth” the answer must be “No”. The only question of interest is “Is the model illuminating and useful?”.
- George E. P. Box
The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.
library(tidyverse)
library(modelr)
options(na.action = na.warn)
ggplot(sim1, aes(x, y)) + # simulated dataset, included with the modelr package
geom_point()
Do you see a strong pattern?
250 random linear models:
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
geom_point()
Finding a “good” line:
Turn our model family into an R function:
model1 <- function(a, data) {
a[1] + data$x * a[2]
}
model1(c(7, 1.5), sim1)
## [1] 8.5 8.5 8.5 10.0 10.0 10.0 11.5 11.5 11.5 13.0 13.0 13.0 14.5 14.5 14.5
## [16] 16.0 16.0 16.0 17.5 17.5 17.5 19.0 19.0 19.0 20.5 20.5 20.5 22.0 22.0 22.0
Distance from a model to the data:
measure_distance <- function(mod, data) {
diff <- data$y - model1(mod, data)
sqrt(mean(diff ^ 2))
}
measure_distance(c(7, 1.5), sim1)
## [1] 2.665212
Compute the distances:
sim1_dist <- function(a1, a2) { # helper function. why do we need this?
measure_distance(c(a1, a2), sim1)
}
models <- models %>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
10 best models:
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, colour = -dist),
data = filter(models, rank(dist) <= 10)
)
Visualise the models in the \(a_1-a_2\) plane (model space):
ggplot(models, aes(a1, a2)) +
geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))
More systematic way: grid search
grid <- expand.grid(
a1 = seq(-5, 20, length = 25),
a2 = seq(1, 3, length = 25)
) %>%
mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
grid %>%
ggplot(aes(a1, a2)) +
geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
geom_point(aes(colour = -dist))
10 best models from the grid:
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(
aes(intercept = a1, slope = a2, colour = -dist),
data = filter(grid, rank(dist) <= 10)
)
Imagine iteratively making the grid finer and finer.
Newton-Raphson search: pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can’t go any lower.
best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
## [1] 4.222248 2.051204
ggplot(sim1, aes(x, y)) +
geom_point(size = 2, colour = "grey30") +
geom_abline(intercept = best$par[1], slope = best$par[2])
lm()
Special tool designed for fitting linear models.
Formulas: lm()
’s special way to specify the model family - y ~ x
translates to a function like y = a_1 + a_2 * x
sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
## (Intercept) x
## 4.220822 2.051533
These are exactly the same values we got with optim()
!
grid <- sim1 %>%
data_grid(x) # evenly space grid the covers the data region
grid
Add predictions:
grid <- grid %>%
modelr::add_predictions(sim1_mod)
grid
Plot them:
ggplot(sim1, aes(x)) +
geom_point(aes(y = y)) +
geom_line(aes(y = pred), data = grid, colour = "red", size = 1)
Distances between the observed and predicted values.
Tell you what the model has missed.
Add residuals to the original dataset (why?):
sim1 <- sim1 %>%
add_residuals(sim1_mod)
sim1
Distribution of the residuals:
ggplot(sim1, aes(resid)) +
geom_freqpoly(binwidth = 0.5)
How far away are the predictions from the observed values?
Typical residual plot:
sim1 %>%
ggplot(aes(x, resid)) +
geom_ref_line(h = 0) +
geom_point()
This looks like random noise, suggesting that our model has done a good job of capturing the patterns in the dataset.
What R actually does with formulae:
df <- tribble(
~y, ~x1, ~x2,
4, 2, 5,
5, 1, 6
)
model_matrix(df, y ~ x1)
If you don’t want the intercept:
model_matrix(df, y ~ x1 - 1)
More than one variable:
model_matrix(df, y ~ x1 + x2)
This formula notation was initially described in Symbolic Description of Factorial Models for Analysis of Variance, by G. N. Wilkinson and C. E. Rogers https://www.jstor.org/stable/2346786.
Suppose you have a formula like y ~ sex
, where sex could either be male or female.
It doesn’t make sense to convert that to a formula like y = a_0 + a_1 * sex
because sex
isn’t a number - you can’t multiply it!
df <- tribble(
~ sex, ~ response,
"male", 1,
"female", 2,
"male", 1
)
model_matrix(df, response ~ sex)
Why R also doesn’t create a sexfemale
column?
The sim2
dataset from modelr:
sim2
ggplot(sim2) +
geom_point(aes(x, y))
Fit a model:
mod2 <- lm(y ~ x, data = sim2)
grid <- sim2 %>%
data_grid(x) %>%
add_predictions(mod2)
grid
A linear model with a categorical x
will predict the mean value for each category (why?):
ggplot(sim2, aes(x)) +
geom_point(aes(y = y)) +
geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
You can’t make predictions about levels that you didn’t observe:
tibble(x = "e") %>%
add_predictions(mod2)
## Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor x has new level e
What happens when you combine a continuous and a categorical variable?
sim3
ggplot(sim3, aes(x1, y)) +
geom_point(aes(colour = x2))
Two possible models: you could fit to this data:
mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)
y ~ x1 * x2
translates to y = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2
.*
, both the interaction and the individual components are included in the model.grid <- sim3 %>%
data_grid(x1, x2) %>% # note two predictors
gather_predictions(mod1, mod2) # add predicitions from two models
grid
Use facetting:
ggplot(sim3, aes(x1, y, colour = x2)) +
geom_point() +
geom_line(data = grid, aes(y = pred)) +
facet_wrap(~ model)
+
has the same slope for each line, but different intercepts.*
has a different slope and intercept for each line.sim3 <- sim3 %>%
gather_residuals(mod1, mod2)
ggplot(sim3, aes(x1, resid, colour = x2)) +
geom_point() +
facet_grid(model ~ x2)
Which model do you prefer?
sim4
Initially things proceed almost identically to the previous example:
mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)
grid <- sim4 %>%
data_grid(
x1 = seq_range(x1, 5), # regularly spaced grid b/w min and max
x2 = seq_range(x2, 5)
) %>%
gather_predictions(mod1, mod2)
grid
We can imagine the model like a 3d surface:
ggplot(grid, aes(x1, x2)) +
geom_tile(aes(fill = pred)) +
facet_wrap(~ model)
Slice views:
ggplot(grid, aes(x1, pred, colour = x2, group = x2)) +
geom_line() +
facet_wrap(~ model)
ggplot(grid, aes(x2, pred, colour = x1, group = x1)) +
geom_line() +
facet_wrap(~ model)
Visual model comparison is not as easy as for a categorical and continuous variable.
log(y) ~ sqrt(x1) + x2
translates to log(y) = a_1 + a_2 * sqrt(x1) + a_3 * x2
.+
, *
, ^
, or -
, you’ll need to wrap it in I()
:
y ~ x + I(x ^ 2)
translates to y = a_1 + a_2 * x + a_3 * x^2
.y ~ x ^ 2 + x
translates to y ~ x * x + x
, which in turn translates to y = a_1 + a_2 * x
.Model matrix:
df <- tribble(
~y, ~x,
1, 1,
2, 2,
3, 3
)
model_matrix(df, y ~ I(x^2) + x)
Fitting equations like y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x ^ 3
:
model_matrix(df, y ~ poly(x, 2))
Outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. Splines stitch low-degree polynomials seamlessly:
sim5 <- tibble(
x = seq(0, 3.5 * pi, length = 50),
y = 4 * sin(x) + rnorm(length(x))
)
ggplot(sim5, aes(x, y)) +
geom_point()
library(splines)
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)
grid <- sim5 %>%
data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>%
gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")
ggplot(sim5, aes(x, y)) +
geom_point() +
geom_line(data = grid, colour = "red") +
facet_wrap(~ model)
The extrapolation outside the range of the data is clearly bad. The model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.
R silently drop missing values, but options(na.action = na.warn)
(run in the prerequisites), makes sure you get a warning.
df <- tribble(
~x, ~y,
1, 2.2,
2, NA,
3, 3.5,
4, 8.3,
NA, 10
)
mod <- lm(y ~ x, data = df)
## Warning: Dropping 2 rows with missing values
To suppress the warning, set na.action = na.exclude
:
mod <- lm(y ~ x, data = df, na.action = na.exclude)
See exactly how many observations were used:
nobs(mod)
## [1] 3
Generalised linear models, e.g. stats::glm()
. extend linear models to include non-continuous responses (e.g. binary data or counts).
Generalised additive models, e.g. mgcv::gam()
, extend generalised linear models to incorporate arbitrary smooth functions.
Penalised linear models, e.g. glmnet::glmnet()
, add a penalty term to the distance that penalises complex models This tends to make models that generalise better to new datasets from the same population.
Robust linear models, e.g. MASS:rlm()
, tweak the distance to downweight points that are very far away. This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.
Trees, e.g. rpart::rpart()
, fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()
) or gradient boosting machines (e.g. xgboost::xgboost
.)
These models all work similarly from a programming perspective. Once you’ve mastered linear models, you should find it easy to master the mechanics of these other model classes.