Model

We cover only predictive models, and we use them as a tool for exploration.

Hypothesis generation vs. hypothesis confirmation

  1. Each observation can either be used for exploration or confirmation, not both.

  2. You can use an observation as many times as you like for exploration, but you can only use it once for confirmation. As soon as you use an observation twice, you’ve switched from confirmation to exploration.

  • To confirm a hypothesis you must use data independent of the data that you used to generate the hypothesis.

  • You should never sell an exploratory analysis as a confirmatory analysis because it is fundamentally misleading.

A rule of thumb (one way to implement the split is sketched after this list):

  1. 60% of your data goes into a training (or exploration) set. You’re allowed to do anything you like with this data: visualise it and fit tons of models to it.

  2. 20% goes into a query (or validation) set. You can use this data to compare models or visualisations by hand, but you’re not allowed to use it as part of an automated process.

  3. 20% is held back for a test set. You can only use this data ONCE, to test your final model.
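A minimal sketch of one way to make such a split, assuming your data lives in a data frame df (the name is hypothetical; any rectangular dataset works):

set.seed(1014)   # make the random assignment reproducible
n <- nrow(df)
group <- sample(
  c("train", "query", "test"), size = n, replace = TRUE,
  prob = c(0.6, 0.2, 0.2)   # proportions are approximate, not exact counts
)
train <- df[group == "train", ]
query <- df[group == "query", ]
test  <- df[group == "test", ]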

Even when doing confirmatory modelling, you will still need to do EDA.

Model basics

All models are wrong, but some are useful.

  • George E. P. Box

Two parts to a model:

  1. Family of models: express a precise, but generic, pattern that you want to capture.
    • straight line: \(y = a_1 x + a_2\)
    • power curve: \(y = a_1 x^{a_2}\)
    • \(x\), \(y\): known variables from your data
    • \(a_1\), \(a_2\): parameters that can vary to capture different patterns.
  2. Fitted model: the model from the family that is the closest to your data, e.g., \(y = 3 x + 7\) or \(y = 9 x^2\).

Now it would be very remarkable if any system existing in the real world could be exactly represented by any simple model. However, cunningly chosen parsimonious models often do provide remarkably useful approximations. For example, the law PV = RT relating pressure P, volume V and temperature T of an “ideal” gas via a constant R is not exactly true for any real gas, but it frequently provides a useful approximation and furthermore its structure is informative since it springs from a physical view of the behavior of gas molecules.

For such a model there is no need to ask the question “Is the model true?”. If “truth” is to be the “whole truth” the answer must be “No”. The only question of interest is “Is the model illuminating and useful?”.

  • George E. P. Box

The goal of a model is not to uncover truth, but to discover a simple approximation that is still useful.

Prerequisites

library(tidyverse)
library(modelr)
options(na.action = na.warn)

A simple model

ggplot(sim1, aes(x, y)) +   # simulated dataset, included with the modelr package
  geom_point()

Do you see a strong pattern?

250 random linear models:

models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point() 

Finding a “good” line:

Turn our model family into an R function:

model1 <- function(a, data) {
  a[1] + data$x * a[2]   # a[1] is the intercept, a[2] the slope
}
model1(c(7, 1.5), sim1)
##  [1]  8.5  8.5  8.5 10.0 10.0 10.0 11.5 11.5 11.5 13.0 13.0 13.0 14.5 14.5 14.5
## [16] 16.0 16.0 16.0 17.5 17.5 17.5 19.0 19.0 19.0 20.5 20.5 20.5 22.0 22.0 22.0

Distance from a model to the data:

measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)   # observed minus predicted
  sqrt(mean(diff ^ 2))                 # root-mean-squared deviation
}
measure_distance(c(7, 1.5), sim1)
## [1] 2.665212

Compute the distances:

sim1_dist <- function(a1, a2) {  # helper function. why do we need this?
  measure_distance(c(a1, a2), sim1)
}

models <- models %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models

10 best models:

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(models, rank(dist) <= 10)
  )

Visualise the models in the \(a_1-a_2\) plane (model space):

ggplot(models, aes(a1, a2)) +
  geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))

A more systematic way: grid search

grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
) %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))

grid %>% 
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist)) 

10 best models from the grid:

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(grid, rank(dist) <= 10)
  )

Numerical optimisation

Imagine iteratively making the grid finer and finer.

Newton-Raphson search: pick a starting point and look around for the steepest slope. You then ski down that slope a little way, and then repeat again and again, until you can’t go any lower.

best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
## [1] 4.222248 2.051204
ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = best$par[1], slope = best$par[2])

lm()

Special tool designed for fitting linear models.

Formulas are lm()’s way of specifying the model family: y ~ x translates to a function like y = a_1 + a_2 * x.

sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
## (Intercept)           x 
##    4.220822    2.051533

These are essentially the same values we got with optim(). Behind the scenes, lm() doesn’t search numerically: it exploits the mathematical structure of linear models to find the closest model in a single step.

Visualising models

Predictions

grid <- sim1 %>% 
  data_grid(x) # evenly spaced grid that covers the data region
grid

Add predictions:

grid <- grid %>% 
  modelr::add_predictions(sim1_mod) 
grid

Plot them:

ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)

Residuals

Distances between the observed and predicted values.

Tell you what the model has missed.

Add residuals to the original dataset (why?):

sim1 <- sim1 %>% 
  add_residuals(sim1_mod)
sim1

Distribution of the residuals:

ggplot(sim1, aes(resid)) + 
  geom_freqpoly(binwidth = 0.5)

How far away are the predictions from the observed values?
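One numeric answer, computed from the stored residuals (this is the same root-mean-squared deviation we minimised earlier):

sim1 %>% 
  summarise(rmse = sqrt(mean(resid ^ 2)))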

Typical residual plot:

sim1 %>%
  ggplot(aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

This looks like random noise, suggesting that our model has done a good job of capturing the patterns in the dataset.

Formulas and model families

What R actually does with formulae:

df <- tribble(
  ~y, ~x1, ~x2,
  4, 2, 5,
  5, 1, 6
)
model_matrix(df, y ~ x1)

If you don’t want the intercept:

model_matrix(df, y ~ x1 - 1)

More than one variable:

model_matrix(df, y ~ x1 + x2)

This formula notation was initially described in “Symbolic Description of Factorial Models for Analysis of Variance” by G. N. Wilkinson and C. E. Rogers (https://www.jstor.org/stable/2346786).

Categorical variables

Suppose you have a formula like y ~ sex, where sex could either be male or female.

It doesn’t make sense to convert that to a formula like y = a_0 + a_1 * sex because sex isn’t a number - you can’t multiply it! Instead, R converts it to y = a_0 + a_1 * sexmale, where sexmale is 1 if sex is male and 0 otherwise:

df <- tribble(
  ~ sex, ~ response,
  "male", 1,
  "female", 2,
  "male", 1
)
model_matrix(df, response ~ sex)

Why doesn’t R also create a sexfemale column? (Hint: it would be perfectly predictable from the sexmale column, which causes problems for model fitting.)

The sim2 dataset from modelr:

sim2
ggplot(sim2) + 
  geom_point(aes(x, y))

Fit a model:

mod2 <- lm(y ~ x, data = sim2)

grid <- sim2 %>% 
  data_grid(x) %>% 
  add_predictions(mod2)
grid

A linear model with a categorical x will predict the mean value for each category (why?):

ggplot(sim2, aes(x)) + 
  geom_point(aes(y = y)) +
  geom_point(data = grid, aes(y = pred), colour = "red", size = 4)
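A quick sanity check of this claim: the per-category means of y should match the pred column in grid:

sim2 %>% 
  group_by(x) %>% 
  summarise(mean_y = mean(y))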

You can’t make predictions about levels that you didn’t observe:

tibble(x = "e") %>% 
  add_predictions(mod2)
## Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor x has new level e

Interactions (continuous and categorical)

What happens when you combine a continuous and a categorical variable?

sim3
ggplot(sim3, aes(x1, y)) + 
  geom_point(aes(colour = x2))

Two possible models you could fit to this data:

mod1 <- lm(y ~ x1 + x2, data = sim3)
mod2 <- lm(y ~ x1 * x2, data = sim3)

  • y ~ x1 + x2 translates to y = a_0 + a_1 * x1 + a_2 * x2.
  • y ~ x1 * x2 translates to y = a_0 + a_1 * x1 + a_2 * x2 + a_12 * x1 * x2.
  • Whenever you use *, both the interaction and the individual components are included in the model.
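To see exactly which columns each formula generates, inspect the model matrices (x2 is categorical, so it expands into indicator columns):

model_matrix(sim3, y ~ x1 + x2)
model_matrix(sim3, y ~ x1 * x2)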

Visualisation

grid <- sim3 %>% 
  data_grid(x1, x2) %>%      # note two predictors
  gather_predictions(mod1, mod2)   # add predictions from both models
grid

Use facetting:

ggplot(sim3, aes(x1, y, colour = x2)) + 
  geom_point() + 
  geom_line(data = grid, aes(y = pred)) + 
  facet_wrap(~ model)

  • Model that uses + has the same slope for each line, but different intercepts.
  • Model that uses * has a different slope and intercept for each line.

Model comparison

sim3 <- sim3 %>% 
  gather_residuals(mod1, mod2)

ggplot(sim3, aes(x1, resid, colour = x2)) + 
  geom_point() + 
  facet_grid(model ~ x2)

Which model do you prefer? (The residuals of mod1 still show clear patterns within some levels of x2, while those of mod2 look much more like random noise.)

Interactions (two continuous)

sim4

Initially things proceed almost identically to the previous example:

mod1 <- lm(y ~ x1 + x2, data = sim4)
mod2 <- lm(y ~ x1 * x2, data = sim4)

grid <- sim4 %>% 
  data_grid(
    x1 = seq_range(x1, 5),  # regularly spaced grid between min and max
    x2 = seq_range(x2, 5) 
  ) %>% 
  gather_predictions(mod1, mod2)
grid

Visualisation

We can imagine the model like a 3d surface:

ggplot(grid, aes(x1, x2)) + 
  geom_tile(aes(fill = pred)) + 
  facet_wrap(~ model)

Slice views:

ggplot(grid, aes(x1, pred, colour = x2, group = x2)) + 
  geom_line() +
  facet_wrap(~ model)

ggplot(grid, aes(x2, pred, colour = x1, group = x1)) + 
  geom_line() +
  facet_wrap(~ model)

Visual model comparison is not as easy here as it was for one continuous and one categorical variable.

Transformations

  • Model formula log(y) ~ sqrt(x1) + x2 translates to log(y) = a_1 + a_2 * sqrt(x1) + a_3 * x2.
  • If your transformation involves +, *, ^, or -, you’ll need to wrap it in I():
    • y ~ x + I(x ^ 2) translates to y = a_1 + a_2 * x + a_3 * x^2.
    • y ~ x ^ 2 + x translates to y ~ x * x + x, which in turn translates to y = a_1 + a_2 * x.

Model matrix:

df <- tribble(
  ~y, ~x,
   1,  1,
   2,  2, 
   3,  3
)
model_matrix(df, y ~ I(x^2) + x)
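For contrast, here is what happens without I(): x ^ 2 simplifies to x, so you get only an intercept and x:

model_matrix(df, y ~ x ^ 2 + x)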

Polynomial regression

Typing out equations like y = a_1 + a_2 * x + a_3 * x^2 + a_4 * x^3 with I() quickly gets tedious; the helper poly() does it for you:

model_matrix(df, y ~ poly(x, 2))
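Note that poly() uses orthogonal polynomials by default, so the columns are not the raw powers x and x^2. If you want the raw powers, pass raw = TRUE:

model_matrix(df, y ~ poly(x, 2, raw = TRUE))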

Spline regression

Outside the range of the data, polynomials rapidly shoot off to positive or negative infinity. Splines stitch together low-degree polynomials smoothly:

sim5 <- tibble(
  x = seq(0, 3.5 * pi, length = 50),
  y = 4 * sin(x) + rnorm(length(x))
)

ggplot(sim5, aes(x, y)) +
  geom_point()

library(splines)
mod1 <- lm(y ~ ns(x, 1), data = sim5)
mod2 <- lm(y ~ ns(x, 2), data = sim5)
mod3 <- lm(y ~ ns(x, 3), data = sim5)
mod4 <- lm(y ~ ns(x, 4), data = sim5)
mod5 <- lm(y ~ ns(x, 5), data = sim5)

grid <- sim5 %>% 
  data_grid(x = seq_range(x, n = 50, expand = 0.1)) %>% 
  gather_predictions(mod1, mod2, mod3, mod4, mod5, .pred = "y")

ggplot(sim5, aes(x, y)) + 
  geom_point() +
  geom_line(data = grid, colour = "red") +
  facet_wrap(~ model)

The extrapolation outside the range of the data is clearly bad. The model can never tell you if the behaviour is true when you start extrapolating outside the range of the data that you have seen. You must rely on theory and science.

Missing values

R silently drops missing values, but options(na.action = na.warn), which we ran in the prerequisites, makes sure you get a warning.

df <- tribble(
  ~x, ~y,
  1, 2.2,
  2, NA,
  3, 3.5,
  4, 8.3,
  NA, 10
)

mod <- lm(y ~ x, data = df)
## Warning: Dropping 2 rows with missing values

To suppress the warning, set na.action = na.exclude:

mod <- lm(y ~ x, data = df, na.action = na.exclude)
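A practical difference with na.exclude: residuals and predictions are padded back to the length of the original data, with NA in the dropped positions:

residuals(mod)   # length 5, with NA for rows 2 and 5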

See exactly how many observations were used:

nobs(mod)
## [1] 3

Other model families

  • Generalised linear models, e.g. stats::glm(), extend linear models to include non-continuous responses (e.g. binary data or counts).

  • Generalised additive models, e.g. mgcv::gam(), extend generalised linear models to incorporate arbitrary smooth functions.

  • Penalised linear models, e.g. glmnet::glmnet(), add a penalty term to the distance that penalises complex models. This tends to make models that generalise better to new datasets from the same population.

  • Robust linear models, e.g. MASS::rlm(), tweak the distance to downweight points that are very far away. This makes them less sensitive to the presence of outliers, at the cost of being not quite as good when there are no outliers.

  • Trees, e.g. rpart::rpart(), fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()) or gradient boosting machines (e.g. xgboost::xgboost()).

These models all work similarly from a programming perspective. Once you’ve mastered linear models, you should find it easy to master the mechanics of these other model classes.
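For example, a generalised linear model uses the same formula interface and the same helpers; a minimal sketch, reusing sim1 (with a gaussian family this reduces to the linear model above, so it only demonstrates that the mechanics carry over):

mod_glm <- glm(y ~ x, family = gaussian, data = sim1)
coef(mod_glm)   # same coefficients as lm(y ~ x, data = sim1)
sim1 %>% 
  data_grid(x) %>% 
  add_predictions(mod_glm)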