제 11강: 선형모형 2

데이터과학 입문

원중호

서울대학교 통계학과

May 2024

시작하기 전에

다음의 패키지가 설치되어 있지 않으면 설치한다.

# install.packages("MASS")
# install.packages("tidyverse")
# install.packages("kableExtra") # for pretty tables
# install.packages("patchwork")  # for side-by-side ggplots
# install.packages("multcomp")  # for multiple comparisons
library(MASS)
library(tidyverse)
library(kableExtra) 
library(patchwork) 
library(multcomp)

회귀 진단

Whiteside 자료는 적절한 선형 모형을 찾기 쉬운 편이나 일반적으로 하나 혹은 그 이상의 점들이 적합되지 않거나 그 점들이 모형 적합에 큰 영향을 끼쳤는지를 평가하기 위한 연구가 많이 되어 있다 (Atkinson, 1985).

잔차

\[ \mathbf{e} = \hat{\mathbf{y}} - \mathbf{y} \]

회귀 진단의 기본 도구
잔차 이용의 예: 경향성 유무, 분포의 정규성 진단
잔차는 서로 독립이 아님 (절편 항이 있으면 잔차합이 0)
잔차는 등분산이 아님

잔차 공분산 행렬 \[ \operatorname{cov}(\mathbf{e}) = \sigma^2 (I_n - H), \quad H=X(X^T X)^{-1} X^T \]
- \(H\) : hat matrix, 모형 공간으로의 직교사영행렬 (\(\hat{\mathbf{y}} = H\mathbf{y}\))
지렛대값(leverage)
- \(h_{ii}\) = \(H\)의 \(i\)번째 대각원소, 관측값 \(y_{i}\)의 지렛대
  - \(h_{ii}\) 값이 크면 \(y_i\)가 변화할 때 적합평면이 바뀐 값 쪽으로 크게 이동한다.
- \(\operatorname{tr}(H)=p\), 모형 공간의 차원 또는 공변량의 갯수
- “\(h_{ii}\)가 크다”의 기준: \(p/n = \frac{1}{n}\sum_{i=1}^n h_{ii}\)의 2 또는 3배

표준화 잔차

지렛대값(\(h_{ii}\))이 클 때 해당 잔차(\(e_i\))의 분산이 잔차 전체의 분산보다 작아짐: \[ \operatorname{var}(e_i) = \sigma^2(1 - h_{ii}) \ll \frac{\sigma^2}{n} \sum_{j=1}^{n} (1-h_{jj}) = \frac{1}{n} \sum_{j=1}^n [\operatorname{cov}(e)]_{jj} \]
표준화 잔차로 이를 보정 \[ e_{i}^{\prime} = \frac{e_i}{s \sqrt{1-h_{ii}}}, \quad i=1, \dotsc, n \]
- \(s^2\)은 평균잔차제곱합(RSS/\(n\))
- 잔차의 분산을 모두 1로 맞춤

스튜던트화 잔차

만약 잔차값 하나가 너무 크면 분산의 추정치 \(s^2\)도 아주 클 것이고, 이로 인해 모든 표준화 잔차값들을 수축시킴
스튜던트화 잔차 (잭나이프화 잔차, jackknifed residual) \[ e_{i}^{*} = \frac{y_i - \hat{y}_{(i)}}{\sqrt{\operatorname{var}(y_i - \hat{y}_{(i)})}} = \frac{y_i - \hat{y}_{(i)}}{s_{(i)}\sqrt{1-h_{ii}}}, \quad i=1, \dotsc, n \]
- \(\widehat{y}_{(i)}\) = \(i\)번째 관측치를 제외한 나머지 관측치들로 적합한 모형으로 구한 \(i\)번 째 종속변수의 예측치
- \(s_{(i)}^{2}\) = \(i\)번 째 관측치를 제외한 나머지 관측치들로 적합한 모형으로 구한 오차의 평균제곱합

관측치를 제외할 때마다 모형을 새로 적합하는 계산을 하지 않아도 됨: \[ e_{i}^{*} = e_{i}^{\prime} \left[ \frac{n-p-{{e}_{i}^{\prime}}^2}{n-p-1} \right]^{1/2}, \quad i=1, \dotsc, n \] 임이 알려져 있다.
- \(|e_{i}^{\prime}| \le \sqrt{n-p}, \quad i=1,2, \dots, n\)
보통 스튜던트화 잔차를 비교하는 것이 잔차를 비교하는 것보다 좋음.

Scottish hill races 자료

as.tibble(MASS::hills)

# A tibble: 35 × 3
    dist climb  time
   <dbl> <int> <dbl>
 1   2.5   650  16.1
 2   6    2500  48.4
 3   6     900  33.6
 4   7.5   800  45.6
 5   8    3070  62.3
 6   8    2866  73.2
 7  16    7500 205. 
 8   6     800  36.4
 9   5     800  29.8
10   6     650  39.8
# ℹ 25 more rows

변수 : 달리기의 전체 거리, 총 등반 높이, 소요 시간

(hills.lm <-  lm(time ~ dist + climb, data = hills))


Call:
lm(formula = time ~ dist + climb, data = hills)

Coefficients:
(Intercept)         dist        climb  
   -8.99204      6.21796      0.01105

    # 스튜던트화 잔차도 (MASS::studres)
    par(fig = c(x1 = 0,  x2 = 0.6,  y1 = 0, y2 = 1))
    plot(fitted(hills.lm), MASS::studres(hills.lm), pch = 16)
    abline(h = 0, lty = 2)
    # Two large residual points
    ord <- order(MASS::studres(hills.lm), decreasing=T)
    x1 <- fitted(hills.lm)[ord][1]
    x2 <- fitted(hills.lm)[ord][2]
    y1 <- MASS::studres(hills.lm)[ord][1] 
    y2 <- MASS::studres(hills.lm)[ord][2] 
    text(x = x1, y = y1, label = names(y1), pos=4)
    text(x = x2, y = y2, label = names(y2), pos=2)
    # Quantile-quantile plot
    par(fig = c(0.6,  1,  0,  1), new = T)
    qqnorm(MASS::studres(hills.lm), pch = 16, 
           xlab = "Quantiles of Standard Normal", 
           ylab = "studres(hills.lm)", main = NULL)
    qqline(MASS::studres(hills.lm))

    hills.hat <- lm.influence(hills.lm)$hat  # leverage
    cbind(hills, lev = hills.hat) [hills.hat > 3/35,  ]

              dist climb    time       lev
Bens of Jura    16  7500 204.617 0.4204346
Lairig Ghru     28  2100 192.667 0.6898161
Ben Nevis       10  4400  85.583 0.1215821
Two Breweries   18  5200 170.250 0.1715848
Moffat Chase    20  5000 159.833 0.1909891

지렛대 값이 높은 점이 2개이고, 스튜던트화 잔차가 큰 점도 2개이다. Bens of Jura는 두 가지 모두에 포함된다.

Knock Hill을 보면, 실제 기록보다 예측이 1시간 (60분)이상 작다.

cbind(hills, pred = predict(hills.lm))["Knock Hill",]

           dist climb  time    pred
Knock Hill    3   350 78.65 13.5286

기록에 1시간 정도 오차가 있는 것으로 보인다 (Atkinson, 1988). 따라서 Knock Hill 관측치를 제외하자.

(hills1.lm <- update(hills.lm, 
                     subset = !(rownames(hills) %in% c("Knock Hill"))
                     )
 )


Call:
lm(formula = time ~ dist + climb, data = hills, subset = !(rownames(hills) %in% 
    c("Knock Hill")))

Coefficients:
(Intercept)         dist        climb  
  -13.53035      6.36456      0.01185

Knock Hill은 지렛대 값이 크지 않다. 따라서 Knock Hill을 제거하는 것은 적합된 모형을 크게 바꾸지는 않는다.

반면, Bens of Jura는 지렛대 값이 크고, 잔차도 크므로 적합된 모형에 영향이 있을 것이다.

update(hills.lm, 
       subset = !(rownames(hills) %in% c("Knock Hill", "Bens of Jura"))
       )


Call:
lm(formula = time ~ dist + climb, data = hills, subset = !(rownames(hills) %in% 
    c("Knock Hill", "Bens of Jura")))

Coefficients:
(Intercept)         dist        climb  
 -10.361646     6.692114     0.008047

선형모형의 개선

경주의 거리가 짧으면 기록 시간이 음수이다. 물리적으로 절편이 0인 것이 바람직하다.

summary(hills1.lm)$coefficients

               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -13.5303520 2.649100461 -5.107527 1.578034e-05
dist          6.3645623 0.361129680 17.624035 1.043662e-17
climb         0.0118549 0.001234846  9.600304 8.412002e-11

예측한 시간의 정확도가 그 시간에 반비례할 필요가 있다. 시간 대신 거리를 이용한 가중치를 주어 모형을 적합하자.

summary(update(hills1.lm, weights =  1/(dist^2)))$coefficients

                Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) -5.808538089 2.034247813 -2.855374 7.602533e-03
dist         5.820759581 0.536083931 10.857926 4.330482e-12
climb        0.009029245 0.001537499  5.872685 1.763612e-06

절편의 추정치는 여전히 통계적으로 유의하게 0이 아니므로 절편이 0인 모형을 적합.

lm(time ~ -1 + dist + climb, 
   data=(hills %>% 
           filter(! (rownames(.) %in% c("Knock Hill")))), 
   weights = 1/(dist^2))


Call:
lm(formula = time ~ -1 + dist + climb, data = (hills %>% filter(!(rownames(.) %in% 
    c("Knock Hill")))), weights = 1/(dist^2))

Coefficients:
    dist     climb  
4.899985  0.008472

dist로 예측 방정식을 나누고, 구배(climb/dist)를 종속변수로, 속력의 역수(time/distance)를 독립변수로 하여 선형회귀적합을 하면 위와 같은 효과를 얻을 수 있다.

hills$ispeed <- hills$time / hills$dist
hills$grad   <- hills$climb / hills$dist
(hills2.lm <- lm(ispeed ~ grad, 
                 data = (hills %>% 
                           filter(! (rownames(.) %in% c("Knock Hill")))
                         )
                 )
)


Call:
lm(formula = ispeed ~ grad, data = (hills %>% filter(!(rownames(.) %in% 
    c("Knock Hill")))))

Coefficients:
(Intercept)         grad  
   4.899985     0.008472

# (Studentized) residual plot
par(fig = c(x1 = 0,  x2 = 0.6,  y1 = 0, y2 = 1))
plot(hills$grad[-ord[1]], studres(hills2.lm), xlab = "grad", pch = 16)
abline(h = 0, lty = 2)
sr <- studres(hills2.lm)
w <- c(abs(sr)>2)
text(x = hills2.lm$model$grad[w], y = sr[w], label = names(sr)[w], pos=2)
# Q-Q plot
par(fig = c(0.6,  1,  0,  1), new = T)
qqnorm(studres(hills2.lm), pch = 16, 
       xlab = "Quantiles of Standard Normal", 
       ylab = "studres(hills2.lm)", main = NULL)
qqline(studres(hills2.lm))

hills2.hat <- lm.influence(hills2.lm)$hat  # leverage
hills %>% filter(!(rownames(.) %in% c("Knock Hill"))) %>% 
  mutate(lev = hills2.hat) %>% 
  filter(lev > 1.8*2/(nrow(hills)-1))

             dist climb    time   ispeed   grad       lev
Bens of Jura   16  7500 204.617 12.78856 468.75 0.1135367
Creag Dubh      4  2000  26.217  6.55425 500.00 0.1391534

지렛대 값이 가장 큰 두 경기는 가장 가파른 두 경기로, 서로 반대 방향으로 당기는 이상점이다. 대부분의 자료에 대해 1마일당 5분 및 1000피트당 8분의 예측이 들어맞는다.

선형모형의 부트스트랩

선형모형 돌아보기

\[ y = \boldsymbol{\beta}^T\mathbf{x} + \boldsymbol{\epsilon} \]

고정설계(fixed design)
- \(\boldsymbol{\epsilon}\)만 확률변수로 봄.
- 모든 (가정적) 반복된 관측에서 같은 점 \(\mathbf{x}\)들에 대해 반응변수의 값은 다를 수 있음
- 실험 계획법(NPK 실험, 6.7절) , 정해진 요인이 있는 관찰 연구(Quine 자료, 6.8절)

랜덤설계(random design)
- 관측된 순서쌍 \((\mathbf{x}_i, y_i)\)들이 모집단으로부터의 랜덤표본으로 간주됨.
- 관심사: 회귀함수 \(f(\mathbf{x})=E(Y|X=\mathbf{x})\)의 추정, \(f(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}\).
- Whiteside, Scottish hills race 자료 등; calibration 혹은 errors-in-variable 문제
- 조건부 추론: 즉, 관측된 \(\mathbf{x}\)에 조건부. 고정설계로 변환.
  - 예: Scottish hills race 자료에서의 추론들은 특정 경기(특히 Bens of Jura)가 표본에 포함되어 있는가에 의존.

선형회귀모형에서의 부트스트랩 재표집
1. 사례 기반 재표집: 표본으로부터 순서쌍 \((x_i, y_i)\)을 복원추출. 랜덤 가중회귀에 대응.
2. 모형 기반 재표집: 잔차를 복원추출 \[y_i^* = \widehat{\beta}^T\mathbf{x}_i + e^*_i\]
  - \(\widehat{\beta}\) = 적합된 회귀계수 추정량
  - \(e^*_i\) : 잔차 \(e_i\)들로부터 복원추출한 오차

모형 기반 재표집의 문제점
1. 절편이 없는 모형에서는 잔차의 평균이 0이 아닐 수 있음
  - 평균 제거
2. 잔차의 분산이 모형 오차의 분산과 같지 않고, 등분산이 아닐 수 있다.
  - 수정 잔차 \(r_i = e_i / \sqrt{1-h_{ii}}\)로부터 재표집. \(var(r_i) = \sigma^2\).

library(boot)
fit <-  lm(calls ~  year,  data = phones)
ph  <-  data.frame(phones,  res = resid(fit), fitted = fitted(fit))
ph.lm.boot <-  boot(ph,  function(data,  i) {
                            d  <-  data
                            d$calls <-  d$fitted +  d$res[i] 
                            coef(update (fit, data=d))
                         }, R = 999)  %>% print


ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = ph, statistic = function(data, i) {
    d <- data
    d$calls <- d$fitted + d$res[i]
    coef(update(fit, data = d))
}, R = 999)


Bootstrap Statistics :
       original      bias    std. error
t1* -260.059246  5.39379763    97.53728
t2*    5.041478 -0.08152095     1.57332

boot.ci(ph.lm.boot, index=1, type=c("basic", "bca"))  # intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = ph.lm.boot, type = c("basic", "bca"), index = 1)

Intervals : 
Level      Basic                BCa          
95%   (-475.6,  -77.5 )   (-441.9,  -43.8 )  
Calculations and Intervals on Original Scale

boot.ci(ph.lm.boot, index=2, type=c("basic", "bca"))  # slope

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 999 bootstrap replicates

CALL : 
boot.ci(boot.out = ph.lm.boot, type = c("basic", "bca"), index = 2)

Intervals : 
Level      Basic                BCa          
95%   ( 2.143,  8.407 )   ( 1.725,  7.965 )  
Calculations and Intervals on Original Scale

요인설계와 실험계획법

NPK (Nitrogen, Phosphate, Potassium) 요인 실험

완두콩 재배에서의 질소, 인, 칼륨의 영향

npk %>% knitr::kable(booktab=F) %>% 
  kableExtra::kable_styling() %>% 
  kableExtra::row_spec( which(as.integer(npk$block) %% 2 == 1), 
                        color = "black", background = "lightgrey")

block	N	P	K	yield
1	0	1	1	49.5
1	1	1	0	62.8
1	0	0	0	46.8
1	1	0	1	57.0
2	1	0	0	59.8
2	1	1	1	58.5
2	0	0	1	55.5
2	0	1	0	56.0
3	0	1	0	62.8
3	1	1	1	55.8
3	1	0	0	69.5
3	0	0	1	55.0
4	1	0	0	62.0
4	1	1	1	48.8
4	0	0	1	45.5
4	0	1	0	44.2
5	1	1	0	52.0
5	0	0	0	51.5
5	1	0	1	49.8
5	0	1	1	48.8
6	1	0	1	57.2
6	1	1	0	59.0
6	0	1	1	53.2
6	0	0	0	56.0

부분요인설계 — 6개의 실험 블록 각각에서 완전설계(\(8=2^3\))의 절반이 시행 (블록 1, 5, 6 = N + P + K 짝수번)
N, P, K의 각 조합은 전체 블록에 걸쳐 3번씩 나타남.

ANOVA

options(contrasts = c("contr.helmert", "contr.poly")) # Helmert contrast
npk.aov <- aov(yield ~ block + N*P*K, data = npk) %>% print

Call:
   aov(formula = yield ~ block + N * P * K, data = npk)

Terms:
                   block        N        P        K      N:P      N:K      P:K
Sum of Squares  343.2950 189.2817   8.4017  95.2017  21.2817  33.1350   0.4817
Deg. of Freedom        5        1        1        1        1        1        1
                Residuals
Sum of Squares   185.2867
Deg. of Freedom        12

Residual standard error: 3.929447
1 out of 13 effects not estimable
Estimated effects are balanced

summary(npk.aov)

            Df Sum Sq Mean Sq F value  Pr(>F)   
block        5  343.3   68.66   4.447 0.01594 * 
N            1  189.3  189.28  12.259 0.00437 **
P            1    8.4    8.40   0.544 0.47490   
K            1   95.2   95.20   6.166 0.02880 * 
N:P          1   21.3   21.28   1.378 0.26317   
N:K          1   33.1   33.13   2.146 0.16865   
P:K          1    0.5    0.48   0.031 0.86275   
Residuals   12  185.3   15.44                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coef(npk.aov)

(Intercept)      block1      block2      block3      block4      block5 
 54.8750000   1.7125000   1.6791667  -1.8229167  -1.0137500   0.2950000 
         N1          P1          K1       N1:P1       N1:K1       P1:K1 
  2.8083333  -0.5916667  -1.9916667  -0.9416667  -1.1750000   0.1416667

블록이 교락변수로 작용하여 N:P:K 교호작용은 추정할 수 없다 (예: \(b_2 + b_3 + b_4 - b_1 - b_5 - b_6 = N*P*K\), \(\pm 1\) 코딩시).

N:P:K 교호작용은 summary()에서 생략됨

어떤 효과들이 생략되었나?

alias(npk.aov)

Model :
yield ~ block + N * P * K

Complete :
         (Intercept) block1 block2 block3 block4 block5 N1    P1    K1    N1:P1
N1:P1:K1     0           1    1/3    1/6  -3/10   -1/5      0     0     0     0
         N1:K1 P1:K1
N1:P1:K1     0     0

각 행의 열들은 헹에 선형 종속되는 효과에 해당

표준오차

N과 K의 주효과들만 유의하므로, 축소된 모형을 적합

options(contrasts = c("contr.treatment", "contr.poly")) # 처리 대조로 복원
npk.aov1 <- aov(yield ~ block + N + K, data = npk)
summary.lm(npk.aov1)


Call:
aov(formula = yield ~ block + N + K, data = npk)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.4083 -2.1438  0.2042  2.3292  7.0750 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   53.208      2.276  23.381  8.5e-14 ***
block2         3.425      2.787   1.229  0.23690    
block3         6.750      2.787   2.422  0.02769 *  
block4        -3.900      2.787  -1.399  0.18082    
block5        -3.500      2.787  -1.256  0.22723    
block6         2.325      2.787   0.834  0.41646    
N1             5.617      1.609   3.490  0.00302 ** 
K1            -3.983      1.609  -2.475  0.02487 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.942 on 16 degrees of freedom
Multiple R-squared:  0.7163,    Adjusted R-squared:  0.5922 
F-statistic: 5.772 on 7 and 16 DF,  p-value: 0.001805

질소와 칼륨을 가하는 효과는 각각 5.62와 -3.98임

대조변수의 표준오차

se.contrast(npk.aov1, list(N ==  "0",  N ==  "1"), data =  npk)

[1] 1.609175

se.contrast(npk.aov1, list(K ==  "0",  K ==  "1"), data =  npk)

[1] 1.609175

Quine 자료와 불균형 4원배치

자료

as.tibble(MASS::quine)

# A tibble: 146 × 5
   Eth   Sex   Age   Lrn    Days
   <fct> <fct> <fct> <fct> <int>
 1 A     M     F0    SL        2
 2 A     M     F0    SL       11
 3 A     M     F0    SL       14
 4 A     M     F0    AL        5
 5 A     M     F0    AL        5
 6 A     M     F0    AL       13
 7 A     M     F0    AL       20
 8 A     M     F0    AL       22
 9 A     M     F1    SL        6
10 A     M     F1    SL        6
# ℹ 136 more rows

반응변수(Days): 호주, 뉴사우스 웨일즈 주의 대도시의 아이들이 1년 간 학교에 결석한 날짜 수

설명변수
- Age: 4 levels. Primary, first, second or third form (F0–F3)
- Eth: 2 levels. Aboriginal or non-aboriginal
- Lrn: 2 levels. Slow or average learner
- Sex: 2 levels. Male or female

도수분포표

xtabs(~ Lrn + Age + Sex + Eth, data=quine)

, , Sex = F, Eth = A

    Age
Lrn  F0 F1 F2 F3
  AL  4  5  1  9
  SL  1 10  8  0

, , Sex = M, Eth = A

    Age
Lrn  F0 F1 F2 F3
  AL  5  2  7  7
  SL  3  3  4  0

, , Sex = F, Eth = N

    Age
Lrn  F0 F1 F2 F3
  AL  4  6  1 10
  SL  1 11  9  0

, , Sex = M, Eth = N

    Age
Lrn  F0 F1 F2 F3
  AL  6  2  7  7
  SL  3  7  3  0

불균형 자료 – 학습속도가 느린 학생은 F3에는 없고 그 외의 28개의 조합에는 1명 이상 있다.

이분산성(heteroskedasticity)

# the tidyverse way
quinestat <- quine %>% group_by(Eth, Sex, Age, Lrn) %>% 
  summarize(Mean = mean(Days, na.rm=TRUE), 
            Var = var(Days, na.rm=TRUE), SD = sd(Days, na.rm=TRUE))
varplot <- ggplot(quinestat, aes(x = Mean, y = Var)) + geom_point() + 
            labs(x = "Cell Means", y = "Cell Variances")
sdplot  <- ggplot(quinestat, aes(x = Mean, y = SD))  + geom_point() + 
            labs(x = "Cell Means", y = "Cell Std Devn.")
varplot + sdplot

결측값은 그래프를 그릴 때 자동적으로 생략.

셀 평균에 대한 표준편차의 대략적인 선형성은 로그 변환 또는 그와 비슷한 것이 필요하다는 것을 암시한다.

Box-Cox 변환

\[ y^{(\lambda)}= \begin{cases}\left(y^\lambda-1\right) / \lambda & \lambda \neq 0 \\ \log y & \lambda=0\end{cases} \]

Quine 자료에서는 \(y = Days + \alpha\), \(\alpha > 0\) (왜 그런가?)

고정된 \(\alpha\)

\(\alpha=1\)로 두자 (Aitkins, 1978).
\(\lambda\)의 프로파일 가능도함수 (Box and Cox, 1964) \[ \widehat{L}(\lambda) = \mathrm{const} - \frac{n}{2} \log \mathrm{RSS}(z^{(\lambda)}), \quad z^{(\lambda)} = y^{(\lambda)} / \dot{y}^{\lambda-1} \]
- \(\dot{y}\)는 관측치의 기하 평균
- \(RSS(z^{(\lambda)})\)는 \(z^{(\lambda)}\)에 대한 회귀모형의 잔차제곱합
- 가장 큰 선형 모형을 사용하는 것이 바람직함.

MASS::boxcox(Days+1 ~ Eth * Sex * Age * Lrn, data = quine, 
             singular.ok = TRUE,  lambda = seq(-0.05, 0.45, len = 20))

singular.ok: 자료에 4개의 빈 셀이 있으므로 완전모형 Days+1 ~ Eth * Sex * Age * Lrn의 모형행렬은 랭크가 부족.

로그 변환

위 그림은 \(\alpha=1\)일 때 로그 변환(\(\lambda=0\))이 최적이 아니라고 강하게 시사한다.
로그 변환을 고집한다면 고려할 수 있는 변환은 \(t(y, \alpha) = \log(y + \alpha)\)이다. Box and Cox (1964)에서와 같은 분석을 이용해 \(\alpha\)에 대한 프로파일 가능도가 \[ \hat{L}(\alpha) = \mathrm{const} - \frac{n}{2} \log\mathrm{RSS} \{ \log (y + \alpha) \} - \sum \log (y + \alpha) \] 임을 쉽게 보일 수 있다.

MASS::logtrans(Days ~ Age*Sex*Eth*Lrn, data = quine, 
               alpha = seq(0.75, 6.5, len = 20), singular.ok = TRUE)

\(\alpha=2.5\)가 바람직하다. (Aitkins (1978)은 \(\log(Days+1)\)을 사용)

모형 선택

완전모형인 Eth * Sex * Age * Lrn은 식별된 그룹 하나하나마다 서로 다른 모수를 가지므로 보다 간단한 모형들을 모두 포함.
과적합으로 인해 예측력과 설명력이 작음.
평균에 대한 더 좋은 설명력과 예측력을 위해 가능한 한 모수의 수가 적은 경제적인 모형이 필요.
주변성 제한: 고차항이 있는 모형은 저차항을 포함. 예: 2차 회귀에서 \(x^2\)항이 있는 모형은 특별한 경우가 아닌 한 \(x\)항과 절편을 포함.

변수 선택

후진선택

완전모형에서 4원 교호작용을 제외한 모형에서 3원 항을 제외한다면 어느 항을 제외해야 하는가?

(quine.hi <- aov(log(Days + 2.5) ~ .^4, MASS::quine))  # 완전모형

Call:
   aov(formula = log(Days + 2.5) ~ .^4, data = MASS::quine)

Terms:
                     Eth      Sex      Age      Lrn  Eth:Sex  Eth:Age  Eth:Lrn
Sum of Squares  10.68203  0.62388  3.76424  0.65290  0.01533  5.98964  0.01246
Deg. of Freedom        1        1        3        1        1        3        1
                 Sex:Age  Sex:Lrn  Age:Lrn Eth:Sex:Age Eth:Sex:Lrn Eth:Age:Lrn
Sum of Squares   8.68925  0.57977  2.38640     1.93915     3.74062     2.14622
Deg. of Freedom        3        1        2           3           1           2
                Sex:Age:Lrn Eth:Sex:Age:Lrn Residuals
Sum of Squares      1.46623         0.78865  63.31035
Deg. of Freedom           2               2       118

Residual standard error: 0.732481
4 out of 32 effects not estimable
Estimated effects may be unbalanced

    quine.nxt <- update(quine.hi, . ~ . - Eth:Sex:Age:Lrn) 
    MASS::dropterm(quine.nxt, test = "F")

Single term deletions

Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Age + 
    Eth:Lrn + Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Age + Eth:Sex:Lrn + 
    Eth:Age:Lrn + Sex:Age:Lrn
            Df Sum of Sq    RSS     AIC F Value   Pr(F)  
<none>                   64.099 -68.184                  
Eth:Sex:Age  3   0.97387 65.073 -71.982 0.60773 0.61125  
Eth:Sex:Lrn  1   1.57879 65.678 -66.631 2.95567 0.08816 .
Eth:Age:Lrn  2   2.12841 66.227 -67.415 1.99230 0.14087  
Sex:Age:Lrn  2   1.46623 65.565 -68.882 1.37247 0.25743  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Eth:Sex:Age를 제외시키면 AIC가 가장 크게 줄어든다.
일반적인 F 검정에서는 모든 3원 항이 유의하지 않다.

AIC

AIC는 일반적으로 다음과 같이 정의된다 (Akaike, 1974). AIC = -2(maximized log-likelihood) + 2(# parameters)
\(n\)개의 관측치, \(p\)개의 모수들, 정규분포를 따르는 잔차들을 갖는 회귀모형에 대해 로그 가능도는 \[ L(\beta, \alpha^{2}; \boldsymbol{y}) = \mathrm{const} - \frac{n}{2} \log \sigma^2 - \frac{1}{2 \sigma^2} \| \boldsymbol{y}-X \boldsymbol{\beta} \|^2 \]
\(\widehat{\beta}\)가 \(\beta\)에 대해 (로그)가능도함수를 최대화하는 값이라 하자. 그러면 \[ L(\widehat{\beta}, \alpha^{2}; \boldsymbol{y}) = \mathrm{const} - \frac{n}{2} \log \sigma^2 - \frac{1}{2 \sigma^2} \mathrm{RSS} \]

\(\sigma^2\)의 값이 알려져 있다면 \[ \text{AIC} = \frac{\text{RSS}}{\sigma^2} + 2p + \text{const} \] 이는 Mallow’s \(C_p\)와 동등하다. \[ C_p = \text{RSS}/ \sigma^2 + 2p - n \]
\(\sigma^2\)의 값이 알려져 있지 않다면 \[ L(\widehat{\beta}, \widehat{\alpha}^{2}; \boldsymbol{y}) = \text{const} - \frac{n}{2}\log \widehat{\sigma}^2 - \frac{n}{2}, \quad \widehat{\sigma}^2 = \text{RSS}/n \] 이므로 \[ \text{AIC} = n \log (\text{RSS}/n) + 2p + const \]

전진선택

가장 간단한 모형에서 시작해 AIC 값이 줄어드는 항을 추가하는 방법

quine.lo <- aov(log(Days+2.5) ~ 1, quine)
MASS::addterm(quine.lo, quine.hi, test = "F")

Single term additions

Model:
log(Days + 2.5) ~ 1
       Df Sum of Sq     RSS     AIC F Value     Pr(F)    
<none>              106.787 -43.664                      
Eth     1   10.6820  96.105 -57.052 16.0055 0.0001006 ***
Sex     1    0.5969 106.190 -42.483  0.8094 0.3698057    
Age     3    4.7469 102.040 -44.303  2.2019 0.0904804 .  
Lrn     1    0.0043 106.783 -41.670  0.0058 0.9392083    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Eth, Age만 (통계적으로) 유용한 것으로 보임.
궁극적으로는 높은 차수의 교호작용들이 잔차제곱합을 크게 줄여 모든 주효과 인자가 필요

단계적 선택

MASS::stepAIC: 단계별 선택 절차를 자동화
- 시작 적합 모형(최종 모델에 가까운 모델이 유리할 수 있음)
- 절차의 상위(가장 복잡) 및 하위(가장 간단) 모형을 정의하는 두 개의 공식 list 등이 필요

quine.stp <- MASS::stepAIC(quine.nxt, 
                           scope = list(upper = ~ Eth*Sex*Age*Lrn, 
                                        lower = ~1), 
                           trace = F)
quine.stp$anova

Stepwise Model Path 
Analysis of Deviance Table

Initial Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Age + 
    Eth:Lrn + Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Age + Eth:Sex:Lrn + 
    Eth:Age:Lrn + Sex:Age:Lrn

Final Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Age + 
    Eth:Lrn + Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Lrn + Eth:Age:Lrn

           Step Df Deviance Resid. Df Resid. Dev       AIC
1                                 120   64.09900 -68.18396
2 - Eth:Sex:Age  3 0.973869       123   65.07287 -71.98244
3 - Sex:Age:Lrn  2 1.526754       125   66.59962 -72.59652

남은 비주변 항의 유의성 확인

MASS::dropterm(quine.stp, test = "F")

Single term deletions

Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Age + 
    Eth:Lrn + Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Lrn + Eth:Age:Lrn
            Df Sum of Sq    RSS     AIC F Value     Pr(F)    
<none>                   66.600 -72.597                      
Sex:Age      3   10.7959 77.396 -56.663  6.7542 0.0002933 ***
Eth:Sex:Lrn  1    3.0325 69.632 -68.096  5.6916 0.0185476 *  
Eth:Age:Lrn  2    2.0960 68.696 -72.072  1.9670 0.1441822    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Eth:Age:Lrn 항은 5% 유의수준에서 유의하지 않다.

이는 AIC에 근거해 선택한 항들이 다소 관대함을 시사. 더 경제적인 모형을 얻기 위해 수동 제거

quine.3 <- update(quine.stp, . ~ . - Eth:Age:Lrn) 
MASS::dropterm(quine.3, test = "F")

Single term deletions

Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Age + 
    Eth:Lrn + Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Lrn
            Df Sum of Sq    RSS     AIC F Value     Pr(F)    
<none>                   68.696 -72.072                      
Eth:Age      3    3.0312 71.727 -71.768  1.8679 0.1383323    
Sex:Age      3   11.4272 80.123 -55.607  7.0419 0.0002037 ***
Age:Lrn      2    2.8149 71.511 -70.209  2.6020 0.0780701 .  
Eth:Sex:Lrn  1    4.6956 73.391 -64.419  8.6809 0.0038268 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    quine.4 <- update(quine.3, . ~ . - Eth:Age) 
    MASS::dropterm(quine.4, test = "F")

Single term deletions

Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Lrn + 
    Sex:Age + Sex:Lrn + Age:Lrn + Eth:Sex:Lrn
            Df Sum of Sq    RSS     AIC F Value     Pr(F)    
<none>                   71.727 -71.768                      
Sex:Age      3   11.5656 83.292 -55.942  6.9873 0.0002147 ***
Age:Lrn      2    2.9118 74.639 -69.959  2.6387 0.0752793 .  
Eth:Sex:Lrn  1    6.8181 78.545 -60.511 12.3574 0.0006052 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    quine.5 <- update(quine.4, . ~ . - Age:Lrn) 
    MASS::dropterm(quine.5, test = "F")

Single term deletions

Model:
log(Days + 2.5) ~ Eth + Sex + Age + Lrn + Eth:Sex + Eth:Lrn + 
    Sex:Age + Sex:Lrn + Eth:Sex:Lrn
            Df Sum of Sq    RSS     AIC F Value     Pr(F)    
<none>                   74.639 -69.959                      
Sex:Age      3    9.9002 84.539 -57.774  5.8362 0.0008944 ***
Eth:Sex:Lrn  1    6.2988 80.937 -60.130 11.1396 0.0010982 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Aitkin (1978)이 얻은 최종모형과 동일

다중 비교

다중비교 조정

사전에 계획된 실험에 의해 이루어지는 회귀분석에서 대조(contrast)에 대한 다중비교는 사후에 조정할 수 있다(post hoc adjustment).
9강 참조

Immer 자료

as.tibble(MASS::immer)  # 보리 생산량에 대한 실험자료

# A tibble: 30 × 4
   Loc   Var      Y1    Y2
   <fct> <fct> <dbl> <dbl>
 1 UF    M      81    80.7
 2 UF    S     105.   82.3
 3 UF    V     120.   80.4
 4 UF    T     110.   87.2
 5 UF    P      98.3  84.2
 6 W     M     147.  100. 
 7 W     S     142   116. 
 8 W     V     151.  112. 
 9 W     T     192.  148. 
10 W     P     146.  108. 
# ℹ 20 more rows

변수
- Loc: 실험농장 (6개)
- Var: 보리 품종 (5가지)
- Y1: 1931년 보리 생산량
- Y2: 1932년 보리 생산량

관심사 : 품종별 생산량 차이

immer.aov <- aov((Y1 + Y2)/2 ~ Var + Loc, data = immer)  # 2년간 평균 생산량 분석
summary(immer.aov)

            Df Sum Sq Mean Sq F value   Pr(>F)    
Var          4   2655   663.7   5.989  0.00245 ** 
Loc          5  10610  2122.1  19.148 5.21e-07 ***
Residuals   20   2217   110.8                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

품종별 생산량은 통계적으로 유의한 차이가 있다.

품종별 평균 생산량

model.tables(immer.aov, type = "means", se = TRUE, cterms = "Var")

Tables of means
Grand mean

101.09 

 Var 
Var
     M      P      S      T      V 
 94.39 102.54  91.13 118.20  99.18 

Standard errors for differences of means
          Var
        6.078
replic.     6

유의수준 5%에서 평균 생산량의 차이에 대한 쌍별 검정을 한다면 \(6.078 \times t_{20}(0.975) \approx 12.6\)이므로 품종 T와 다른 품종들간의 평균 생산량 차이는 유의할 것이라는 것을 시사한다.

하지만 이 비교는 적합 결과를 본 후에 정해진 것이다. 즉, 명시적으로 계산되지 않더라도 다른 모든 비교가 암시적으로 이루어진 것이다.
적합을 확인하기 전에 비교가 결정되었다고 받아들일 수 있다면 단순 조정을 행할 수 있으나 이런 경우는 거의 없으며, 더 나아가 이것이 정말로 사용자의 의도였다고 다른 사람들을 설득하기 어려울 수 있다.
따라서 모든 쌍별 비교에 대해 조정을 행하는 것이 일반적이다.

동시 신뢰구간

(tk <- confint(multcomp::glht(immer.aov, linfct = mcp(Var = "Tukey"))) )


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = (Y1 + Y2)/2 ~ Var + Loc, data = immer)

Quantile = 2.9929
95% family-wise confidence level
 

Linear Hypotheses:
           Estimate lwr      upr     
P - M == 0   8.1500 -10.0410  26.3410
S - M == 0  -3.2583 -21.4493  14.9326
T - M == 0  23.8083   5.6174  41.9993
V - M == 0   4.7917 -13.3993  22.9826
S - P == 0 -11.4083 -29.5993   6.7826
T - P == 0  15.6583  -2.5326  33.8493
V - P == 0  -3.3583 -21.5493  14.8326
T - S == 0  27.0667   8.8757  45.2576
V - S == 0   8.0500 -10.1410  26.2410
V - T == 0 -19.0167 -37.2076  -0.8257

plot(tk)

품종 T의 수확량이 품종 P와 크게 다르다는 결론을 내릴 수 없다.

Oats 자료

as.tibble(MASS::oats)

# A tibble: 72 × 4
   B     V           N          Y
   <fct> <fct>       <fct>  <int>
 1 I     Victory     0.0cwt   111
 2 I     Victory     0.2cwt   130
 3 I     Victory     0.4cwt   157
 4 I     Victory     0.6cwt   174
 5 I     Golden.rain 0.0cwt   117
 6 I     Golden.rain 0.2cwt   114
 7 I     Golden.rain 0.4cwt   161
 8 I     Golden.rain 0.6cwt   141
 9 I     Marvellous  0.0cwt   105
10 I     Marvellous  0.2cwt   140
# ℹ 62 more rows

oats1 <- aov(Y ~ N + V + B, data = oats)
summary(oats1)

            Df Sum Sq Mean Sq F value   Pr(>F)    
N            3  20020    6673  28.460 1.24e-11 ***
V            2   1786     893   3.809   0.0276 *  
B            5  15875    3175  13.540 6.91e-09 ***
Residuals   61  14304     234                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(glht(oats1, linfct = mcp(V = "Tukey")))


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: Tukey Contrasts


Fit: aov(formula = Y ~ N + V + B, data = oats)

Quantile = 2.4024
95% family-wise confidence level
 

Linear Hypotheses:
                              Estimate lwr      upr     
Marvellous - Golden.rain == 0   5.2917  -5.3280  15.9113
Victory - Golden.rain == 0     -6.8750 -17.4946   3.7446
Victory - Marvellous == 0     -12.1667 -22.7863  -1.5470

대조군과의 비교

lmat <- matrix(c(0, -1, 1, rep(0, 11), 
                 0, -1, 0, 1, rep(0,10),
                 0, -1, 0, 0, 1, rep(0,9)), , 3, dimnames = list(NULL,
               c("0.2cwt-0.0cwt", "0.4cwt-0.0cwt", "0.6cwt-0.0cwt")))
lmat

      0.2cwt-0.0cwt 0.4cwt-0.0cwt 0.6cwt-0.0cwt
 [1,]             0             0             0
 [2,]            -1            -1            -1
 [3,]             1             0             0
 [4,]             0             1             0
 [5,]             0             0             1
 [6,]             0             0             0
 [7,]             0             0             0
 [8,]             0             0             0
 [9,]             0             0             0
[10,]             0             0             0
[11,]             0             0             0
[12,]             0             0             0
[13,]             0             0             0
[14,]             0             0             0

confint(glht(oats1, linfct = mcp(N = t(lmat[2:5, ]))))


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = Y ~ N + V + B, data = oats)

Quantile = 2.4088
95% family-wise confidence level
 

Linear Hypotheses:
                   Estimate lwr     upr    
0.2cwt-0.0cwt == 0 19.5000   7.2046 31.7954
0.4cwt-0.0cwt == 0 34.8333  22.5379 47.1288
0.6cwt-0.0cwt == 0 44.0000  31.7046 56.2954

모든 질소 증가가 생산량에 유의미한 증가로 이어지는지 알아보자.

lmat <- matrix(c(0,-1,1,rep(0, 11), 0,0,-1,1, rep(0,10),
                 0,0,0,-1,1,rep(0,9)),,3,
               dimnames = list(NULL,
               c("0.2cwt-0.0cwt", "0.4cwt-0.2cwt", "0.6cwt-0.4cwt")))
lmat

      0.2cwt-0.0cwt 0.4cwt-0.2cwt 0.6cwt-0.4cwt
 [1,]             0             0             0
 [2,]            -1             0             0
 [3,]             1            -1             0
 [4,]             0             1            -1
 [5,]             0             0             1
 [6,]             0             0             0
 [7,]             0             0             0
 [8,]             0             0             0
 [9,]             0             0             0
[10,]             0             0             0
[11,]             0             0             0
[12,]             0             0             0
[13,]             0             0             0
[14,]             0             0             0

confint(glht(oats1, linfct = mcp(N = t(lmat[2:5, ])), alternative = "greater"))


     Simultaneous Confidence Intervals

Multiple Comparisons of Means: User-defined Contrasts


Fit: aov(formula = Y ~ N + V + B, data = oats)

Quantile = -2.1719
95% family-wise confidence level
 

Linear Hypotheses:
                   Estimate lwr     upr    
0.2cwt-0.0cwt <= 0 19.5000   8.4139     Inf
0.4cwt-0.2cwt <= 0 15.3333   4.2472     Inf
0.6cwt-0.4cwt <= 0  9.1667  -1.9195     Inf