Lecture 3: A Grammar for Graphics

Introduction to Data Science

Author: 원중호

Affiliation: Department of Statistics, Seoul National University

Published: March 2024

Before We Begin

Install the following packages if they are not already installed.

# install.packages("mdsr")
# install.packages("tidyverse")
# install.packages("NHANES")
# install.packages("macleish")
# install.packages("ggmosaic")
# install.packages("ggraph")
# install.packages("tidygraph")
# install.packages("babynames")
library(mdsr)
library(tidyverse)
library(NHANES)
library(macleish)
library(ggmosaic)
library(ggraph)
library(tidygraph)
library(babynames)
sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Seoul
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] babynames_1.0.1 tidygraph_1.3.1 ggraph_2.2.0    ggmosaic_0.3.3 
 [5] macleish_0.3.9  etl_0.4.1       NHANES_2.1.0    lubridate_1.9.3
 [9] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2    
[13] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.0  
[17] tidyverse_2.0.0 mdsr_0.2.7     

loaded via a namespace (and not attached):
 [1] gtable_0.3.4       xfun_0.42          htmlwidgets_1.6.4  ggrepel_0.9.5     
 [5] tzdb_0.4.0         vctrs_0.6.5        tools_4.3.3        generics_0.1.3    
 [9] proxy_0.4-27       fansi_1.0.6        pkgconfig_2.0.3    KernSmooth_2.23-22
[13] data.table_1.15.2  skimr_2.1.5        lifecycle_1.0.4    farver_2.1.1      
[17] compiler_4.3.3     munsell_0.5.0      ggforce_0.4.2      repr_1.1.6        
[21] graphlayouts_1.1.0 htmltools_0.5.7    class_7.3-22       yaml_2.3.8        
[25] lazyeval_0.2.2     plotly_4.10.4      pillar_1.9.0       MASS_7.3-60.0.1   
[29] classInt_0.4-10    cachem_1.0.8       viridis_0.6.5      tidyselect_1.2.0  
[33] digest_0.6.34      stringi_1.8.3      sf_1.0-15          polyclip_1.10-6   
[37] fastmap_1.1.1      grid_4.3.3         colorspace_2.1-0   cli_3.6.2         
[41] magrittr_2.0.3     base64enc_0.1-3    utf8_1.2.4         e1071_1.7-14      
[45] withr_3.0.0        scales_1.3.0       timechange_0.3.0   rmarkdown_2.25    
[49] httr_1.4.7         igraph_2.0.2       gridExtra_2.3      hms_1.1.3         
[53] memoise_2.0.1      evaluate_0.23      knitr_1.45         viridisLite_0.4.2 
[57] rlang_1.1.3        Rcpp_1.0.12        glue_1.7.0         DBI_1.2.2         
[61] tweenr_2.0.3       rstudioapi_0.15.0  jsonlite_1.8.8     R6_2.5.1          
[65] units_0.8-5

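A Grammar for Data Graphics

`ggplot2`

This lecture covers data graphics built with ggplot2, which is part of the tidyverse.
R can also produce static two-dimensional data graphics with its built-in base graphics and the lattice system, but ggplot2 builds them systematically from a grammar of graphics.

Key elements of the ggplot2 grammar of graphics

Aesthetic: an explicit mapping between a variable and the visual cue that represents its values
Glyph: the basic graphical element that represents one case (observational unit); also called a 'mark' or 'symbol'.

Example: scatterplot

Glyph: a point (or mark)
Visual cue: the horizontal and vertical position of the glyph, which helps the reader judge how large the corresponding quantities are
Aesthetic: the mapping that defines this correspondence
- With more than two variables, additional aesthetics bring in other visual cues
- Some visual cues, such as direction in a time series, are implicit and have no corresponding aesthetic

Data: the mdsr::CIACountries data table contains seven variables collected for each of 236 countries.
- population (pop), area (area), gross domestic product (gdp), percentage of GDP spent on education (educ), length of roadways per unit area (roadways), Internet use as a fraction of the population (net_users), and barrels of oil produced per day (oil_prod)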
mdsr::CIACountries %>% select(-area, -pop) %>% head

         country oil_prod   gdp educ   roadways net_users
1    Afghanistan        0  1900   NA 0.06462444       >5%
2        Albania    20510 11900  3.3 0.62613051      >35%
3        Algeria  1420000 14500  4.3 0.04771929      >15%
4 American Samoa        0 13000   NA 1.21105528      <NA>
5        Andorra       NA 37200   NA 0.68376068      >60%
6         Angola  1742000  7300  3.5 0.04125211      >15%

Aesthetics

g <- ggplot(data = CIACountries, aes(y = gdp, x = educ))
g + geom_point(size = 3)

Figure 1: Scatterplot using only the position aesthetic for glyphs.

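The ggplot() command creates the plot object g.
data: any variable mentioned anywhere in the plot is understood to live in the CIACountries data frame given in the data argument
ggplot2 graphics are built incrementally, element by element.

g has two aesthetics: aes() maps the vertical (y) coordinate to the gdp variable and the horizontal (x) coordinate to the educ variable.
The only glyph is a point, added with geom_point().
- The arguments to geom_point() specify where and how the points are drawn.
- The size argument changes the size of every glyph.

In Figure 1, size is not an aesthetic.
- Every point has the same size, so no variable is mapped to this visual cue
Each point represents one country (why?).

Besides mapping variables to the axes, variables can be mapped to other visual properties:
- col: the color of the glyph
- label: the text label drawn for the glyph
- size: the size of the glyph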
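The difference between setting a visual property and mapping a variable to it can be sketched with a small toy data frame. The values below are hypothetical (they are not taken from CIACountries), and `layer_data()` is used only to inspect what ggplot2 computed for each glyph.

```r
library(ggplot2)

# Toy data (hypothetical values, not taken from CIACountries)
toy <- data.frame(
  educ     = c(3.3, 4.3, 3.5, 5.1),
  gdp      = c(11900, 14500, 7300, 37200),
  roadways = c(0.63, 0.05, 0.04, 0.68)
)

base <- ggplot(toy, aes(x = educ, y = gdp))

# size is *set*: a constant passed outside aes(), identical for every glyph
p_set <- base + geom_point(size = 3)

# size is *mapped*: a variable inside aes(), so it becomes an aesthetic
p_map <- base + geom_point(aes(size = roadways))

# layer_data() returns the values ggplot2 computed for each glyph
unique(layer_data(p_set, 1)$size)          # a single constant
length(unique(layer_data(p_map, 1)$size))  # one size per distinct roadways value
```

Only the mapped version produces a legend, because only there does size carry information about a variable.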

Glyphs with several attributes

g + geom_point(aes(color = net_users), size = 3)

Figure 2: Scatterplot in which net_users is mapped to color.

Extends Figure 1 by mapping the color of each point to the categorical net_users variable (an added aesthetic)

Changing the glyph

g + geom_text(aes(label = country, color = net_users), size = 3)

Figure 3: Scatterplot using both location and label as aesthetics.

Point -> text

More aesthetics

g + geom_point(aes(color = net_users, size = roadways))

Figure 4: Scatterplot in which net_users is mapped to color and roadways is mapped to size.

educ -> position along the horizontal axis
gdp -> position along the vertical axis
net_users -> color
roadways -> point size

Glyph-ready data

Each point represents one country (why?).

ggplot(data = CIACountries)

Not every data frame is like this; see Chapter 6.

Scales

In Figure 4, GDP has a right-skewed distribution, so differences among the small values are hard to see.

g + geom_point(aes(color = net_users, size = roadways)) +
    coord_trans(y = "log10")

Figure 5: Scatterplot using a logarithmic transformation of GDP that helps to mitigate visual clustering caused by the right-skewed distribution of GDP among countries.

Linear scale -> (base-10) logarithmic scale (coord_trans())
Not all scales are about position: net_users -> color (qualitative); roadways -> point size
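As a small sketch of what the log transformation does, the code below compares `coord_trans()` with `scale_y_log10()`. The latter is an alternative that the lecture does not use; it is shown only for contrast. The three rows are copied from the CIACountries preview printed earlier.

```r
library(ggplot2)

# Three rows copied from the CIACountries preview
toy <- data.frame(
  country = c("Albania", "Algeria", "Angola"),
  educ    = c(3.3, 4.3, 3.5),
  gdp     = c(11900, 14500, 7300)
)

base <- ggplot(toy, aes(x = educ, y = gdp)) + geom_point()

p_coord <- base + coord_trans(y = "log10")  # transforms the coordinate system
p_scale <- base + scale_y_log10()           # transforms the data on the y scale

layer_data(p_coord, 1)$y  # still the raw gdp values
layer_data(p_scale, 1)$y  # log10(gdp)
```

Both plots place the points the same way, but a scale transformation is applied before any statistical summaries are computed, while a coordinate transformation is applied afterwards. The distinction matters once a layer such as `geom_smooth()` is added.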

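Guides

Give meaning to visual cues by providing context
Position: axes, tick marks, labels
Legends: explain mappings such as net_users -> color

Facets

Multiple side-by-side plots

Showing too many aesthetics (shape, color, size, and so on) in one plot can overwhelm the reader with information.
Faceting draws one panel per category side by side, which often conveys the same information more effectively.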

`facet_wrap()`

g + geom_point(alpha = 0.9, aes(size = roadways)) +
    coord_trans(y = "log10") +
    facet_wrap(~net_users, nrow = 1) +
    theme(legend.position = "top")

Scatterplot using facets for different ranges of Internet connectivity.

Split by one categorical variable
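A minimal sketch of how `facet_wrap()` assigns glyphs to panels, using a hypothetical six-row data frame in place of CIACountries:

```r
library(ggplot2)

# Hypothetical data: six countries in three net_users categories
toy <- data.frame(
  educ      = c(3.3, 4.3, 3.5, 5.0, 6.1, 4.8),
  gdp       = c(11900, 14500, 7300, 37200, 41000, 9800),
  net_users = c(">35%", ">15%", ">15%", ">60%", ">60%", ">35%")
)

p <- ggplot(toy, aes(x = educ, y = gdp)) +
  geom_point() +
  facet_wrap(~net_users, nrow = 1)

# Every glyph is assigned to exactly one panel
table(layer_data(p, 1)$PANEL)
```

Each level of the faceting variable gets its own panel, and all panels share the same axes by default, which keeps them directly comparable.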

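Layers

It is often necessary to plot data from more than one data table

Medicare data

The MedicareCharges and MedicareProviders data tables hold information on the average cost of medical procedures in each US state.

In the MedicareCharges table, each row represents a different medical procedure (drg) in a given state, together with its average cost.

ChargesNJ <- MedicareCharges %>%
  filter(stateProvider == "NJ")  # New Jersey only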

p <- ggplot(
      data = ChargesNJ,
      aes(x = reorder(drg, mean_charge), y = mean_charge)
     ) +
     geom_col(fill = "gray") +
     ylab("Statewide Average Charges ($)") +
     xlab("Medical Procedure (DRG)") +
     theme(axis.text.x = element_text(angle = 90, hjust = 1, size = rel(0.5)))
p

Figure 6: Bar graph of average charges for medical procedures in New Jersey.

Aesthetics: drg (sorted in ascending order of mean_charge) -> x, mean_charge -> y
Glyph: bar (geom_col())

Comparison with other states

p + geom_point(data = MedicareCharges, size = 1, alpha = 0.3)

Figure 7: Bar graph adding a second layer to provide a comparison of New Jersey to other states. Each dot represents one state, while the bars represent New Jersey.

Glyphs: bars for New Jersey, points for every US state
It is easy to see that New Jersey's charges are among the highest in the country for every medical procedure
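The layering idea can be sketched with a toy table: each layer may carry its own `data` argument while inheriting the aesthetics of the plot. The charges below are hypothetical and only mimic the column names of MedicareCharges.

```r
library(ggplot2)

# Hypothetical charges for two procedures in three states
all_states <- data.frame(
  stateProvider = c("NJ", "NY", "CA", "NJ", "NY", "CA"),
  drg           = c("A", "A", "A", "B", "B", "B"),
  mean_charge   = c(90, 60, 80, 40, 30, 35)
)
nj_only <- subset(all_states, stateProvider == "NJ")

p <- ggplot(nj_only, aes(x = reorder(drg, mean_charge), y = mean_charge)) +
  geom_col(fill = "gray") +                   # layer 1: inherits data = nj_only
  geom_point(data = all_states, alpha = 0.3)  # layer 2: overrides the data

nrow(layer_data(p, 1))  # one bar per procedure in New Jersey
nrow(layer_data(p, 2))  # one point per state and procedure
```

The second data frame must contain the variables named in the inherited `aes()`, here drg and mean_charge.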

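Canonical data graphics in R

Univariate displays

Canonical data graphics in statistics (Tukey, 1990) are not flashy, but they are simple and effective
Understanding the distribution of a single variable is often a necessary step

Histogram

Map the math score (math) in the SAT_2010 data to x (numeric).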

g <- ggplot(data = SAT_2010, aes(x = math))

geom_histogram()

g + geom_histogram(binwidth = 10) + labs(x = "Average Math SAT score")

Figure 8: Histogram showing the distribution of math SAT scores by state.

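Adjust the bin width with the binwidth argument, trying several values to find the one that best suits the data
Horizontal axis: SAT math score (numeric)
- Linear scale
- Visual cues: position and direction
Coordinate system: Cartesian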
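The effect of binwidth can be checked numerically. The sketch below uses simulated scores (hypothetical, not the SAT_2010 data) and counts the bars produced by two different bin widths.

```r
library(ggplot2)

# Simulated scores (hypothetical; not the SAT_2010 data)
set.seed(1)
scores <- data.frame(math = rnorm(50, mean = 540, sd = 40))

g_toy  <- ggplot(scores, aes(x = math))
wide   <- g_toy + geom_histogram(binwidth = 40)
narrow <- g_toy + geom_histogram(binwidth = 5)

# Each row of layer_data() is one bar of the histogram
nrow(layer_data(wide, 1))
nrow(layer_data(narrow, 1))

# Whatever the bin width, the bars account for all 50 observations
sum(layer_data(wide, 1)$count)
sum(layer_data(narrow, 1)$count)
```

Wide bins hide detail and narrow bins show noise, so the choice is a trade-off that depends on the sample size.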

Density plot

geom_density() applies kernel smoothing to the same data

g + geom_density(adjust = 0.3)

Figure 9: Density plot showing the distribution of average math SAT scores by state.

adjust: plays a role similar to binwidth in geom_histogram() (it scales the kernel bandwidth)
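How adjust scales the bandwidth can be seen directly with base R's `density()`, which is the function that the default density computation in ggplot2 relies on. The scores are simulated and hypothetical.

```r
# Simulated scores (hypothetical; not the SAT_2010 data)
set.seed(1)
math <- rnorm(50, mean = 540, sd = 40)

d_default <- density(math)                # adjust = 1
d_narrow  <- density(math, adjust = 0.3)  # 30% of the default bandwidth

d_default$bw
d_narrow$bw
all.equal(d_narrow$bw, 0.3 * d_default$bw)
```

A smaller bandwidth follows the data more closely and gives a bumpier curve, much as a smaller binwidth gives a more jagged histogram.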

Bar graph

Average math score by state (state) in the SAT_2010 data (categorical)
geom_col()

bc <- ggplot(
  data = head(SAT_2010, 10),  # only the first 10 states (in alphabetical order)
  aes(x = reorder(state, math), y = math)  # sort the states by their average math SAT score
) +
  geom_col() +
  labs(x = "State", y = "Average Math SAT score")

Figure 10: A bar plot showing the distribution of average math SAT scores for a selection of states.

Stacked bar plot

Displays a contingency table

ggplot(data = mosaicData::HELPrct, aes(x = homeless)) + 
  geom_bar(aes(fill = substance), position = "fill") +
  scale_fill_brewer(palette = "Spectral") + 
  coord_flip()

Figure 11: A stacked bar plot showing the distribution of substance of abuse for participants in the HELP study.

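Multivariate displays

Convey the relationship between two or more variables

Scatterplot

Relationship between two numeric variables
Coordinate system: Cartesian, x = (variable 1), y = (variable 2)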

g <- ggplot(
  data = SAT_2010, 
  aes(x = expenditure, y = math)   # expenditure per pupil (1k USD) 
) + 
  geom_point()
g

Figure 12: A scatter plot showing the relationship between the average SAT math score and the expenditure per pupil.

Adding a linear regression line to the scatterplot describes the relationship between the two variables more clearly.

g <- g + 
  geom_smooth(method = "lm", se = FALSE) + 
  xlab("Average expenditure per student ($1000)") +
  ylab("Average score on math SAT")
g

Figure 13: A scatter plot with the simple linear regression line showing the relationship between the average SAT math score and the expenditure per pupil.

Added variable: SAT_rate, the SAT participation rate (sat_pct) cut into low, medium, and high
Added aesthetic: SAT_rate -> color

SAT_2010 <- SAT_2010 %>%
  mutate(
    SAT_rate = cut(
      sat_pct, 
      breaks = c(0, 30, 60, 100), 
      labels = c("low", "medium", "high")
    )
  )
g <- g %+% SAT_2010  # update the data frame that is bound to our plot
g + aes(color = SAT_rate)

Figure 14: Scatterplot using the color aesthetic to separate the relationship between two numeric variables by a third categorical variable.
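To see how `cut()` assigns the three levels, here is a small sketch with hypothetical participation rates. By default the intervals are closed on the right, that is (0, 30], (30, 60], and (60, 100].

```r
sat_pct_demo <- c(5, 30, 45, 60, 87)  # hypothetical participation rates (%)

rate_demo <- cut(
  sat_pct_demo,
  breaks = c(0, 30, 60, 100),
  labels = c("low", "medium", "high")
)
rate_demo

# A value equal to the lowest break falls outside every interval and becomes NA
zero_demo <- cut(0, breaks = c(0, 30, 60, 100), labels = c("low", "medium", "high"))
zero_demo
```

If a rate of exactly 0 were possible in the data, `include.lowest = TRUE` would keep it in the first interval.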

Faceting: facet_wrap()

g + facet_wrap(~ SAT_rate)

NHANES data

Data from the US National Health and Nutrition Examination Survey, including each person's body measurements, sex, age, and more
Height vs Age, by Gender

ggplot(
  data = slice_sample(NHANES::NHANES, n = 1000), 
  aes(x = Age, y = Height, color = fct_relevel(Gender, "male")) # reset factor levels
) + 
  geom_point() + 
  geom_smooth() + 
  xlab("Age (years)") + 
  ylab("Height (cm)") +
  labs(color = "Gender")

Figure 16: A scatterplot for 1,000 random individuals from the NHANES study. Note how mapping gender to color illuminates the differences in height between men and women.

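Time series

A scatterplot with time on the horizontal axis, with the points connected by lines to show temporal continuity
whately_2015: 2015 weather observations from western Massachusetts, USA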

wp <- ggplot(data = macleish::whately_2015, aes(x = when, y = temperature)) + 
  geom_line(color = "darkgray") + 
  geom_smooth() + 
  xlab(NULL) + 
  ylab("Temperature (degrees Celsius)")

Figure 17: A time series showing the change in temperature at the MacLeish field station in 2015.

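Measured every 10 minutes, so the series fluctuates heavily over time. Is time at this resolution a suitable explanatory variable?

Box-and-whisker plot

Displays the relationship between a numeric and a categorical variable
Grouping time into months turns it into a categorical variable, so we can examine the relationship between month and temperature.

whately_2015 %>%
  mutate(month = as.factor(lubridate::month(when, label = TRUE))) %>%
  group_by(month) %>%
  skim(temperature) %>%
  select(-na)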

Variable type: numeric

var	month	n	mean	sd	p0	p25	p50	p75	p100
temperature	Jan	4464	-6.37	5.14	-22.28	-10.26	-6.25	-2.35	6.16
temperature	Feb	4032	-9.26	5.11	-22.21	-12.26	-9.43	-5.50	4.27
temperature	Mar	4464	-0.87	5.06	-16.16	-4.61	-0.55	2.99	13.47
temperature	Apr	4320	8.04	5.51	-3.04	3.77	7.61	11.79	22.68
temperature	May	4464	17.36	5.94	2.29	12.84	17.48	21.43	31.38
temperature	Jun	4320	17.75	5.11	6.53	14.20	17.95	21.23	29.45
temperature	Jul	4464	21.56	3.90	12.05	18.56	21.22	24.30	32.11
temperature	Aug	4464	21.45	3.79	12.86	18.42	21.07	24.29	31.15
temperature	Sep	4320	19.28	5.07	5.43	15.75	19.00	22.51	33.08
temperature	Oct	4464	9.79	5.00	-3.97	6.58	9.49	13.33	22.30
temperature	Nov	4320	7.28	5.65	-4.84	3.14	7.11	10.81	22.81
temperature	Dec	4464	4.95	4.59	-6.16	1.61	5.15	8.38	18.44

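bp <- ggplot(
  data = whately_2015,
  aes(
    x = lubridate::month(when, label = TRUE),
    y = temperature
  )
) +
  geom_boxplot() +
  xlab("Month") +
  ylab("Temperature (degrees Celsius)")

Figure 18: A box-and-whisker plot of temperatures by month at the MacLeish field station.

Each box summarizes one month graphically: the box spans Q1 to Q3 with a line at the median, the whiskers reach the most extreme observations within 1.5 times the interquartile range of the box, and any points beyond the whiskers are drawn individually as outliers.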
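As a rough base-R analogue of what a box-and-whisker glyph encodes, `boxplot.stats()` returns the whisker ends, hinges, and median, plus any outliers. The vector below is hypothetical. Note that `geom_boxplot()` computes quartiles with `quantile()` rather than Tukey's hinges, so its box edges can differ slightly from these.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # hypothetical values with one extreme point

bs <- boxplot.stats(x)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # points beyond 1.5 times the box length, drawn individually

max(x)    # the maximum is an outlier here, so the upper whisker stops short of it
```

The upper whisker ends at the largest observation that is not an outlier, which is why a box plot does not always show the minimum and maximum as whisker ends.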

Mosaic plot

Used when both the explanatory and the response variable are categorical
Proportion of people with diabetes by age and BMI

mosaic_to_plot <- NHANES %>%
  filter(Age > 19) %>%
  mutate(AgeDecade = droplevels(AgeDecade)) %>%
  select(AgeDecade, Diabetes, BMI_WHO) %>% 
  na.omit()

mp <- ggplot(mosaic_to_plot) +
  geom_mosaic(
    aes(x = product(BMI_WHO, AgeDecade), fill = Diabetes)
  ) + 
  ylab("BMI") + 
  xlab("Age (by decade)") + 
  coord_flip()

Figure 19: Mosaic plot (eikosogram) of diabetes by age and weight status (BMI).

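The area of each box is proportional to the number of observations in the cell
Diabetes is more common among people who are older and obese.

Summary of basic data graphics

Response (`y`)	Explanatory (`x`)	Plot type	`geom_*()`
(none)	numeric	histogram, density	`geom_histogram()`, `geom_density()`
(none)	categorical	stacked bar	`geom_bar()`
numeric	numeric	scatter	`geom_point()`
numeric	categorical	box-and-whisker	`geom_boxplot()`
categorical	categorical	mosaic	`geom_mosaic()`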

Maps

Choropleth map: the color of each region reflects the value of a variable
Oil production by country, from CIACountries

Figure 20: A choropleth map displaying oil production by countries around the world in barrels per day.

Networks

Show relationships among entities called vertices (nodes) through connections called edges

`NCI60` data

Contains more than 40,000 probes for gene expression in 60 cancer cell lines

CellEdges <- Cancer
SmallEdges <- head(CellEdges,200)
g <- SmallEdges %>%
  select(cellLine, otherCellLine, correlation) %>%
  as_tbl_graph(directed = FALSE) %>%
  mutate(type = substr(name, 0, 2))
cellnet <- ggraph::ggraph(g, layout = 'kk') +
  geom_edge_arc(aes(width = correlation), color = "lightgray", strength = 0.2) +
  geom_node_point(aes(color = type), size = 10, alpha = 0.6) +
  geom_node_text(aes(label = type)) + 
  scale_edge_width_continuous(range = c(0.1, 1)) +
  guides(color = guide_legend(override.aes = list(size = 6))) + 
  theme_void() + 
  coord_cartesian(clip = "off")

A network diagram displaying the relationship between types of cancer cell lines.

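Correlation network among selected cell lines. Vertex = cell line; color = cancer type (ovarian, colon, central nervous system, melanoma, renal, breast, lung)

Melanoma cell lines (ME) appear closely related to one another but not much to other cell lines.
- The same holds for colon (CO) and central nervous system (CN) cell lines
Lung cancers, in contrast, tend to be associated with several other types of cancer

A history of baby names

How to tell someone's age from their name

Reproduce the FiveThirtyEight analysis
babynames data: public data from the US Social Security Administration (SSA)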

BabynamesDist <- mdsr::make_babynames_dist()
BabynamesDist

# A tibble: 1,639,722 × 9
    year sex   name          n   prop alive_prob count_thousands age_today
   <dbl> <chr> <chr>     <int>  <dbl>      <dbl>           <dbl>     <dbl>
 1  1900 F     Mary      16706 0.0526          0           16.7        114
 2  1900 F     Helen      6343 0.0200          0            6.34       114
 3  1900 F     Anna       6114 0.0192          0            6.11       114
 4  1900 F     Margaret   5304 0.0167          0            5.30       114
 5  1900 F     Ruth       4765 0.0150          0            4.76       114
 6  1900 F     Elizabeth  4096 0.0129          0            4.10       114
 7  1900 F     Florence   3920 0.0123          0            3.92       114
 8  1900 F     Ethel      3896 0.0123          0            3.90       114
 9  1900 F     Marie      3856 0.0121          0            3.86       114
10  1900 F     Lillian    3414 0.0107          0            3.41       114
# ℹ 1,639,712 more rows
# ℹ 1 more variable: est_alive_today <dbl>

Proportion still alive

Estimate how many of the people given each name are alive today
Start by reproducing the plot for the boys' name Joseph

joseph <- BabynamesDist %>%
  filter(name == "Joseph" & sex == "M")
name_plot <- ggplot(data = joseph, aes(x = year))

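geom_col() draws a bar graph of the number of people born each year who are estimated to be alive today

name_plot <- name_plot +
  geom_col(
    aes(y = count_thousands * alive_prob), 
    fill = "#b2d7e9", 
    color = "white",
    linewidth = 0.1  # line widths use linewidth (size is deprecated for lines since ggplot2 3.4.0)
  )

geom_line() adds a line graph showing the number of people born each year as a continuous curve

name_plot <- name_plot + 
  geom_line(aes(y = count_thousands), linewidth = 2)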
name_plot <- name_plot +
  ylab("Number of People (thousands)") + 
  xlab(NULL)
name_plot

Compute the median year of birth (weighted by the estimated number of people alive today)

wtd_quantile <- Hmisc::wtd.quantile   # rename
median_yob <- joseph %>%
  summarize(
    year = wtd_quantile(year, est_alive_today, probs = 0.5)
  ) %>% 
  pull(year)
median_yob

 50% 
1975
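The idea behind a weighted median can be sketched by hand. The helper `weighted_median()` below is hypothetical and deliberately simple: it does no interpolation, so it is not a reimplementation of `Hmisc::wtd.quantile()` and may give slightly different answers on real data. The birth years and weights are made up.

```r
# Simplified weighted median: the smallest value at which the cumulative
# weight reaches half of the total weight (no interpolation)
weighted_median <- function(x, w) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  x[which(cumsum(w) >= sum(w) / 2)[1]]
}

# Hypothetical birth years and estimated numbers alive
yob   <- c(1950, 1960, 1970, 1980)
alive <- c(1, 1, 5, 3)

weighted_median(yob, alive)      # weighted by the number alive
median(rep(yob, times = alive))  # same idea: repeat each year `alive` times
```

Weighting by est_alive_today means a birth year counts once for every person from that year who is estimated to be alive, which is what "the median living Joseph" refers to.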

Mark the median year of birth on the plot

name_plot <- name_plot +
  geom_col(
    color = "white", fill = "#008fd5", 
    aes(y = ifelse(year == median_yob, est_alive_today / 1000, 0))
  )

Set the plot title and add context to complete the final plot

context <- tribble(
  ~year, ~num_people, ~label,
  1935, 40, "Number of Josephs\nborn each year",
  1915, 13, "Number of Josephs\nborn each year
  \nestimated to be alive\non 1/1/2014", 
  2003, 40, "The median\nliving Joseph\nis 37 years old", 
)

joe <- name_plot +
  ggtitle("Age Distribution of American Boys Named Joseph") + 
  geom_text(
    data = context, 
    aes(y = num_people, label = label, color = label)
  ) + 
  geom_curve(
    x = 1990, xend = 1974, y = 40, yend = 24, 
    arrow = arrow(length = unit(0.3, "cm")), curvature = 0.5
  ) + 
  scale_color_manual(
    guide = "none", 
    values = c("black", "#b2d7e9", "darkgray")
  ) + 
  ylim(0, 42)

Figure 21: Recreation of the age distribution of “Joseph” plot.

Replace the data bound to name_plot to get the same kind of plot for another name

name_plot %+% filter(
  BabynamesDist, 
  name == "Josephine" & sex == "F"
)

Figure 22: Age distribution of American girls named “Josephine.”

Same name, different sexes

facet_wrap()

names_plot <- name_plot + 
  facet_wrap(~sex)
names_plot %+% filter(BabynamesDist, name == "Jessie")

Figure 23: Comparison of the name “Jessie” across two genders.

Sex distribution of several names

facet_grid()

many_names_plot <- name_plot + 
  facet_grid(name ~ sex)
mnp <- many_names_plot %+% filter(
  BabynamesDist, 
  name %in% c("Jessie", "Marion", "Jackie")
)
mnp + facet_grid(sex ~ name)

Figure 24: Gender breakdown for the three most unisex names.

Most common women's names

First, find the 25 most common female names among people estimated to be alive today.

com_fem <- BabynamesDist %>%
  filter(n > 100, sex == "F") %>% 
  group_by(name) %>%
  mutate(wgt = est_alive_today / sum(est_alive_today)) %>%
  summarize(
    N = n(), 
    est_num_alive = sum(est_alive_today),
    quantiles = list(
      wtd_quantile(
        age_today, est_alive_today, probs = 1:3/4, na.rm = TRUE
      )
    )
  ) %>%
  mutate(measures = list(c("q1_age", "median_age", "q3_age"))) %>%
  unnest(cols = c(quantiles, measures)) %>%
  pivot_wider(names_from = measures, values_from = quantiles) %>%
  arrange(desc(est_num_alive)) %>%
  head(25)

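Keep only the female records, estimate the number of people with each name who are alive today, sort by that estimate, and take the top 25 names.
Compute the median age and the first and third quartiles of age for the people with each name.
Map y to median_age and x to name, sorted in descending order of y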
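How `reorder()` with a negated score yields a descending order can be checked on a tiny example. The names and ages are hypothetical.

```r
name_demo   <- c("Ann", "Bea", "Cat")  # hypothetical names
median_demo <- c(10, 30, 20)           # hypothetical median ages

# reorder() sorts factor levels by increasing score, so negating the score
# puts the largest median age first
f <- reorder(name_demo, -median_demo)
levels(f)
```

ggplot2 draws the levels of a discrete variable in factor-level order, so the level order set here determines the order of the names along the axis.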

w_plot <- ggplot(
  data = com_fem, 
  aes(x = reorder(name, -median_age), y = median_age)
) + 
  xlab(NULL) + 
  ylab("Age (in years)") + 
  ggtitle("Median ages for females with the 25 most common names")

geom_linerange() draws the gold bars that span the first to third quartiles.

w_plot <- w_plot + 
  geom_linerange(
    aes(ymin = q1_age, ymax = q3_age), 
    color = "#f3d478", 
    linewidth = 4.5,  # size is deprecated for lines since ggplot2 3.4.0
    alpha = 0.8
  )
w_plot

geom_point() adds a point marking the median age for each name

w_plot <- w_plot +
  geom_point(
    fill = "#ed3324", 
    color = "white", 
    size = 2, 
    shape = 21
  )
w_plot

Add context to complete the final figure
The names are too long to display well along the horizontal axis, so flip the coordinates with coord_flip()

context <- tribble(
  ~median_age, ~x, ~label, 
  65, 24, "median",
  29, 16, "25th", 
  48, 16, "75th percentile",
)

age_breaks <- 1:7 * 10 + 5

fp <- w_plot + 
  geom_point(
    aes(y = 60, x = 24), 
    fill = "#ed3324", 
    color = "white", 
    size = 2, 
    shape = 21
  ) + 
  geom_text(data = context, aes(x = x, label = label)) + 
  geom_point(aes(y = 24, x = 16), shape = 17) + 
  geom_point(aes(y = 56, x = 16), shape = 17) +
  geom_hline(
    data = tibble(x = age_breaks), 
    aes(yintercept = x), 
    linetype = 3
  ) +
  scale_y_continuous(breaks = age_breaks) + 
  coord_flip()

Figure 25: Recreation of FiveThirtyEight’s plot of the age distributions for the 25 most common women’s names.

Further resources

ggplot2 cheat sheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf