Chapter 11

Textbook 11.2.2

Problem 5

Textbook 11.3.5

Problem 4, 7

Extra questions

Package quantmod fetches financial data from public-domain sources, e.g., Yahoo! Finance (http://finance.yahoo.com). You can get KOSPI data as well:

library(quantmod)

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Version 0.4-0 included new data defaults. See ?getSymbols.

options("getSymbols.warning4.0"=FALSE) # to suppress warnings 
skt <- getSymbols("017670.KS", auto.assign=FALSE)  # 017670 is the KOSPI ticker code; ".KS" is the Yahoo! Finance suffix
head(skt)

##            017670.KS.Open 017670.KS.High 017670.KS.Low 017670.KS.Close
## 2007-01-02         219000         223000        219000          222000
## 2007-01-03         222000         223000        218000          218000
## 2007-01-04         217500         221000        215000          220500
## 2007-01-05         218000         222500        216000          222500
## 2007-01-08         223000         225500        220500          223000
## 2007-01-09         222000         222500        218000          219000
##            017670.KS.Volume 017670.KS.Adjusted
## 2007-01-02            97786           116144.9
## 2007-01-03           105863           114052.1
## 2007-01-04           142449           115360.1
## 2007-01-05           148605           116406.5
## 2007-01-08           176020           116668.0
## 2007-01-09           137777           114575.4

The variable skt is an object of class xts, which is similar to a data frame or tibble but is designed to handle time series easily. The fourth column is the closing price (the sixth column, 017670.KS.Adjusted, is the price adjusted for dividends and splits), and you can plot it using R’s default plot():

plot(skt$`017670.KS.Close`)
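
To make the structure of an xts object concrete, here is a minimal sketch on a toy series. The object toy and its dates and prices are made up for illustration and are not part of the assignment data. It only shows where an xts object keeps its dates and its values; it is not meant as the intended solution to the tasks that follow.

```r
library(xts)  # also attaches zoo, which provides index() and coredata()

# Hypothetical toy series standing in for skt; dates and prices are made up.
dates <- as.Date("2021-01-04") + 0:4
toy <- xts(cbind(Close = c(100, 102, 101, 105, 107)), order.by = dates)

class(toy)      # "xts" "zoo"
index(toy)      # the Date index, stored separately from the data
coredata(toy)   # the underlying numeric matrix, without the dates

# After conversion to a data frame, the dates are kept only as row names.
df <- as.data.frame(toy)
rownames(df)
```

The dates in rownames(df) are character strings, so they still need to be converted before they can serve as a time axis.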

Suppose you want to plot the closing price above using ggplot2 instead of R’s default plot. Unfortunately, ggplot2 does not support xts objects. Your first task is to convert skt into a tibble. How would you do this? Once you succeed in this task, the next step is to plot the closing prices using ggplot(). The conversion process won’t show the date information, but it is hidden in rownames(). Add a new date variable to your converted tibble, and plot the closing prices using the geom_line primitive. (Hint. base::as.Date())
Using quantmod::getFX(), download the most recent 180-day history of the USD/KRW and JPY/KRW exchange rates. Then calculate skt’s adjusted closing prices in USD and in JPY, and plot the three time series using ggplot2. Since the currency scales differ widely, normalize the USD and JPY prices so that their initial prices coincide with the initial KRW price. A problem with this analysis is that the time points in the stock price data and the exchange rate data do not always coincide. Explain how you extract the common time points.
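
One possible way to align two series on their common dates is an inner join with merge() from xts. This is a sketch under stated assumptions: the series price and rate, and all their values, are hypothetical, and the rate is assumed to be quoted as KRW per 1 USD, so the USD price is the KRW price divided by the rate. Other approaches, such as joining tibbles by date, would work as well.

```r
library(xts)

# Hypothetical data: a KRW price series and a KRW-per-USD rate series
# whose dates only partly overlap.
price <- xts(c(100, 102, 101, 105),
             order.by = as.Date(c("2021-03-01", "2021-03-02",
                                  "2021-03-03", "2021-03-04")))
rate <- xts(c(1120, 1125, 1130),
            order.by = as.Date(c("2021-03-02", "2021-03-04", "2021-03-05")))

# join = "inner" keeps only the dates present in both series.
both <- merge(price, rate, join = "inner")
colnames(both) <- c("krw", "usdkrw")

# Convert to USD, then rescale so the first USD value equals the first KRW value.
krw <- as.numeric(both$krw)
usd <- krw / as.numeric(both$usdkrw)
usd_scaled <- usd * krw[1] / usd[1]

index(both)   # the common dates
```

In this toy example only 2021-03-02 and 2021-03-04 appear in both series, so both has two rows.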

In this question, you will practice data cleansing as well as computational statistical inference. We use a 2020 Census data set from KOSIS (KOrean Statistical Information Service), available at http://kosis.kr/statisticsList/statisticsList_01List.jsp?vwcd=MT_ZTITLE&parmTabId=M_01_01.
1. Download Seoul’s district population data as follows. Select “인구 -> 인구총조사 -> 인구부문 -> 총조사인구(2015년 이후) -> 전수부문 (등록센서스, 2015년 이후) -> 전수기본표 -> 연령 및 성별 인구 - 읍면동(2015,2020), 시군구(2016~2019) 수록기간 년 2015~2020” (Population -> Population Census -> Population -> Census population (2015 and later) -> Complete enumeration (register-based census, 2015 and later) -> Basic complete-enumeration tables -> Population by age and sex - eup/myeon/dong (2015, 2020), si/gun/gu (2016~2019), coverage period: yearly, 2015~2020). This opens a new tab. In the new tab, select the “행정구역별(읍면동)” (by administrative district) tab, uncheck “1레벨 전체선택” (select all, level 1), and check only “서울특별시” (Seoul). Then check “2레벨 전체선택” (select all, level 2). After that, select the “연령별” (by age) tab, uncheck “1레벨 전체선택”, and check only “합계” (total). Click the “통계표조회” (view table) icon to download the data in “EXCEL(xlsx)” format with “셀병합” (merge cells) unchecked. Open the downloaded file in Microsoft Excel and save it in CSV format. Now read the data into R using the tidyverse function read_csv(). This data set is not as clean as the nycflights13 data set used in class: there are two header lines that are peripheral to the core information, and the numerical values are expressed as strings with commas, e.g., "9,631,482" instead of 9631482. Also, for some reason it contains districts of other cities. Use the help command ?read_csv to learn about the function, and design an R expression that gives you a tibble called seoul2020 that has 25 rows and 10 columns.
2. The resulting tibble is not tidy. Tidy seoul2020 to create a new tibble named seoul2020tidy. Explain your reasoning.
3. The major data cleansing task is to get rid of the commas in the numerical values. Study the “Parsing a vector” section of Lecture 5 and convert all the columns where numerical values are expressed as strings with commas into numeric vectors.
4. Finally, data analysis. Take the column corresponding to the total population of each district from the final data frame (this should be the second column if the previous steps were done correctly) and store it as a vector in pops. Plot the histogram of pops, and compare it with the histogram of a normal random vector of the same length, with the same mean and variance.
5. Consider the numbers in the finite vector pops as the population distribution. Draw a sample of size 10 with replacement from this population. Is the mean of this sample the same as the mean of the population? Draw another sample of size 10 with replacement. Is the sample mean the same as before? Why or why not?
6. Using the function replicate(), repeat the sampling 100 times and take the mean of the sample means. Increase the number of replications from 100 to 1000, and then to 10000, and compare the means to the population mean. What do you observe? What is the phenomenon that you observe called?
7. Repeat part 5 with the sample size increased from 10 to 25, and then to 100. What can you say about the numbers?
8. Again using replicate(), plot the histogram of 10000 sample means for sample sizes 10, 25, and 100. Do the histograms look like hist(pops), or something else? What is this phenomenon called?
9. Now suppose you are given a new data set of size 10 whose mean is 204885. Do you think this data set is taken from Seoul’s population? Justify your answer by comparing the new sample mean to the distribution of sample means considered in the preceding parts (4 through 8). (Hint. quantile(). Use the 2.5% and 97.5% quantiles.)
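
The mechanics used in the parts above can be sketched on made-up numbers. In the sketch below, only the string "9,631,482" comes from the problem statement. The other strings, the synthetic vector pops, and the helper sample_mean() are hypothetical and merely stand in for the real data. readr::parse_number() is one possible tool for the comma problem, and others may be equally suitable.

```r
library(readr)

# Strings with grouping commas, as they appear in the downloaded CSV.
# Only the first value is from the problem statement; the rest are made up.
raw <- c("9,631,482", "149,384", "125,240")
parse_number(raw)   # 9631482 149384 125240

# Hypothetical stand-in for the 25 district populations.
set.seed(1)
pops <- round(runif(25, min = 1e5, max = 7e5))

# Mean of one sample of size n drawn with replacement from pops.
sample_mean <- function(n) mean(sample(pops, size = n, replace = TRUE))

# Repeat the sampling many times and collect the sample means.
means_10 <- replicate(10000, sample_mean(10))

mean(means_10)                        # compare with mean(pops)
mean(pops)
q <- quantile(means_10, c(0.025, 0.975))   # central 95% range of the sample means
q
```

Changing the argument of sample_mean() and the first argument of replicate() gives the other sample sizes and replication counts asked for above.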

326.212 Homework 2

Due Oct 12, 2021 @ 11:59pm

Chapter 5

Textbook 5.2.4

Textbook 5.3.1

Textbook 5.4.1

Textbook 5.5.2

Textbook 5.6.7

Textbook 5.7.1

Chapter 7

Textbook 7.3.4

Textbook 7.5.1.1

Textbook 7.5.2.1

Textbook 7.5.3.1

Chapter 10

Textbook 10.5

Chapter 11

Textbook 11.2.2

Textbook 11.3.5

Extra questions