The quantmod package fetches financial data from publicly available sources, e.g., Yahoo! Finance (http://finance.yahoo.com). You can get KOSPI data as well:
library(quantmod)
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.
options("getSymbols.warning4.0"=FALSE) # to suppress warnings
skt <- getSymbols("017670.KS", auto.assign=FALSE) # 017670 is the stock's KOSPI ticker code
head(skt)
## 017670.KS.Open 017670.KS.High 017670.KS.Low 017670.KS.Close
## 2007-01-02 219000 223000 219000 222000
## 2007-01-03 222000 223000 218000 218000
## 2007-01-04 217500 221000 215000 220500
## 2007-01-05 218000 222500 216000 222500
## 2007-01-08 223000 225500 220500 223000
## 2007-01-09 222000 222500 218000 219000
## 017670.KS.Volume 017670.KS.Adjusted
## 2007-01-02 97786 116144.9
## 2007-01-03 105863 114052.1
## 2007-01-04 142449 115360.1
## 2007-01-05 148605 116406.5
## 2007-01-08 176020 116668.0
## 2007-01-09 137777 114575.4
The variable skt is of class xts, which is similar to a data frame or tibble but designed to handle time series easily. The fourth column is the closing price (the sixth column is the price adjusted for dividends and splits), and you can plot it using R’s default plot():
plot(skt$`017670.KS.Close`)
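To see how an xts object differs from a data frame, the following minimal sketch builds a toy series and inspects where the dates are stored. The object names toy and toy_df and the prices are made up for illustration; they are not market data.

```r
library(xts)

# Toy series with made-up prices (hypothetical, not real market data).
dates <- as.Date("2024-01-01") + 0:4
toy <- xts(cbind(Close = c(100, 102, 101, 105, 107)), order.by = dates)

class(toy)      # an xts object is also a zoo object
index(toy)      # the dates live in the index, not in a column
coredata(toy)   # the underlying numeric matrix, without the dates

# Going through a data frame keeps the dates, but as character row names.
toy_df <- as.data.frame(toy)
rownames(toy_df)
```

The row names come back as character strings, so they still need to be converted before they can serve as a date axis.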
Suppose we want to use ggplot2 instead of R’s default plot. Unfortunately, ggplot2 does not support xts objects. Your first task is to convert skt into a tibble. How would you do this? Once you succeed in this task, the next step is to plot the closing prices using ggplot(). The conversion process won’t show the date information, but it is hidden in rownames(). Add a new date variable to your converted tibble, and plot the closing prices using the geom_line primitive. (Hint. base::as.Date())

Using quantmod::getFX(), download the recent 180-day history of the USD/KRW and JPY/KRW exchange rates. Then calculate skt’s adjusted closing prices in USD and in JPY, and plot the three time series using ggplot2. Since the currency scales differ greatly, normalize the USD and JPY prices so that their initial prices coincide with the initial KRW price. A problem with this analysis is that the time points in the stock price data and the exchange rate data do not always coincide. Explain how you extract the common time points.

In this question, you practice data cleansing as well as computational statistical inference. We use a 2020 Census data set from KOSIS (KOrean Statistical Information Service), available at http://kosis.kr/statisticsList/statisticsList_01List.jsp?vwcd=MT_ZTITLE&parmTabId=M_01_01.
Download Seoul’s district population data as follows. Select “인구 -> 인구총조사 -> 인구부문 -> 총조사인구(2015년 이후) -> 전수부문 (등록센서스, 2015년 이후) -> 전수기본표 -> 연령 및 성별 인구 - 읍면동(2015,2020), 시군구(2016~2019) 수록기간 년 2015~2020”. This will open a new tab. In this new tab, select the “행정구역별(읍면동)” tab and uncheck “1레벨 전체선택” so that only “서울특별시” is checked. Then check “2레벨 전체선택”. After that, select the “연령별” tab, uncheck “1레벨 전체선택”, and check only “합계”. Click the “통계표조회” icon to download the data in “EXCEL(xlsx)” format with “셀병합” unchecked. Open the downloaded file in Microsoft Excel and save it in CSV format. Now read the data into R using the tidyverse function read_csv(). This data set is not as clean as the nycflights13 data set used in class: there are two header lines that are peripheral to the core information, and the numerical values are expressed as strings with commas, e.g., "9,631,482" instead of 9631482. Also, for some reason it contains districts of other cities. Use the help command ?read_csv to learn about the function, and design an R expression that gives you a tibble called seoul2020 with 25 rows and 10 columns.
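As a rough illustration of the read_csv() arguments that deal with extra header lines, here is a minimal sketch on inline text rather than the downloaded file. The contents of toy_csv and the column names are hypothetical, and passing literal data wrapped in I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical file contents: two header lines, then quoted numbers with commas.
toy_csv <- I('Region,Total\nunit,persons\nA,"1,234"\nB,"5,678"\n')

# skip drops the leading lines, col_names supplies our own column names,
# and col_types reads every column as character so that nothing is guessed.
toy <- read_csv(
  toy_csv,
  skip = 2,
  col_names = c("region", "total"),
  col_types = cols(.default = col_character())
)
toy
```

The real file will need different values for skip and col_names, plus a way to keep only the Seoul rows.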
The resulting tibble is not tidy. Tidy seoul2020 to create a new tibble named seoul2020tidy. Explain your reasoning.
The major data cleansing task is to get rid of the commas in the numerical values. Study the “Parsing a vector” section of Lecture 5 and convert all the columns where numerical values are expressed as strings with commas into numeric vectors.
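One way to parse such strings is readr::parse_number(), which ignores grouping marks. The sketch below applies it to a vector and to one column of a small data frame; toy and its values are hypothetical.

```r
library(readr)

# Strings with commas as grouping marks become plain numbers.
x <- c("9,631,482", "123,456", "789")
parse_number(x)

# The same idea applied to one column of a hypothetical data frame.
toy <- data.frame(district = c("A", "B"), total = c("1,234", "5,678"))
toy$total <- parse_number(toy$total)
str(toy)
```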
Finally, data analysis. Take the column corresponding to the total population of each district from the final data frame (this should be the second column if the previous steps were done correctly) and store it as a vector in pops. Plot the histogram of pops, and compare it with the histogram of a normal random vector of the same length, with the same mean and variance.
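The comparison can be set up along the following lines. This is only a sketch: toy_pops is a made-up, right-skewed stand-in for pops, and the normal vector is drawn from a distribution whose mean and standard deviation match those of toy_pops.

```r
set.seed(1)

# Hypothetical stand-in for pops: 25 made-up, right-skewed values.
toy_pops <- round(rexp(25, rate = 1 / 400000))

# Normal draws of the same length, from a distribution with matching
# mean and standard deviation (the draws themselves will differ slightly).
toy_norm <- rnorm(length(toy_pops), mean = mean(toy_pops), sd = sd(toy_pops))

par(mfrow = c(1, 2))  # two panels side by side
hist(toy_pops, main = "toy_pops")
hist(toy_norm, main = "normal, same mean and sd")
```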
Considering the numbers in the finite vector pops as the population distribution, draw a sample of size 10 with replacement from this population. Is the mean of this sample the same as the mean of the population? Draw another sample of size 10 with replacement. Is the sample mean the same as before? Why or why not?
Using the function replicate(), take the mean of the sample means. Increase the number of replications from 100 to 1000, and to 10000, and compare the means to the population mean. What do you observe? What is the phenomenon that you observe called?
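The mechanics of sample() and replicate() can be sketched as follows. Here toy_pops is again a hypothetical stand-in for pops, and mean_of_means() is a helper introduced only for this illustration.

```r
set.seed(2)
toy_pops <- round(rexp(25, rate = 1 / 400000))  # hypothetical population

# One sample of size 10, drawn with replacement, and its mean.
one_mean <- mean(sample(toy_pops, size = 10, replace = TRUE))

# Mean of `reps` sample means, each computed from a sample of size n.
mean_of_means <- function(reps, n = 10) {
  mean(replicate(reps, mean(sample(toy_pops, size = n, replace = TRUE))))
}

res <- sapply(c(100, 1000, 10000), mean_of_means)
res
mean(toy_pops)  # population mean, for comparison
```

Changing the default n in mean_of_means() gives the same experiment for other sample sizes.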
Repeat part (e) with the sample size increased from 10 to 25, and then to 100. What can you say about the numbers?
Again using replicate(), plot the histogram of 10000 sample means for sample sizes 10, 25, and 100. Do the histograms look like hist(pops), or something else? What is this phenomenon called?
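A possible layout for these histograms, again on a hypothetical toy_pops rather than the real pops, with sample_means() as a helper introduced for this sketch:

```r
set.seed(3)
toy_pops <- round(rexp(25, rate = 1 / 400000))  # hypothetical population

# Vector of `reps` sample means, each from a sample of size n with replacement.
sample_means <- function(n, reps = 10000) {
  replicate(reps, mean(sample(toy_pops, size = n, replace = TRUE)))
}

par(mfrow = c(1, 3))  # one panel per sample size
for (n in c(10, 25, 100)) {
  hist(sample_means(n), main = paste("n =", n), xlab = "sample mean")
}
```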
Now suppose you are given a new data set of size 10 whose mean is 204885. Do you think this data set is taken from Seoul’s population? Justify your answer by comparing the new sample mean to the distribution of sample means considered in parts (d) – (i). (Hint. quantile(). Use 2.5% and 97.5% quantiles.)
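The hint can be read as follows: compute the 2.5% and 97.5% quantiles of the simulated sample means and check whether the new mean falls between them. In this sketch toy_pops and new_mean are hypothetical values, not the data or the number from the question.

```r
set.seed(4)
toy_pops <- round(rexp(25, rate = 1 / 400000))  # hypothetical population

# Simulated distribution of the mean of a size-10 sample.
means10 <- replicate(10000, mean(sample(toy_pops, size = 10, replace = TRUE)))

# Central 95% range of the simulated sample means.
bounds <- quantile(means10, probs = c(0.025, 0.975))
bounds

new_mean <- 300000  # hypothetical observed mean, for illustration only
inside <- new_mean >= bounds[1] && new_mean <= bounds[2]
inside
```

A mean outside this range would be unusual for a size-10 sample from the population, though not impossible.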