326.621A Homework 4

Q1. Big data analysis

Our Apache Yarn cluster (35.221.222.125) hosts the flights data, representing 123 million flights over 22 years.

You can connect to the RStudio Server on the master node from the SNU network using the user ID and password handed out in class.

Read the lecture notes on how to access the Yarn cluster. Connect to the database using sparklyr and answer the following questions. You can base your answers on a specific year or on the whole data set.

- Map the top 10 busiest airports. The size of each dot should reflect the number of flights through that airport.
  Hint: You may find this tutorial on Making Maps in R helpful.
- Map the top 10 busiest direct routes. The thickness of each line should reflect the number of flights on that route.
- Build a predictive model for the arrival delay (arrdelay) of flights flying from JFK. Use the same filtering criteria as in the lecture notes to construct the training and validation sets. You may use at most 5 predictors. The prediction performance of your model on the validation data set will be an important factor in grading this question.
- Visualize and explain any other information you want to explore.

Q2. Big data algorithm

In the lecture notes, we used the function ml_linear_regression(), which takes advantage of the Spark library MLlib, to fit a linear regression. Recall that we have 123 million observations, and the standard lm() won’t work for data of this size. In this question, we explore how big data analysis algorithms like ml_linear_regression() are implemented.

The following code

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf_len(sc, 1000) %>%
  spark_apply(function(df) runif(nrow(df), min = -1, max = 1)^2 + runif(nrow(df), min = -1, max = 1)^2 < 1) %>%
  filter(result == TRUE) %>% count() %>% collect() * 4 / 1000

computes a Monte Carlo estimate of \(\pi\). Explain, in as much detail as possible, how the above code does the computation.

Hint. This lecture note or paper (in Korean) may help.
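As a point of comparison, the arithmetic behind the estimate can be tried in base R without Spark. This is only a minimal sketch of the sampling idea, not an explanation of how Spark distributes the work; the variable names and the seed are assumptions made for illustration.

```r
# Base-R sketch (no Spark): sample points uniformly in the square [-1, 1]^2
# and count the fraction that falls inside the unit disc.
set.seed(2018)                      # hypothetical seed, for reproducibility only
n <- 1000                           # same sample size as sdf_len(sc, 1000)
x <- runif(n, min = -1, max = 1)
y <- runif(n, min = -1, max = 1)
inside <- x^2 + y^2 < 1             # TRUE when the point lies in the disc
pi_hat <- 4 * mean(inside)          # area ratio is pi / 4, so rescale by 4
print(pi_hat)
```

With only 1000 draws the estimate typically differs from \(\pi\) in the second decimal place; increasing n reduces the error at the usual \(1/\sqrt{n}\) Monte Carlo rate.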

Our goal is to reproduce the linear regression analysis in the lecture notes by stochastic gradient descent (SGD). SGD is an approximate optimization method for minimizing a function of the form \(\frac{1}{n}\sum_{j=1}^n f_j(\beta)\) by iterating \[ \beta^{(k)} = \beta^{(k-1)} - \gamma_k \nabla f_i(\beta^{(k-1)}), \] where \(i\) is an index sampled uniformly from \(1,2,\dotsc,n\) at each iteration. The quantity \(\gamma_k\) is called the step size. Note that \(\mathbf{E}[\nabla f_i(\beta)]=\frac{1}{n}\sum_{j=1}^n \nabla f_j(\beta)\), so the stochastic gradient is an unbiased estimate of the full gradient. In linear regression, \(f_i(\beta)=(1/2)(y_i-x_i^T\beta)^2\).
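For reference, differentiating the squared-error term \(f_i\) with respect to \(\beta\) gives the per-observation gradient, and substituting it into the SGD iteration yields the update for linear regression. This follows directly from the definitions above and is not quoted from the lecture notes:

\[ \nabla f_i(\beta) = -(y_i - x_i^T\beta)\,x_i, \qquad \beta^{(k)} = \beta^{(k-1)} + \gamma_k\,\bigl(y_i - x_i^T\beta^{(k-1)}\bigr)\,x_i. \]

Each step therefore moves \(\beta\) along \(x_i\) in proportion to the current residual of the sampled observation.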

Write R code that estimates the regression coefficients \(\hat{\beta}\) from the lecture notes’ regression analysis by implementing SGD in sparklyr, without using the MLlib library (i.e., without ml_linear_regression()).

Hints:
- You may find functions sdf_sample() and model.matrix() useful.
- To deal with the categorical variable UniqueCarrier properly, you may need to collect all the carriers in the entire dataset before the analysis.
- Start from a smaller subset on a local machine (spark_connect(master = "local")) to save time and resources.
- Typical choices of the step size \(\gamma_k\) are \(\gamma_k=\gamma_0\) (constant step size), \(\gamma_k=\gamma_0/k\), and \(\gamma_k = \gamma_0/\sqrt{k}\) (diminishing step sizes). SGD is known to be sensitive to the step size.
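To make the hints concrete, here is a minimal base-R sketch of SGD for least squares on simulated data. It is not a sparklyr solution and does not use the flights data: the function sgd_lm(), its arguments, the toy carrier codes, and the simulated coefficients are all hypothetical and chosen only for illustration. It shows the three step-size schedules and how fixing the factor levels in advance keeps the columns produced by model.matrix() stable.

```r
# Minimal base-R sketch of SGD for least squares on simulated data.
# All names (sgd_lm, gamma0, n_iter, schedule) are hypothetical.
sgd_lm <- function(X, y, gamma0 = 0.1, n_iter = 100000,
                   schedule = c("sqrt", "constant", "inverse")) {
  schedule <- match.arg(schedule)
  beta <- rep(0, ncol(X))
  n <- nrow(X)
  for (k in seq_len(n_iter)) {
    i <- sample.int(n, 1)                      # uniformly sampled index
    gamma_k <- switch(schedule,
                      constant = gamma0,
                      inverse  = gamma0 / k,
                      sqrt     = gamma0 / sqrt(k))
    resid <- y[i] - sum(X[i, ] * beta)         # y_i - x_i^T beta
    beta <- beta + gamma_k * resid * X[i, ]    # beta - gamma_k * grad f_i(beta)
  }
  beta
}

set.seed(1)
n <- 5000
carriers <- c("AA", "DL", "UA")                # toy levels, fixed up front
d <- data.frame(
  distance = rnorm(n),                         # standardized toy predictor
  carrier  = factor(sample(carriers, n, replace = TRUE), levels = carriers)
)
X <- model.matrix(~ distance + carrier, data = d)
beta_true <- c(1, 2, -1, 0.5)                  # assumed coefficients for the simulation
y <- drop(X %*% beta_true) + rnorm(n, sd = 0.5)

beta_sgd <- sgd_lm(X, y)                       # default: gamma_k = gamma0 / sqrt(k)
beta_ols <- coef(lm(y ~ distance + carrier, data = d))
print(round(cbind(sgd = beta_sgd, ols = beta_ols), 2))
```

The sketch standardizes its single numeric predictor; with unscaled predictors, the same step size may diverge, which is one way the sensitivity mentioned above shows up in practice.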

Added on 12/07/2018

The class cluster (35.221.222.125) has limited computing capacity, so it is recommended that you try this problem on your own cluster. If your GCP credit is limited, you may run the code on a local machine with a smaller dataset (a part of the flights data). You will get extra credit if you use the full dataset.

Due December 16 @ 11:59PM (originally December 9)