Q1. Big data analysis

Our Apache YARN cluster

35.221.222.125

hosts the flights data representing 123 million flights over 22 years.

You can connect to the RStudio Server on the master node from the SNU network, using the user id and password handed out in class.

Read the lecture notes on how to access the YARN cluster. Connect to the database using sparklyr and answer the following questions. You may base your answers on a specific year or on the whole data set.

  1. Map the top 10 busiest airports. The size of each dot should reflect the number of flights through that airport.
    Hint: You may find this tutorial on Making Maps in R helpful.

  2. Map the top 10 busiest direct routes. The width of each line should reflect the number of flights on that route.

  3. Build a predictive model for the arrival delay (arrdelay) of flights flying from JFK. Use the same filtering criteria as in the lecture notes to construct training and validation sets. You are allowed to use a maximum of 5 predictors. The prediction performance of your model on the validation data set will be an important factor for grading this question.

  4. Visualize and explain any other information you want to explore.
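As a starting point for item 1, the aggregation step can be sketched in sparklyr as below. This is a sketch under assumptions: the connection mode (`yarn-client`), the table name (`flights`), and the destination column name (`dest`) are placeholders; substitute the actual names used on the cluster and in the lecture notes.

```r
library(sparklyr)
library(dplyr)

# Assumption: connecting from the master node; adjust to the setup in the lecture notes
sc <- spark_connect(master = "yarn-client")

# "flights" and "dest" are assumed names; replace with the actual table/column names
busiest <- tbl(sc, "flights") %>%
  count(dest) %>%        # number of flights per destination airport
  arrange(desc(n)) %>%   # busiest first
  head(10) %>%           # top 10 busiest destinations
  collect()              # bring the small 10-row result into R for mapping
```

The collected 10-row data frame can then be joined with airport coordinates and plotted, mapping the dot size to `n`.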

Q2. Big data algorithm

In the lecture notes, we used the function ml_linear_regression(), which takes advantage of the Spark MLlib library for linear regression. Recall that we have 123 million observations, and the standard lm() won't work on data of this size. In this question, we explore how big data analysis algorithms like ml_linear_regression() are implemented.

  1. The following code

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf_len(sc, 1000) %>%
  # for each row, draw a point (u, v) uniformly on [-1, 1]^2;
  # TRUE if the point falls inside the unit circle
  spark_apply(function(df) runif(nrow(df), min = -1, max = 1)^2 + runif(nrow(df), min = -1, max = 1)^2 < 1) %>%
  filter(result == TRUE) %>%        # keep only points inside the circle
  count() %>% collect() * 4 / 1000  # (points inside / total points) * 4

computes a Monte Carlo estimate of \(\pi\). Explain, in as much detail as possible, how the above code performs the computation.

Hint. This lecture note or paper (in Korean) may help.
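For intuition, the same estimator can be written in base R without Spark; this is simply the serial analogue of the distributed computation in the code above (the sample size of 1000 matches it, the seed is arbitrary):

```r
set.seed(1)
n <- 1000
u <- runif(n, min = -1, max = 1)  # x-coordinates of random points in [-1, 1]^2
v <- runif(n, min = -1, max = 1)  # y-coordinates
inside <- u^2 + v^2 < 1           # TRUE if the point lies inside the unit circle
pi_hat <- 4 * sum(inside) / n     # the inside fraction estimates pi/4; scale by 4
pi_hat
```

The fraction of points inside the circle converges to the area ratio \(\pi/4\), so multiplying by 4 yields an estimate of \(\pi\).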

  2. Our goal is to reproduce the linear regression analysis in the lecture notes by stochastic gradient descent (SGD). SGD is an approximate optimization method for minimizing a function of the form \(\frac{1}{n}\sum_{j=1}^n f_j(\beta)\) via the iteration \[ \beta^{(k)} = \beta^{(k-1)} - \gamma_k \nabla f_i(\beta^{(k-1)}), \] where \(i\) is a uniformly sampled index from \(1,2,\dotsc,n\). The quantity \(\gamma_k\) is called the step size. Note that \(\mathbf{E}[\nabla f_i(\beta)]=\frac{1}{n}\sum_{j=1}^n \nabla f_j(\beta)\), so each stochastic gradient is an unbiased estimate of the full gradient. In linear regression, \(f_i(\beta)=(1/2)(y_i-x_i^T\beta)^2\), whose gradient is \(\nabla f_i(\beta) = -(y_i - x_i^T\beta)\,x_i\).

Write R code that estimates the regression coefficients \(\hat{\beta}\) from the lecture notes' regression analysis by implementing SGD in sparklyr, without using the MLlib library (i.e., ml_linear_regression()).
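To illustrate the update rule itself (this is not the sparklyr implementation asked for above), here is a minimal in-memory SGD sketch on simulated data. The sample size, the true coefficients, the number of iterations, and the step-size schedule are all illustrative choices:

```r
set.seed(2)
n <- 10000; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))  # design matrix with intercept
beta_true <- c(1, -2, 0.5)                           # illustrative true coefficients
y <- drop(X %*% beta_true) + rnorm(n)

beta <- rep(0, p)                        # initial iterate
for (k in 1:50000) {
  i <- sample.int(n, 1)                  # uniformly sampled index
  gamma <- 1 / (100 + k)                 # decreasing step size (illustrative schedule)
  resid <- y[i] - sum(X[i, ] * beta)     # y_i - x_i^T beta
  beta <- beta + gamma * resid * X[i, ]  # beta - gamma * grad f_i(beta)
}
beta  # should be close to beta_true
```

Since \(\nabla f_i(\beta) = -(y_i - x_i^T\beta)x_i\), the update adds \(\gamma_k\) times the residual-weighted row \(x_i\). The sparklyr version would replace the single-row sampling with a distributed sampling/aggregation step over the Spark data frame.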

Added on 12/07/2018