Our Apache YARN cluster at 35.221.222.125 hosts the flights data, which covers 123 million flights over 22 years.
You can connect to RStudio Server on the master node from the SNU network using the user ID and password handed out in class.
Read the lecture notes on how to access the YARN cluster. Connect to the database using sparklyr and answer the following questions. You can base your answers on a specific year or on the whole data set.
Map the top 10 busiest airports. The size of each dot should reflect the number of flights through that destination.
Hint: You may find this tutorial on Making Maps in R helpful.
Map the top 10 busiest direct routes. The size of each line should reflect the number of flights on that route.
Build a predictive model for the arrival delay (arrdelay) of flights flying from JFK. Use the same filtering criteria as in the lecture notes to construct training and validation sets. You are allowed to use a maximum of 5 predictors. The prediction performance of your model on the validation data set will be an important factor for grading this question.
Visualize and explain any other information you want to explore.
In the lecture notes, we used the function ml_linear_regression(), which takes advantage of the Spark library MLlib, for linear regression. Recall that we have 123 million observations and the standard lm() won't work for data of this size. In this question, we explore how big data analysis algorithms like ml_linear_regression() are implemented.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
sdf_len(sc, 1000) %>%
  spark_apply(function(df) runif(nrow(df), min = -1, max = 1)^2 + runif(nrow(df), min = -1, max = 1)^2 < 1) %>%
  filter(result == TRUE) %>% count() %>% collect() * 4 / 1000
The code above computes a Monte Carlo estimate of \(\pi\). Explain, in as much detail as possible, how it performs the computation.
Hint. This lecture note or paper (in Korean) may help.
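For comparison, here is a minimal sketch of the same Monte Carlo idea in base R, without Spark. It is not the distributed computation the question asks you to explain. It only shows the underlying estimator: points drawn uniformly on the square \([-1, 1]^2\) fall inside the unit disk with probability \(\pi/4\). The seed and the variable names (x, y, inside, pi_hat) are choices made for this illustration.

```r
# Base-R analogue of the Monte Carlo estimate (no Spark connection needed).
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 1000                         # same number of points as sdf_len(sc, 1000)
x <- runif(n, min = -1, max = 1)  # x-coordinates, uniform on [-1, 1]
y <- runif(n, min = -1, max = 1)  # y-coordinates, uniform on [-1, 1]
inside <- x^2 + y^2 < 1           # TRUE when the point lies inside the unit disk
pi_hat <- 4 * sum(inside) / n     # P(inside) = pi / 4, so multiply the share by 4
pi_hat
```

With only 1000 points the estimate is noisy; its standard error is roughly 0.05.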
Write R code that estimates the regression coefficients \(\hat{\beta}\) from the lecture notes' regression analysis by implementing stochastic gradient descent (SGD) in sparklyr without using the MLlib library (i.e., ml_linear_regression()).
You may find sdf_sample() and model.matrix() useful.
To handle UniqueCarrier properly, you may need to collect all the carriers in the entire dataset before the analysis.
You may want to develop and test your code on a local Spark instance (spark_connect(master = "local")) to save time and resources.
Added on 12/07/2018:
The cluster at 35.221.222.125 has limited computing capacity, so it is recommended to try this problem on your own cluster. If your GCP credit is limited, you are allowed to run the code on a local machine with a smaller dataset (part of the flights data). You will get extra credit if you use the full dataset.
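To make the SGD idea concrete on a local machine, the following is a minimal sketch of mini-batch SGD for least squares in base R on simulated data. It is not a sparklyr solution and does not use the flights data. The carrier codes, the predictor distance, the response delay, the batch size, and the step-size schedule are all hypothetical choices for illustration. Here sample.int() stands in for the role sdf_sample() would play on a Spark data frame. Fixing the factor levels up front plays the role of collecting all carriers beforehand, so that model.matrix() produces the same columns for every batch.

```r
# Mini-batch SGD for least squares on simulated data (base R only).
set.seed(2018)                                  # arbitrary seed
carriers <- c("AA", "DL", "UA")                 # hypothetical carrier codes
n <- 5000
dat <- data.frame(
  distance = runif(n, 0, 3),                    # hypothetical numeric predictor
  carrier  = factor(sample(carriers, n, replace = TRUE), levels = carriers)
)
beta_true <- c(1, 2, -1, 0.5)                   # intercept, distance, DL, UA
X_all <- model.matrix(~ distance + carrier, dat)
dat$delay <- as.vector(X_all %*% beta_true) + rnorm(n, sd = 0.5)

beta <- rep(0, ncol(X_all))                     # starting value
batch_size <- 50
for (t in 1:5000) {
  idx <- sample.int(n, batch_size)              # random mini-batch of rows
  batch <- dat[idx, ]
  # The factor keeps all its levels, so the dummy columns match in every batch.
  X <- model.matrix(~ distance + carrier, batch)
  resid <- as.vector(X %*% beta) - batch$delay
  grad <- as.vector(crossprod(X, resid)) / batch_size  # gradient of half the mean squared error
  step <- 0.1 / (1 + t / 500)                   # decaying step size (a tuning choice)
  beta <- beta - step * grad
}

# Compare with the exact least-squares fit, which is feasible at this small size.
beta_ols <- coef(lm(delay ~ distance + carrier, data = dat))
round(cbind(sgd = beta, ols = beta_ols), 2)
```

On the full dataset the exact lm() comparison is unavailable, and the step size and number of iterations would need retuning. Predictors on very different scales typically require standardization or a smaller step.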