326_212-2019fall

Final Project

Due date: Dec 11, 2018 @ 11:59pm

Introduction

The dataset

1    Year   2001-2018
2    Month  1-12
3    DayofMonth 1-31
4    DayOfWeek  1 (Monday) - 7 (Sunday)
5    DepTime    actual departure time (local, hhmm)
6    CRSDepTime scheduled departure time (local, hhmm)
7    ArrTime    actual arrival time (local, hhmm)
8    CRSArrTime scheduled arrival time (local, hhmm)
9    UniqueCarrier  unique carrier code
10   FlightNum  flight number
11   TailNum    plane tail number
12   ActualElapsedTime  in minutes
13   CRSElapsedTime in minutes
14   AirTime    in minutes
15   ArrDelay   arrival delay, in minutes
16   DepDelay   departure delay, in minutes
17   Origin origin IATA airport code
18   Dest   destination IATA airport code
19   Distance   in miles
20   TaxiIn taxi in time, in minutes
21   TaxiOut    taxi out time in minutes
22   Cancelled  was the flight cancelled?
23   CancellationCode   reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24   Diverted   1 = yes, 0 = no
25   CarrierDelay   in minutes
26   WeatherDelay   in minutes
27   NASDelay   in minutes
28   SecurityDelay  in minutes
29   LateAircraftDelay  in minutes

Part 1. Constructing database (150 pts)

In this part of the project we construct a relational database using the RSQLite package, an R interface to the SQLite database management system. Recall Chapter 13 of the textbook where we study relational data, in which multiple tables are connected via keys. A (relational) database management system is software that provide fast and efficient access to large sets of tables. This system use a special language called structured query language (SQL) to query and modify the tables. Because the data you analyze is big (there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.), R cannot load the dataset directly into the memory. Using a database is a must. Fortunately, you do not need to learn SQL for this project, as dplyr supports a variety of databases as a backend and translated dplyr verbs to SQL queries. However, keep in mind that you need a large disk space (approximately 50 GB).

  1. (20 pts) As an exercise, we create a simple SQLite database and query. Directions are as follows:
    • Install DBI and RSQLite packages. SQLite is contained in RSQLite, so you do not need to install it.
    • Create an empty SQLite database:
    library("DBI")
    library("RSQLite")
    con <- dbConnect(RSQLite::SQLite(), "employee.sqlite")
    str(con)
    
    • Create two data frames:
    library("tidyverse")
    employees <- tibble(name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting","Sales",
                                "Accounting","Sales","Sales"))
    phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("010-555-1111", "010-555-2222", "010-555-3333", "010-555-4444"))
    
    • Write the data frames to the database:
    dbWriteTable(con, "employees", employees, overwrite = TRUE)
    dbWriteTable(con, "phone", phone, overwrite = TRUE)
    
    • You can list the database tables:
    dbListTables(con)
    
    • After creating a database, you may disconnect the connection:
    dbDisconnect(con)
    
    • Now use dplyr to query the database. Of course, you need to reconnect to the database:
    recon <- dbConnect(RSQLite::SQLite(), "employee.sqlite")
    emp <- dplyr::tbl(recon, "employees")
    str(emp)
    ph <- dplyr::tbl(recon, "phone")
    str(ph)
    
    • You can treat emp and ph as if they are just tibbles:
    addr <- emp %>%
        select(name, salary) %>%
        arrange(salary)
    addr
    left_join(emp,ph)
    
    • You can do the similar task by directly feeding SQL, the language for database, through DBI:
    res <- dbSendQuery(recon, "SELECT * FROM employees")
    dbFetch(res)
    dbClearResult(res)
    
    • Don’t forget to disconnect after you finish your job:
    dbDisconnect(recon)
    

    Now the task: provide all the outputs from the above R scripts.

  2. (40 pts) We start with small tables. Download files airports.csv, airlines.csv, and airplanes.csv as described above. Then write an R code snippet that creates a SQLite database containing these three tables in the same names as the filenames (without the extension).

  3. (40 pts) Download the main data files using the provide R script from the BTS website above. Do NOT uncompress the files. readr::read_csv() supports reading zip-compressed csv files directly. Using a for loop, write an R code snippet that creates a database table flights that contain all the data from 2001 through 2018. (Hint: study append option in dbWriteTable().)

  4. (40 pts) The created database is more than 10 GB, and accessing the data is this large database takes long time even with a database management system. To speed up access to the data, you need to add indices. In this web page, read the section “Adding indices,” and write an R code snippet that adds indices to the database you’ve just created. (Hint. most of the codes in the Data Expo webpage are SQL queries.)

Caution: Indexing takes a lot of hard disk space. If your disk space is limited, then you may skip this step. However, you should nevertheless write a working code and submit.

  1. (10 pts) Put all the R code snippets you wrote into flightdata.R file so that source()ing this file will create the database at once.

Part 2. Basic Questions (150 pts)

Q1. Monthly traffic in three airports (60 pts)

  1. We first start by exploring the airports table. Using dplyr::filter(), find out which airports the codes “SNA”, “SJC”, and “SMF” belong to.

  2. Aggregate the counts of flights to all three of these airports at the monthly level (in the flights table) into a new data frame airportcounts. You may find dplyr functions group_by(), summarise(), collect(), and the pipe operator %>% useful.

  3. Add a new column to airportcounts by constructing a Date variable from the variables Year and Month (using helper functions from the lubridate package). Sort the rows in ascending order of dates.

  4. From airportcounts, generate a time series plot that plots the number of flights per month in each of the three airports in chronological order.

  5. Find the top ten months (like 2001-09) with the largest number of flights for each of the three airports.

Q2. Finding reliable airlines (60 pts)

Which airline was most reliable flying from Chicago O’Hare (ORD) to Minneapolis/St. Paul (MSP) in Year 2015?

  1. Create a data frame delays that contains the average arrival delay for each day in 2015 for four airlines: United (UA), Delta (DL), American (AA), and Eagle (MQ). Your data frame must contain only necessary variables, to save the memory space.

  2. Compare the average delay of the four airlines by generating density plots comparing them in a single panel. In doing this, use a join function to provide the full names of the airlines in the legend of the plot. Which airline is the most reliable? which is the least?

Q3. All flights (30 pts)

  1. Plot the count of all flights on Mondays in the year 2001. Explain the pattern you find in the visualization.

  2. Repeat 1. for the year 2011.

Part 3. Advanced Questions (100 pts)

Provide a graphical summary to answer the following questions. These are intentionally vague in order to allow you to focus on different aspects of the data. We only provide some suggestions.

Q1. When is the best time of day/day of week/time of year to fly to minimise delays? (50 pts)

Suggestions: cancelled flights do not possess delays. You may want to ignore negative delays.

Q2. Do older planes suffer more delays? (50 pts)

Suggestions: you may want to find a correlation between the age of the plane and the departure delay. If the data size is too big to compute the correlation, you may try to sample a fraction from the dataset.