The goal of this project is to access and analyze big data that cannot be loaded directly to R’s base memory. The data you are going to analyze consists of flight arrival and departure details for all commercial flights within the USA, from January 2001 to December 2018. This is a large dataset: there are nearly 120 million records in total, and takes up to 3 gigabytes of space compressed and 16 gigabytes when uncompressed.
Because of the overwhelming size of the data, we use a relational database (RDB) to efficiently manage the large dataset. In this project you are supposed to construct a database by using the data.table and RSQLite packages. Then you use dplyr to access the database and other R functions to analyze the data.
The project consists of three parts. In Part 1, you construct a SQLite database using the DBI and RSQLite packages. Part 2 provides basic questions. You must be able to solve these question using the knowledge learned in class. You will get full credit if you successfully complete parts 1 and 2. Part 3 contains extra questions. You are supposed to design your own analysis in a way that you think is reasonable, conduct the analysis, and interpret the outcomes. You will receive extra credits if you fulfill the request.
The following website http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236 provides flights data between 1987-2018 in the form of zip compressed csv file format.
The data comes originally from BTS where it is described in detail.
Variable descriptions are as follows:
1 Year 2001-2018
2 Month 1-12
3 DayofMonth 1-31
4 DayOfWeek 1 (Monday) - 7 (Sunday)
5 DepTime actual departure time (local, hhmm)
6 CRSDepTime scheduled departure time (local, hhmm)
7 ArrTime actual arrival time (local, hhmm)
8 CRSArrTime scheduled arrival time (local, hhmm)
9 UniqueCarrier unique carrier code
10 FlightNum flight number
11 TailNum plane tail number
12 ActualElapsedTime in minutes
13 CRSElapsedTime in minutes
14 AirTime in minutes
15 ArrDelay arrival delay, in minutes
16 DepDelay departure delay, in minutes
17 Origin origin IATA airport code
18 Dest destination IATA airport code
19 Distance in miles
20 TaxiIn taxi in time, in minutes
21 TaxiOut taxi out time in minutes
22 Cancelled was the flight cancelled?
23 CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
24 Diverted 1 = yes, 0 = no
25 CarrierDelay in minutes
26 WeatherDelay in minutes
27 NASDelay in minutes
28 SecurityDelay in minutes
29 LateAircraftDelay in minutes
airlines.csv from http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERSairports.csv from https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.datairplanes.csv from hereCancelationCode is provided from 2003.In this part of the project we construct a relational database using the RSQLite package, an R interface to the SQLite database management system. Recall Chapter 13 of the textbook where we study relational data, in which multiple tables are connected via keys. A (relational) database management system is software that provide fast and efficient access to large sets of tables. This system use a special language called structured query language (SQL) to query and modify the tables. Because the data you analyze is big (there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed.), R cannot load the dataset directly into the memory. Using a database is a must. Fortunately, you do not need to learn SQL for this project, as dplyr supports a variety of databases as a backend and translated dplyr verbs to SQL queries. However, keep in mind that you need a large disk space (approximately 50 GB).
DBI and RSQLite packages. SQLite is contained in RSQLite, so you do not need to install it.library("DBI")
library("RSQLite")
con <- dbConnect(RSQLite::SQLite(), "employee.sqlite")
str(con)
library("tidyverse")
employees <- tibble(name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales"))
phone <- tibble(name = c("Bob", "Carol", "Eve", "Frank"),
phone = c("010-555-1111", "010-555-2222", "010-555-3333", "010-555-4444"))
dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
dbDisconnect(con)
dplyr to query the database. Of course, you need to reconnect to the database:recon <- dbConnect(RSQLite::SQLite(), "employee.sqlite")
emp <- dplyr::tbl(recon, "employees")
str(emp)
ph <- dplyr::tbl(recon, "phone")
str(ph)
emp and ph as if they are just tibbles:addr <- emp %>%
select(name, salary) %>%
arrange(salary)
addr
left_join(emp,ph)
res <- dbSendQuery(recon, "SELECT * FROM employees")
dbFetch(res)
dbClearResult(res)
dbDisconnect(recon)
Now the task: provide all the outputs from the above R scripts.
(40 pts) We start with small tables. Download files airports.csv, airlines.csv, and airplanes.csv as described above. Then write an R code snippet that creates a SQLite database containing these three tables in the same names as the filenames (without the extension).
(40 pts) Download the main data files using the provide R script from the BTS website above.
Do NOT uncompress the files. readr::read_csv() supports reading zip-compressed csv files directly. Using a for loop, write an R code snippet that creates a database table flights that contain all the data from 2001 through 2018. (Hint: study append option in dbWriteTable().)
Caution: Indexing takes a lot of hard disk space. If your disk space is limited, then you may skip this step. However, you should nevertheless write a working code and submit.
flightdata.R file so that source()ing this file will create the database at once.We first start by exploring the airports table. Using dplyr::filter(), find out which airports the codes “SNA”, “SJC”, and “SMF” belong to.
Aggregate the counts of flights to all three of these airports at the monthly level (in the flights table) into a new data frame airportcounts. You may find dplyr functions group_by(), summarise(), collect(), and the pipe operator %>% useful.
Add a new column to airportcounts by constructing a Date variable from the variables Year and Month (using helper functions from the lubridate package). Sort the rows in ascending order of dates.
From airportcounts, generate a time series plot that plots the number of flights per month in each of the three airports in chronological order.
Find the top ten months (like 2001-09) with the largest number of flights for each of the three airports.
Which airline was most reliable flying from Chicago O’Hare (ORD) to Minneapolis/St. Paul (MSP) in Year 2015?
Create a data frame delays that contains the average arrival delay for each day in 2015 for four airlines: United (UA), Delta (DL), American (AA), and Eagle (MQ). Your data frame must contain only necessary variables, to save the memory space.
Compare the average delay of the four airlines by generating density plots comparing them in a single panel. In doing this, use a join function to provide the full names of the airlines in the legend of the plot. Which airline is the most reliable? which is the least?
Plot the count of all flights on Mondays in the year 2001. Explain the pattern you find in the visualization.
Repeat 1. for the year 2011.
Provide a graphical summary to answer the following questions. These are intentionally vague in order to allow you to focus on different aspects of the data. We only provide some suggestions.
Suggestions: cancelled flights do not possess delays. You may want to ignore negative delays.
Suggestions: you may want to find a correlation between the age of the plane and the departure delay. If the data size is too big to compute the correlation, you may try to sample a fraction from the dataset.