Most material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.
Small data: those can fit into computer memory.
Bigish data: those can fit into disk(s) of a single machine.
Big data: those cannot fit into disk(s) of a single machine.
Key to high performance is effective use of memory hierarchy. True on all architectures.
Operation | Time |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Read 1 MB sequentially from SSD | 1,000,000 ns |
Read 1 MB sequentially from disk | 20,000,000 ns |
Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.
If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds
If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds
If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds
This is just for reading data, if we make any modifications (writing) things are much worse.
Cost: Disk << Memory
Speed: Disk <<< Memory
So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?
Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.
Even with blocks, any kind of subsetting of rows requires a linear search, which requires \(O(N)\) accesses where \(N\) is the number of blocks.
We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.
Sorting is expensive, \(O(N \log N)\), but it only needs to be done once.
After sorting, we can use a binary search for any subsetting tasks \(O(\log N)\).
These sorted columns are known as indexes.
Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.
Structured Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.
ANSI Standard but with some dialect divergence.
SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.
We will see this mapping in more detail in a bit.
DBI package provides a common interface for connecting to databases.
dbplyr package is the backend that translates dplyr verbs to database SQL queries.
To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.
Create a SQLite database in memory for learning purpose.
library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
#con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## ..@ ptr :<externalptr>
## ..@ dbname : chr "/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/employee.sqlite"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 70
## ..@ vfs : chr ""
## ..@ ref :<environment: 0x7fbc2b066cf8>
## ..@ bigint : chr "integer64"
First table:
library("tidyverse")
(employees <- tibble(name = c("Alice","Bob","Carol","Dave","Eve","Frank"),
email = c("alice@company.com", "bob@company.com",
"carol@company.com", "dave@company.com",
"eve@company.com", "frank@comany.com"),
salary = c(52000, 40000, 30000, 33000, 44000, 37000),
dept = c("Accounting", "Accounting","Sales",
"Accounting","Sales","Sales")))
## # A tibble: 6 x 4
## name email salary dept
## <chr> <chr> <dbl> <chr>
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
Second table:
(phone <- tibble(name = c("Bob", "Carol", "Eve", "Frank"),
phone = c("010- 555-1111", "010- 555-2222", "010- 555-3333", "010- 555-4444")))
## # A tibble: 4 x 2
## name phone
## <chr> <chr>
## 1 Bob 010- 555-1111
## 2 Carol 010- 555-2222
## 3 Eve 010- 555-3333
## 4 Frank 010- 555-4444
Write tables to the database:
dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
## [1] "employees" "phone"
dbWriteTable(con, "employs", employees)
dbListTables(con)
## [1] "employees" "employs" "phone"
dbRemoveTable(con, "employs")
dbListTables(con)
## [1] "employees" "phone"
res <- dbSendQuery(con, "SELECT * FROM employees")
dbFetch(res)
## name email salary dept
## 1 Alice alice@company.com 52000 Accounting
## 2 Bob bob@company.com 40000 Accounting
## 3 Carol carol@company.com 30000 Sales
## 4 Dave dave@company.com 33000 Accounting
## 5 Eve eve@company.com 44000 Sales
## 6 Frank frank@comany.com 37000 Sales
dbClearResult(res)
dbDisconnect(con)
Following we demonstrate some common SQL commands, although all task can be achieved by uisng dlyr as well.
con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
## [1] "employees" "phone"
knitr::opts_chunk$set(connection = "con")
SELECT * FROM employees;
name | salary | dept | |
---|---|---|---|
Alice | alice@company.com | 52000 | Accounting |
Bob | bob@company.com | 40000 | Accounting |
Carol | carol@company.com | 30000 | Sales |
Dave | dave@company.com | 33000 | Accounting |
Eve | eve@company.com | 44000 | Sales |
Frank | frank@comany.com | 37000 | Sales |
SELECT * FROM phone;
name | phone |
---|---|
Bob | 010- 555-1111 |
Carol | 010- 555-2222 |
Eve | 010- 555-3333 |
Frank | 010- 555-4444 |
SELECT name AS first_name, salary FROM employees;
first_name | salary |
---|---|
Alice | 52000 |
Bob | 40000 |
Carol | 30000 |
Dave | 33000 |
Eve | 44000 |
Frank | 37000 |
SELECT name AS first_name, salary FROM employees ORDER BY salary;
first_name | salary |
---|---|
Carol | 30000 |
Dave | 33000 |
Frank | 37000 |
Bob | 40000 |
Eve | 44000 |
Alice | 52000 |
Descending order:
SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
first_name | salary |
---|---|
Alice | 52000 |
Eve | 44000 |
Bob | 40000 |
Frank | 37000 |
Dave | 33000 |
Carol | 30000 |
SELECT * FROM employees WHERE salary < 40000
name | salary | dept | |
---|---|---|---|
Carol | carol@company.com | 30000 | Sales |
Dave | dave@company.com | 33000 | Accounting |
Frank | frank@comany.com | 37000 | Sales |
SELECT * FROM employees GROUP BY dept;
name | salary | dept | |
---|---|---|---|
Dave | dave@company.com | 33000 | Accounting |
Frank | frank@comany.com | 37000 | Sales |
SELECT * FROM employees LIMIT 3;
SELECT * FROM employees ORDER BY name DESC LIMIT 3;
name | salary | dept | |
---|---|---|---|
Alice | alice@company.com | 52000 | Accounting |
Bob | bob@company.com | 40000 | Accounting |
Carol | carol@company.com | 30000 | Sales |
By default SQLite uses a CROSS JOIN
which is not terribly useful
SELECT * FROM employees JOIN phone;
name | salary | dept | name..5 | phone | |
---|---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | Bob | 010- 555-1111 |
Alice | alice@company.com | 52000 | Accounting | Carol | 010- 555-2222 |
Alice | alice@company.com | 52000 | Accounting | Eve | 010- 555-3333 |
Alice | alice@company.com | 52000 | Accounting | Frank | 010- 555-4444 |
Bob | bob@company.com | 40000 | Accounting | Bob | 010- 555-1111 |
Bob | bob@company.com | 40000 | Accounting | Carol | 010- 555-2222 |
Bob | bob@company.com | 40000 | Accounting | Eve | 010- 555-3333 |
Bob | bob@company.com | 40000 | Accounting | Frank | 010- 555-4444 |
Carol | carol@company.com | 30000 | Sales | Bob | 010- 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 010- 555-2222 |
SELECT * FROM employees NATURAL JOIN phone;
name | salary | dept | phone | |
---|---|---|---|---|
Bob | bob@company.com | 40000 | Accounting | 010- 555-1111 |
Carol | carol@company.com | 30000 | Sales | 010- 555-2222 |
Eve | eve@company.com | 44000 | Sales | 010- 555-3333 |
Frank | frank@comany.com | 37000 | Sales | 010- 555-4444 |
SELECT * FROM employees JOIN phone ON employees.name = phone.name;
name | salary | dept | name..5 | phone | |
---|---|---|---|---|---|
Bob | bob@company.com | 40000 | Accounting | Bob | 010- 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 010- 555-2222 |
Eve | eve@company.com | 44000 | Sales | Eve | 010- 555-3333 |
Frank | frank@comany.com | 37000 | Sales | Frank | 010- 555-4444 |
SELECT * FROM employees NATURAL LEFT JOIN phone;
name | salary | dept | phone | |
---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | NA |
Bob | bob@company.com | 40000 | Accounting | 010- 555-1111 |
Carol | carol@company.com | 30000 | Sales | 010- 555-2222 |
Dave | dave@company.com | 33000 | Accounting | NA |
Eve | eve@company.com | 44000 | Sales | 010- 555-3333 |
Frank | frank@comany.com | 37000 | Sales | 010- 555-4444 |
SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
name | salary | dept | name..5 | phone | |
---|---|---|---|---|---|
Alice | alice@company.com | 52000 | Accounting | NA | NA |
Bob | bob@company.com | 40000 | Accounting | Bob | 010- 555-1111 |
Carol | carol@company.com | 30000 | Sales | Carol | 010- 555-2222 |
Dave | dave@company.com | 33000 | Accounting | NA | NA |
Eve | eve@company.com | 44000 | Sales | Eve | 010- 555-3333 |
Frank | frank@comany.com | 37000 | Sales | Frank | 010- 555-4444 |
SQLite does not support directly an OUTER JOIN
or a RIGHT JOIN
.
CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
## index_name
## index_name_email
dbDisconnect(con)