Acknowledgement

Most material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html.

The why of databases

Size of data

  • Small data: those can fit into computer memory.

  • Bigish data: those can fit into disk(s) of a single machine.

  • Big data: those cannot fit into disk(s) of a single machine.

Computer architecture

Key to high performance is effective use of memory hierarchy. True on all architectures.

Numbers everyone should know

Operation Time
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Main memory reference 100 ns
Read 1 MB sequentially from memory 250,000 ns
Read 1 MB sequentially from SSD 1,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

Implications for bigish data

Suppose we have a 10 GB flat data file and that we want to select certain rows based on a given criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:
10 GB × (250 μs/1 MB) = 2.5 seconds

If we have to access the file from SSD (~1GB/sec):
10 GB × (1 ms/1 MB) = 10 seconds

If we have to access the file from disk:
10 GB × (20 ms/1 MB) = 200 seconds

This is just for reading data, if we make any modifications (writing) things are much worse.

Blocks

Cost: Disk << Memory

Speed: Disk <<< Memory

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can’t fit everything into memory?

Create blocks - group rows based on similar attributes and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

Databases

SQL

Structured Query Language is a special purpose language for interacting with (querying and modifying) these indexed tabular data structures.

  • ANSI Standard but with some dialect divergence.

  • SQL functionalities map very closely (but not exactly) with the data manipulation verbs present in dplyr.

  • We will see this mapping in more detail in a bit.

Access databases from R

  • dplyr package supports a variety of databases.
    • Open source databases: SQLite, MySQL, PostgreSQL, BigQuery.
    • Commercial databases: Oracle, Microsoft SQL Server.
    • See link for a complete list.
  • DBI package provides a common interface for connecting to databases.

  • dbplyr package is the backend that translates dplyr verbs to database SQL queries.

  • To install database drivers, follow instructions at https://db.rstudio.com/best-practices/drivers/.

A sample session using SQLite

Create a SQLite database

Create a SQLite database in memory for learning purpose.

library("DBI")
library("RSQLite")
con = dbConnect(RSQLite::SQLite(), "employee.sqlite")
#con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)
## Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   ..@ ptr                :<externalptr> 
##   ..@ dbname             : chr "/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/employee.sqlite"
##   ..@ loadable.extensions: logi TRUE
##   ..@ flags              : int 70
##   ..@ vfs                : chr ""
##   ..@ ref                :<environment: 0x7fbc2b066cf8> 
##   ..@ bigint             : chr "integer64"

Add a table into database

First table:

library("tidyverse")
(employees <- tibble(name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
                     email  = c("alice@company.com", "bob@company.com",
                                "carol@company.com", "dave@company.com",
                                "eve@company.com",   "frank@comany.com"),
                     salary = c(52000, 40000, 30000, 33000, 44000, 37000),
                     dept   = c("Accounting", "Accounting","Sales",
                                "Accounting","Sales","Sales")))
## # A tibble: 6 x 4
##   name  email             salary dept      
##   <chr> <chr>              <dbl> <chr>     
## 1 Alice alice@company.com  52000 Accounting
## 2 Bob   bob@company.com    40000 Accounting
## 3 Carol carol@company.com  30000 Sales     
## 4 Dave  dave@company.com   33000 Accounting
## 5 Eve   eve@company.com    44000 Sales     
## 6 Frank frank@comany.com   37000 Sales

Second table:

(phone <- tibble(name  = c("Bob", "Carol", "Eve", "Frank"),
                 phone = c("010- 555-1111", "010- 555-2222", "010- 555-3333", "010- 555-4444")))
## # A tibble: 4 x 2
##   name  phone        
##   <chr> <chr>        
## 1 Bob   010- 555-1111
## 2 Carol 010- 555-2222
## 3 Eve   010- 555-3333
## 4 Frank 010- 555-4444

Write tables to the database:

dbWriteTable(con, "employees", employees, overwrite = TRUE)
dbWriteTable(con, "phone", phone, overwrite = TRUE)
dbListTables(con)
## [1] "employees" "phone"

Add another table

dbWriteTable(con, "employs", employees)
dbListTables(con)
## [1] "employees" "employs"   "phone"

Remove a table from database

dbRemoveTable(con, "employs")
dbListTables(con)
## [1] "employees" "phone"

Querying tables

res <- dbSendQuery(con, "SELECT * FROM employees")
dbFetch(res)
##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales
dbClearResult(res)

Closing the connection

dbDisconnect(con)

SQL Queries

Following we demonstrate some common SQL commands, although all task can be achieved by uisng dlyr as well.

Connecting

con <- dbConnect(RSQLite::SQLite(), dbname = "employee.sqlite")
dbListTables(con)
## [1] "employees" "phone"
knitr::opts_chunk$set(connection = "con")

SELECT statements

SELECT * FROM employees;
6 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Eve eve@company.com 44000 Sales
Frank frank@comany.com 37000 Sales
SELECT * FROM phone;
4 records
name phone
Bob 010- 555-1111
Carol 010- 555-2222
Eve 010- 555-3333
Frank 010- 555-4444

Select using SELECT

SELECT name AS first_name, salary FROM employees;
6 records
first_name salary
Alice 52000
Bob 40000
Carol 30000
Dave 33000
Eve 44000
Frank 37000

Arrange using ORDER BY

SELECT name AS first_name, salary FROM employees ORDER BY salary;
6 records
first_name salary
Carol 30000
Dave 33000
Frank 37000
Bob 40000
Eve 44000
Alice 52000

Descending order:

SELECT name AS first_name, salary FROM employees ORDER BY salary DESC;
6 records
first_name salary
Alice 52000
Eve 44000
Bob 40000
Frank 37000
Dave 33000
Carol 30000

Filter via WHERE

SELECT * FROM employees WHERE salary < 40000
3 records
name email salary dept
Carol carol@company.com 30000 Sales
Dave dave@company.com 33000 Accounting
Frank frank@comany.com 37000 Sales

Group_by via GROUP BY

SELECT * FROM employees GROUP BY dept;
2 records
name email salary dept
Dave dave@company.com 33000 Accounting
Frank frank@comany.com 37000 Sales

Head via LIMIT

SELECT * FROM employees LIMIT 3;
SELECT * FROM employees ORDER BY name DESC LIMIT 3;
3 records
name email salary dept
Alice alice@company.com 52000 Accounting
Bob bob@company.com 40000 Accounting
Carol carol@company.com 30000 Sales

Join two tables (default)

By default SQLite uses a CROSS JOIN which is not terribly useful

SELECT * FROM employees JOIN phone;
Displaying records 1 - 10
name email salary dept name..5 phone
Alice alice@company.com 52000 Accounting Bob 010- 555-1111
Alice alice@company.com 52000 Accounting Carol 010- 555-2222
Alice alice@company.com 52000 Accounting Eve 010- 555-3333
Alice alice@company.com 52000 Accounting Frank 010- 555-4444
Bob bob@company.com 40000 Accounting Bob 010- 555-1111
Bob bob@company.com 40000 Accounting Carol 010- 555-2222
Bob bob@company.com 40000 Accounting Eve 010- 555-3333
Bob bob@company.com 40000 Accounting Frank 010- 555-4444
Carol carol@company.com 30000 Sales Bob 010- 555-1111
Carol carol@company.com 30000 Sales Carol 010- 555-2222

Inner join by NATURAL

SELECT * FROM employees NATURAL JOIN phone;
4 records
name email salary dept phone
Bob bob@company.com 40000 Accounting 010- 555-1111
Carol carol@company.com 30000 Sales 010- 555-2222
Eve eve@company.com 44000 Sales 010- 555-3333
Frank frank@comany.com 37000 Sales 010- 555-4444

Inner join - explicit

SELECT * FROM employees JOIN phone ON employees.name = phone.name;
4 records
name email salary dept name..5 phone
Bob bob@company.com 40000 Accounting Bob 010- 555-1111
Carol carol@company.com 30000 Sales Carol 010- 555-2222
Eve eve@company.com 44000 Sales Eve 010- 555-3333
Frank frank@comany.com 37000 Sales Frank 010- 555-4444

Left join - natural

SELECT * FROM employees NATURAL LEFT JOIN phone;
6 records
name email salary dept phone
Alice alice@company.com 52000 Accounting NA
Bob bob@company.com 40000 Accounting 010- 555-1111
Carol carol@company.com 30000 Sales 010- 555-2222
Dave dave@company.com 33000 Accounting NA
Eve eve@company.com 44000 Sales 010- 555-3333
Frank frank@comany.com 37000 Sales 010- 555-4444

Left join - explicit

SELECT * FROM employees LEFT JOIN phone ON employees.name = phone.name;
6 records
name email salary dept name..5 phone
Alice alice@company.com 52000 Accounting NA NA
Bob bob@company.com 40000 Accounting Bob 010- 555-1111
Carol carol@company.com 30000 Sales Carol 010- 555-2222
Dave dave@company.com 33000 Accounting NA NA
Eve eve@company.com 44000 Sales Eve 010- 555-3333
Frank frank@comany.com 37000 Sales Frank 010- 555-4444

Other joins

SQLite does not support directly an OUTER JOIN or a RIGHT JOIN.

Creating indices

CREATE INDEX index_name ON employees (name);
CREATE INDEX index_name_email ON employees (name, email);
sqlite3 employee.sqlite .indices
## index_name
## index_name_email

Close connection

dbDisconnect(con)