Most material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html by Dr. Colin Rundel to handle a bigish data set.
We will learn:
Import data from a bigish csv file.
Tidy data.
Deposit data into an SQLite database.
Query SQLite database.
Transform in database and plot in R.
sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5
## [5] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0
## [9] tidyverse_1.2.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.17 cellranger_1.1.0 pillar_1.3.0 compiler_3.5.0
## [5] plyr_1.8.4 bindr_0.1.1 tools_3.5.0 digest_0.6.15
## [9] lubridate_1.7.4 jsonlite_1.5 evaluate_0.10.1 nlme_3.1-137
## [13] gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.2 rlang_0.2.1
## [17] cli_1.0.0 rstudioapi_0.7 yaml_2.1.19 haven_1.1.2
## [21] bindrcpp_0.2.2 withr_2.1.2 xml2_1.2.0 httr_1.3.1
## [25] knitr_1.20 hms_0.4.2 rprojroot_1.3-2 grid_3.5.0
## [29] tidyselect_0.2.4 glue_1.2.0 R6_2.2.2 readxl_1.1.0
## [33] rmarkdown_1.10 modelr_0.1.2 magrittr_1.5 backports_1.1.2
## [37] scales_1.0.0 htmltools_0.3.6 rvest_0.3.2 assertthat_0.2.0
## [41] colorspace_1.3-2 stringi_1.2.3 lazyeval_0.2.1 munsell_0.5.0
## [45] broom_0.5.0 crayon_1.3.4
The /home/stat326_621a/data/nyc_parking/NYParkingViolations.csv
file on teaching server contains information about parking tickets in NYC. This file is also available at here.
You can create a symbolic link to the data file by Bash command
ln -sf /home/stat326_621a/data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv
The CSV file is about 1.7GB
ls -l NYParkingViolations.csv
## -rwxr-xr-x@ 1 jhwon staff 1869025315 Oct 26 2017 NYParkingViolations.csv
How many rows?
wc -l NYParkingViolations.csv
## 9100279 NYParkingViolations.csv
First few lines:
head -5 NYParkingViolations.csv
## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation,Latitude,Longitude,Community Board,Community Council ,Census Tract,BIN,BBL,NTA
## 1283294138,GBB9093,NY,PAS,08/04/2013,46,SUBN,AUDI,P,37250,13610,21190,20140831,0033,33,33,921043,0033,0000,0752A,,,F,712,W 175 ST,,0,408,F1,,BBBBBBB,ALL,ALL,GY,0,2013,-,0,,,,,,,,,,,,,
## 1283294151,62416MB,NY,COM,08/04/2013,46,VAN,FORD,P,37290,40404,40404,20140430,0033,33,33,921043,0033,0000,1240P,,NY,O,201,W 177 ST,,0,408,C,,BBBBBBB,ALL,ALL,WH,0,2012,-,0,,,,,,,,,,,,,
## 1283294163,78755JZ,NY,COM,08/05/2013,46,P-U,CHEVR,P,37030,31190,13610,20140228,0033,33,33,921043,0033,0000,1243P,,NY,O,520,W 163 ST,,0,408,F7,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,,,,,,,,,
## 1283294175,63009MA,NY,COM,08/05/2013,46,VAN,FORD,P,37270,11710,12010,20141031,0033,33,33,921043,0033,0000,0232P,,NY,O,517,W 176 ST,,0,408,F1,,BBBBBBB,ALL,ALL,WH,0,2010,-,0,,,,,,,,,,,,,
read.csv()
function in base R takes nearly 6 minutes to read in the 1.7GB csv file:
system.time(read.csv("NYParkingViolations.csv"))
## user system elapsed
## 223.428 21.170 249.878
data.table is an R package for reading large data sets:
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
system.time({nyc = fread("NYParkingViolations.csv")})
## user system elapsed
## 47.633 2.359 14.419
class(nyc)
## [1] "data.table" "data.frame"
nyc = as.data.frame(nyc)
class(nyc)
## [1] "data.frame"
read_csv()
function in tidyverse is 3-4 times faster than base R:
library("tidyverse")
system.time({nyc = read_csv("NYParkingViolations.csv")})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 53024 Violation Legal Code an integer T 'NYParkingViolations.csv' file 2 53385 Violation Legal Code an integer T 'NYParkingViolations.csv' row 3 53386 Violation Legal Code an integer T 'NYParkingViolations.csv' col 4 53387 Violation Legal Code an integer T 'NYParkingViolations.csv' expected 5 53388 Violation Legal Code an integer T 'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.
## user system elapsed
## 56.384 8.228 66.130
nyc
## # A tibble: 9,100,278 x 51
## `Summons Number` `Plate ID` `Registration S… `Plate Type` `Issue Date`
## <dbl> <chr> <chr> <chr> <chr>
## 1 1283294138 GBB9093 NY PAS 08/04/2013
## 2 1283294151 62416MB NY COM 08/04/2013
## 3 1283294163 78755JZ NY COM 08/05/2013
## 4 1283294175 63009MA NY COM 08/05/2013
## 5 1283294187 91648MC NY COM 08/08/2013
## 6 1283294217 T60DAR NJ PAS 08/11/2013
## 7 1283294229 GCR2838 NY PAS 08/11/2013
## 8 1283983620 XZ764G NJ PAS 08/07/2013
## 9 1283983631 GBH9379 NY PAS 08/07/2013
## 10 1283983667 MCL78B NJ PAS 07/18/2013
## # ... with 9,100,268 more rows, and 46 more variables: `Violation
## # Code` <int>, `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, `Issuing
## # Agency` <chr>, `Street Code1` <int>, `Street Code2` <int>, `Street
## # Code3` <int>, `Vehicle Expiration Date` <int>, `Violation
## # Location` <chr>, `Violation Precinct` <int>, `Issuer Precinct` <int>,
## # `Issuer Code` <int>, `Issuer Command` <chr>, `Issuer Squad` <chr>,
## # `Violation Time` <chr>, `Time First Observed` <chr>, `Violation
## # County` <chr>, `Violation In Front Of Or Opposite` <chr>, `House
## # Number` <chr>, `Street Name` <chr>, `Intersecting Street` <chr>, `Date
## # First Observed` <int>, `Law Section` <int>, `Sub Division` <chr>,
## # `Violation Legal Code` <int>, `Days Parking In Effect` <chr>, `From
## # Hours In Effect` <chr>, `To Hours In Effect` <chr>, `Vehicle
## # Color` <chr>, `Unregistered Vehicle?` <int>, `Vehicle Year` <int>,
## # `Meter Number` <chr>, `Feet From Curb` <int>, `Violation Post
## # Code` <chr>, `Violation Description` <chr>, `No Standing or Stopping
## # Violation` <chr>, `Hydrant Violation` <chr>, `Double Parking
## # Violation` <chr>, Latitude <chr>, Longitude <chr>, `Community
## # Board` <chr>, `Community Council` <chr>, `Census Tract` <chr>,
## # BIN <chr>, BBL <chr>, NTA <chr>
names(nyc) <- str_replace_all(names(nyc), " ", "_")
head(nyc)
## # A tibble: 6 x 51
## Summons_Number Plate_ID Registration_St… Plate_Type Issue_Date
## <dbl> <chr> <chr> <chr> <chr>
## 1 1283294138 GBB9093 NY PAS 08/04/2013
## 2 1283294151 62416MB NY COM 08/04/2013
## 3 1283294163 78755JZ NY COM 08/05/2013
## 4 1283294175 63009MA NY COM 08/05/2013
## 5 1283294187 91648MC NY COM 08/08/2013
## 6 1283294217 T60DAR NJ PAS 08/11/2013
## # ... with 46 more variables: Violation_Code <int>,
## # Vehicle_Body_Type <chr>, Vehicle_Make <chr>, Issuing_Agency <chr>,
## # Street_Code1 <int>, Street_Code2 <int>, Street_Code3 <int>,
## # Vehicle_Expiration_Date <int>, Violation_Location <chr>,
## # Violation_Precinct <int>, Issuer_Precinct <int>, Issuer_Code <int>,
## # Issuer_Command <chr>, Issuer_Squad <chr>, Violation_Time <chr>,
## # Time_First_Observed <chr>, Violation_County <chr>,
## # Violation_In_Front_Of_Or_Opposite <chr>, House_Number <chr>,
## # Street_Name <chr>, Intersecting_Street <chr>,
## # Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## # Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>,
## # From_Hours_In_Effect <chr>, To_Hours_In_Effect <chr>,
## # Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>,
## # Vehicle_Year <int>, Meter_Number <chr>, Feet_From_Curb <int>,
## # Violation_Post_Code <chr>, Violation_Description <chr>,
## # No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>,
## # Double_Parking_Violation <chr>, Latitude <chr>, Longitude <chr>,
## # Community_Board <chr>, Community_Council <chr>, Census_Tract <chr>,
## # BIN <chr>, BBL <chr>, NTA <chr>
We make a smaller data set with selected variables:
# magrittr for %<>%
library("magrittr")
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
(nyc %<>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
House_Number:Intersecting_Street, Vehicle_Color))
## # A tibble: 9,100,278 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <chr> <int> <chr>
## 1 NY PAS 08/04/2013 46 SUBN
## 2 NY COM 08/04/2013 46 VAN
## 3 NY COM 08/05/2013 46 P-U
## 4 NY COM 08/05/2013 46 VAN
## 5 NY COM 08/08/2013 41 TRLR
## 6 NJ PAS 08/11/2013 14 P-U
## 7 NY PAS 08/11/2013 14 VAN
## 8 NJ PAS 08/07/2013 24 DELV
## 9 NY PAS 08/07/2013 24 SDN
## 10 NJ PAS 07/18/2013 24 SDN
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## # Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.
Make variable Issue_Date
of date-time type:
library("lubridate")
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
class(nyc$Issue_Date)
## [1] "character"
nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)
## [1] "Date"
nyc
## # A tibble: 9,100,278 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 2013-08-04 46 SUBN
## 2 NY COM 2013-08-04 46 VAN
## 3 NY COM 2013-08-05 46 P-U
## 4 NY COM 2013-08-05 46 VAN
## 5 NY COM 2013-08-08 41 TRLR
## 6 NJ PAS 2013-08-11 14 P-U
## 7 NY PAS 2013-08-11 14 VAN
## 8 NJ PAS 2013-08-07 24 DELV
## 9 NY PAS 2013-08-07 24 SDN
## 10 NJ PAS 2013-07-18 24 SDN
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## # Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
Only keep data in years 2013-4.
range(nyc$Issue_Date)
## [1] "1970-12-18" "2069-12-23"
nyc$Issue_Date %>% year() %>% table()
## .
## 1970 1971 1973 1974 1976 1977 1979 1981 1983
## 1 6 10 1 2 1 2 4 1
## 1984 1987 1990 1991 1996 2000 2001 2002 2003
## 2 3 2 1 1 319 91 7 39
## 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 77 9 11 13 8 9 129 251 618
## 2013 2014 2015 2016 2017 2018 2019 2020 2021
## 4379109 4716512 1522 296 309 181 329 18 26
## 2022 2023 2024 2025 2026 2027 2028 2029 2030
## 1 31 23 10 4 4 7 3 45
## 2031 2032 2033 2040 2041 2043 2044 2045 2046
## 93 3 8 1 39 9 9 2 7
## 2047 2048 2049 2050 2051 2052 2053 2060 2061
## 6 1 3 1 12 2 1 3 10
## 2063 2064 2066 2067 2069
## 9 5 3 2 1
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 14
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 2013-08-04 46 SUBN
## 2 NY COM 2013-08-04 46 VAN
## 3 NY COM 2013-08-05 46 P-U
## 4 NY COM 2013-08-05 46 VAN
## 5 NY COM 2013-08-08 41 TRLR
## 6 NJ PAS 2013-08-11 14 P-U
## 7 NY PAS 2013-08-11 14 VAN
## 8 NJ PAS 2013-08-07 24 DELV
## 9 NY PAS 2013-08-07 24 SDN
## 10 NJ PAS 2013-07-18 24 SDN
## # ... with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## # Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>
system.time({
nyc <- read_csv("NYParkingViolations.csv") %>%
setNames(str_replace_all(names(.)," ", "_")) %>%
select(Registration_State:Issuing_Agency,
Violation_Location, Violation_Precinct, Violation_Time,
House_Number:Intersecting_Street, Vehicle_Color) %>%
mutate(Issue_Date = mdy(Issue_Date)) %>%
mutate(Issue_Day = day(Issue_Date),
Issue_Month = month(Issue_Date),
Issue_Year = year(Issue_Date),
Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
filter(Issue_Year %in% 2013:2014)
})
## Parsed with column specification:
## cols(
## .default = col_character(),
## `Summons Number` = col_double(),
## `Violation Code` = col_integer(),
## `Street Code1` = col_integer(),
## `Street Code2` = col_integer(),
## `Street Code3` = col_integer(),
## `Vehicle Expiration Date` = col_integer(),
## `Violation Precinct` = col_integer(),
## `Issuer Precinct` = col_integer(),
## `Issuer Code` = col_integer(),
## `Date First Observed` = col_integer(),
## `Law Section` = col_integer(),
## `Violation Legal Code` = col_integer(),
## `Unregistered Vehicle?` = col_integer(),
## `Vehicle Year` = col_integer(),
## `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col row col expected actual file expected <int> <chr> <chr> <chr> <chr> actual 1 53024 Violation Legal Code an integer T 'NYParkingViolations.csv' file 2 53385 Violation Legal Code an integer T 'NYParkingViolations.csv' row 3 53386 Violation Legal Code an integer T 'NYParkingViolations.csv' col 4 53387 Violation Legal Code an integer T 'NYParkingViolations.csv' expected 5 53388 Violation Legal Code an integer T 'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.
## user system elapsed
## 59.294 8.080 68.867
nyc
## # A tibble: 9,095,621 x 18
## Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
## <chr> <chr> <date> <int> <chr>
## 1 NY PAS 2013-08-04 46 SUBN
## 2 NY COM 2013-08-04 46 VAN
## 3 NY COM 2013-08-05 46 P-U
## 4 NY COM 2013-08-05 46 VAN
## 5 NY COM 2013-08-08 41 TRLR
## 6 NJ PAS 2013-08-11 14 P-U
## 7 NY PAS 2013-08-11 14 VAN
## 8 NJ PAS 2013-08-07 24 DELV
## 9 NY PAS 2013-08-07 24 SDN
## 10 NJ PAS 2013-07-18 24 SDN
## # ... with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## # Issuing_Agency <chr>, Violation_Location <chr>,
## # Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## # Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## # Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>
nyc %>%
group_by(Issue_Date) %>%
summarize(n = n()) %>%
ggplot(aes(x = Issue_Date, y = n)) +
geom_line() +
xlim(mdy("7/1/2013"), mdy("6/30/2014"))
## Warning: Removed 365 rows containing missing values (geom_path).
Import the nyc
data set into the SQLite database:
library("DBI")
library("RSQLite")
library("tidyverse")
if (Sys.info()[["sysname"]] == "Linux") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "/home/stat326_621a/data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
db <- dbConnect(RSQLite::SQLite(),
dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
## [1] "nyc"
dbDisconnect(db)
You can create a symbolic link to the database file by Bash command
ln -sf /home/stat326_621a/data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite
Size of the database file
ls -l NYParkingViolations.sqlite
## -rw-r--r--@ 1 jhwon staff 731770880 Nov 1 11:06 NYParkingViolations.sqlite
Connect to table in database:
db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)
## [1] "nyc"
nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
## .. .. ..@ ptr :<externalptr>
## .. .. ..@ dbname : chr "/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x7fbeb1eee9a8>
## .. .. ..@ bigint : chr "integer64"
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "nyc"
## ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
nyc_sql %>% print(width = Inf)
## # Source: table<nyc> [?? x 18]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## Registration_State Plate_Type Issue_Date Violation_Code
## <chr> <chr> <dbl> <int>
## 1 NY PAS 15921 46
## 2 NY COM 15921 46
## 3 NY COM 15922 46
## 4 NY COM 15922 46
## 5 NY COM 15925 41
## 6 NJ PAS 15928 14
## 7 NY PAS 15928 14
## 8 NJ PAS 15924 24
## 9 NY PAS 15924 24
## 10 NJ PAS 15904 24
## Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
## <chr> <chr> <chr> <chr>
## 1 SUBN AUDI P 0033
## 2 VAN FORD P 0033
## 3 P-U CHEVR P 0033
## 4 VAN FORD P 0033
## 5 TRLR GMC P 0033
## 6 P-U DODGE P 0033
## 7 VAN <NA> P 0033
## 8 DELV FORD X 0088
## 9 SDN TOYOT X 0088
## 10 SDN SUBAR H 0079
## Violation_Precinct Violation_Time House_Number Street_Name
## <int> <chr> <chr> <chr>
## 1 33 0752A 712 W 175 ST
## 2 33 1240P 201 W 177 ST
## 3 33 1243P 520 W 163 ST
## 4 33 0232P 517 W 176 ST
## 5 33 1239P 590 W 174 ST
## 6 33 0617P 525 W 175 ST
## 7 33 0741P 551 W 175 ST
## 8 88 0425A 100 N.PORTLAND AVE
## 9 88 0437A 100 N.PORTLAND AVE
## 10 79 0839A <NA> 760 BROADWAY ,BROOKL
## Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
## <chr> <chr> <int> <dbl> <dbl>
## 1 <NA> GY 4 8 2013
## 2 <NA> WH 4 8 2013
## 3 <NA> <NA> 5 8 2013
## 4 <NA> WH 5 8 2013
## 5 <NA> BR 8 8 2013
## 6 <NA> RD 11 8 2013
## 7 <NA> GN 11 8 2013
## 8 BROOKLYN NY 11206 WHITE 7 8 2013
## 9 BROOKLYN NY 11206 WHITE 7 8 2013
## 10 WOODHULL HOSPITAL BLACK 18 7 2013
## Issue_WDay
## <chr>
## 1 Sun
## 2 Sun
## 3 Mon
## 4 Mon
## 5 Thu
## 6 Sun
## 7 Sun
## 8 Wed
## 9 Wed
## 10 Thu
## # ... with more rows
addr <- nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr
## # Source: lazy query [?? x 5]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Precin… House_Number Street_Name
## <dbl> <chr> <int> <chr> <chr>
## 1 15921 P 33 712 W 175 ST
## 2 15921 P 33 201 W 177 ST
## 3 15922 P 33 520 W 163 ST
## 4 15922 P 33 517 W 176 ST
## 5 15925 P 33 590 W 174 ST
## 6 15928 P 33 525 W 175 ST
## 7 15928 P 33 551 W 175 ST
## 8 15891 P 13 59 W 21 STREET
## 9 15906 T 28 <NA> N/S WILLOUGHB…
## 10 15904 T 1 <NA> <NA>
## # ... with more rows
class(addr)
## [1] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
show_query(addr)
## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `House_Number`, `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
addr %>% mutate(address = paste(House_Number, Street_Name))
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## Issue_Date Issuing_Agency Violation_Preci… House_Number Street_Name
## <dbl> <chr> <int> <chr> <chr>
## 1 15921 P 33 712 W 175 ST
## 2 15921 P 33 201 W 177 ST
## 3 15922 P 33 520 W 163 ST
## 4 15922 P 33 517 W 176 ST
## 5 15925 P 33 590 W 174 ST
## 6 15928 P 33 525 W 175 ST
## 7 15928 P 33 551 W 175 ST
## 8 15891 P 13 59 W 21 STREET
## 9 15906 T 28 <NA> N/S WILLOU…
## 10 15904 T 1 <NA> <NA>
## # ... with more rows, and 1 more variable: address <chr>
addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))
## # Source: lazy query [?? x 1]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## mean
## <dbl>
## 1 16.1
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <int> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
addr %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
show_query()
## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `House_Number`, `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`
dbplyr package (a dplyr backend for databases) has a function, translate_sql
, that lets you experiment with how R functions are translated to SQL:
library("dbplyr")
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning
## <SQL> avg("x") OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> avg("x") OVER ()
dplyr using tibble:
system.time(
nyc %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.340 0.090 0.436
dplyr using SQLite:
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
)
## user system elapsed
## 0.005 0.000 0.004
nyc_sql
was 30x times faster than nyc
, but the former is disk based while the latter is in memory. Why is the discrepancy?
dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.
When building a query, we don’t want the entire table, often we want just enough to check if our query is working.
Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.
Therefore, by default dplyr
won’t connect and query the database until absolutely necessary (e.g. show output),
and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n())
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.22.0
## # [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## # Groups: Issuing_Agency
## Issuing_Agency Violation_Precinct n
## <chr> <int> <int>
## 1 A 1 13
## 2 A 7 1
## 3 A 10 24
## 4 A 11 1
## 5 A 14 47
## 6 A 33 11
## 7 B 25 2
## 8 C 5 73
## 9 C 13 7
## 10 D 1 1
## # ... with more rows
To force a full query and return a complete table it is necessary to use the collect
function.
system.time(
nyc_sql %>%
select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
group_by(Issuing_Agency, Violation_Precinct) %>%
summarize(n = n()) %>%
collect()
)
## user system elapsed
## 3.453 0.341 3.830
Suppose we want the bar plot of number of tickets on each weekday. We encounter error with the usual ggplot2 command:
nyc_sql %>%
ggplot() +
geom_bar(aes(x = Issue_WDay))
This is because ggplot2 needs to compute the count per category by going through all the rows. But here nyc_sql
is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.
nyc_sql %>%
group_by(Issue_WDay) %>%
count() %>%
collect() %>%
ggplot() +
geom_col(aes(x = Issue_WDay, y = n))
For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.
dbDisconnect(db)