Acknowledgement

Most material in this lecture is adpated from http://www2.stat.duke.edu/~cr173/Sta523_Fa17/sql.html and http://www2.stat.duke.edu/~cr173/Sta523_Fa17/bigish_data.html by Dr. Colin Rundel to handle a bigish data set.

Introduction

We will learn:

Machine information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.6     purrr_0.2.5    
## [5] readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_3.0.0  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.17     cellranger_1.1.0 pillar_1.3.0     compiler_3.5.0  
##  [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.0      digest_0.6.15   
##  [9] lubridate_1.7.4  jsonlite_1.5     evaluate_0.10.1  nlme_3.1-137    
## [13] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.2  rlang_0.2.1     
## [17] cli_1.0.0        rstudioapi_0.7   yaml_2.1.19      haven_1.1.2     
## [21] bindrcpp_0.2.2   withr_2.1.2      xml2_1.2.0       httr_1.3.1      
## [25] knitr_1.20       hms_0.4.2        rprojroot_1.3-2  grid_3.5.0      
## [29] tidyselect_0.2.4 glue_1.2.0       R6_2.2.2         readxl_1.1.0    
## [33] rmarkdown_1.10   modelr_0.1.2     magrittr_1.5     backports_1.1.2 
## [37] scales_1.0.0     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0
## [41] colorspace_1.3-2 stringi_1.2.3    lazyeval_0.2.1   munsell_0.5.0   
## [45] broom_0.5.0      crayon_1.3.4

CSV file

The /home/stat326_621a/data/nyc_parking/NYParkingViolations.csv file on teaching server contains information about parking tickets in NYC. This file is also available at here.

You can create a symbolic link to the data file by Bash command

ln -sf /home/stat326_621a/data/nyc_parking/NYParkingViolations.csv NYParkingViolations.csv

The CSV file is about 1.7GB

ls -l NYParkingViolations.csv
## -rwxr-xr-x@ 1 jhwon  staff  1869025315 Oct 26  2017 NYParkingViolations.csv

How many rows?

wc -l NYParkingViolations.csv
##  9100279 NYParkingViolations.csv

First few lines:

head -5 NYParkingViolations.csv
## Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation,Latitude,Longitude,Community Board,Community Council ,Census Tract,BIN,BBL,NTA
## 1283294138,GBB9093,NY,PAS,08/04/2013,46,SUBN,AUDI,P,37250,13610,21190,20140831,0033,33,33,921043,0033,0000,0752A,,,F,712,W 175 ST,,0,408,F1,,BBBBBBB,ALL,ALL,GY,0,2013,-,0,,,,,,,,,,,,,
## 1283294151,62416MB,NY,COM,08/04/2013,46,VAN,FORD,P,37290,40404,40404,20140430,0033,33,33,921043,0033,0000,1240P,,NY,O,201,W 177 ST,,0,408,C,,BBBBBBB,ALL,ALL,WH,0,2012,-,0,,,,,,,,,,,,,
## 1283294163,78755JZ,NY,COM,08/05/2013,46,P-U,CHEVR,P,37030,31190,13610,20140228,0033,33,33,921043,0033,0000,1243P,,NY,O,520,W 163 ST,,0,408,F7,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,,,,,,,,,
## 1283294175,63009MA,NY,COM,08/05/2013,46,VAN,FORD,P,37270,11710,12010,20141031,0033,33,33,921043,0033,0000,0232P,,NY,O,517,W 176 ST,,0,408,F1,,BBBBBBB,ALL,ALL,WH,0,2010,-,0,,,,,,,,,,,,,

Read csv file

Fix column names

names(nyc) <- str_replace_all(names(nyc), " ", "_")
head(nyc)
## # A tibble: 6 x 51
##   Summons_Number Plate_ID Registration_St… Plate_Type Issue_Date
##            <dbl> <chr>    <chr>            <chr>      <chr>     
## 1     1283294138 GBB9093  NY               PAS        08/04/2013
## 2     1283294151 62416MB  NY               COM        08/04/2013
## 3     1283294163 78755JZ  NY               COM        08/05/2013
## 4     1283294175 63009MA  NY               COM        08/05/2013
## 5     1283294187 91648MC  NY               COM        08/08/2013
## 6     1283294217 T60DAR   NJ               PAS        08/11/2013
## # ... with 46 more variables: Violation_Code <int>,
## #   Vehicle_Body_Type <chr>, Vehicle_Make <chr>, Issuing_Agency <chr>,
## #   Street_Code1 <int>, Street_Code2 <int>, Street_Code3 <int>,
## #   Vehicle_Expiration_Date <int>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Issuer_Precinct <int>, Issuer_Code <int>,
## #   Issuer_Command <chr>, Issuer_Squad <chr>, Violation_Time <chr>,
## #   Time_First_Observed <chr>, Violation_County <chr>,
## #   Violation_In_Front_Of_Or_Opposite <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>,
## #   Date_First_Observed <int>, Law_Section <int>, Sub_Division <chr>,
## #   Violation_Legal_Code <int>, Days_Parking_In_Effect <chr>,
## #   From_Hours_In_Effect <chr>, To_Hours_In_Effect <chr>,
## #   Vehicle_Color <chr>, `Unregistered_Vehicle?` <int>,
## #   Vehicle_Year <int>, Meter_Number <chr>, Feet_From_Curb <int>,
## #   Violation_Post_Code <chr>, Violation_Description <chr>,
## #   No_Standing_or_Stopping_Violation <chr>, Hydrant_Violation <chr>,
## #   Double_Parking_Violation <chr>, Latitude <chr>, Longitude <chr>,
## #   Community_Board <chr>, Community_Council <chr>, Census_Tract <chr>,
## #   BIN <chr>, BBL <chr>, NTA <chr>

Simplifying

We make a smaller data set with selected variables:

# magrittr for %<>%
library("magrittr")
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
(nyc %<>% 
  select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         House_Number:Intersecting_Street, Vehicle_Color))
## # A tibble: 9,100,278 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <chr>               <int> <chr>           
##  1 NY               PAS        08/04/2013             46 SUBN            
##  2 NY               COM        08/04/2013             46 VAN             
##  3 NY               COM        08/05/2013             46 P-U             
##  4 NY               COM        08/05/2013             46 VAN             
##  5 NY               COM        08/08/2013             41 TRLR            
##  6 NJ               PAS        08/11/2013             14 P-U             
##  7 NY               PAS        08/11/2013             14 VAN             
##  8 NJ               PAS        08/07/2013             24 DELV            
##  9 NY               PAS        08/07/2013             24 SDN             
## 10 NJ               PAS        07/18/2013             24 SDN             
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Fix dates

Hadley Wickham’s lubridate package facilitates handling date and time. Read Chapter 16 of R for Data Science for more information.

Make variable Issue_Date of date-time type:

library("lubridate")
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year
## The following object is masked from 'package:base':
## 
##     date
class(nyc$Issue_Date)
## [1] "character"
nyc %<>% mutate(Issue_Date = mdy(Issue_Date))
class(nyc$Issue_Date)
## [1] "Date"
nyc
## # A tibble: 9,100,278 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <int> <chr>           
##  1 NY               PAS        2013-08-04             46 SUBN            
##  2 NY               COM        2013-08-04             46 VAN             
##  3 NY               COM        2013-08-05             46 P-U             
##  4 NY               COM        2013-08-05             46 VAN             
##  5 NY               COM        2013-08-08             41 TRLR            
##  6 NJ               PAS        2013-08-11             14 P-U             
##  7 NY               PAS        2013-08-11             14 VAN             
##  8 NJ               PAS        2013-08-07             24 DELV            
##  9 NY               PAS        2013-08-07             24 SDN             
## 10 NJ               PAS        2013-07-18             24 SDN             
## # ... with 9,100,268 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Only keep data in years 2013-4.

range(nyc$Issue_Date)
## [1] "1970-12-18" "2069-12-23"
nyc$Issue_Date %>% year() %>% table()
## .
##    1970    1971    1973    1974    1976    1977    1979    1981    1983 
##       1       6      10       1       2       1       2       4       1 
##    1984    1987    1990    1991    1996    2000    2001    2002    2003 
##       2       3       2       1       1     319      91       7      39 
##    2004    2005    2006    2007    2008    2009    2010    2011    2012 
##      77       9      11      13       8       9     129     251     618 
##    2013    2014    2015    2016    2017    2018    2019    2020    2021 
## 4379109 4716512    1522     296     309     181     329      18      26 
##    2022    2023    2024    2025    2026    2027    2028    2029    2030 
##       1      31      23      10       4       4       7       3      45 
##    2031    2032    2033    2040    2041    2043    2044    2045    2046 
##      93       3       8       1      39       9       9       2       7 
##    2047    2048    2049    2050    2051    2052    2053    2060    2061 
##       6       1       3       1      12       2       1       3      10 
##    2063    2064    2066    2067    2069 
##       9       5       3       2       1
filter(nyc, Issue_Date >= mdy("1/1/2013"), Issue_Date <= mdy("12/31/2014"))
## # A tibble: 9,095,621 x 14
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <int> <chr>           
##  1 NY               PAS        2013-08-04             46 SUBN            
##  2 NY               COM        2013-08-04             46 VAN             
##  3 NY               COM        2013-08-05             46 P-U             
##  4 NY               COM        2013-08-05             46 VAN             
##  5 NY               COM        2013-08-08             41 TRLR            
##  6 NJ               PAS        2013-08-11             14 P-U             
##  7 NY               PAS        2013-08-11             14 VAN             
##  8 NJ               PAS        2013-08-07             24 DELV            
##  9 NY               PAS        2013-08-07             24 SDN             
## 10 NJ               PAS        2013-07-18             24 SDN             
## # ... with 9,095,611 more rows, and 9 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>

Putting it all together

system.time({
  nyc <- read_csv("NYParkingViolations.csv") %>%
    setNames(str_replace_all(names(.)," ", "_")) %>%
    select(Registration_State:Issuing_Agency, 
         Violation_Location, Violation_Precinct, Violation_Time,
         House_Number:Intersecting_Street, Vehicle_Color) %>%
    mutate(Issue_Date = mdy(Issue_Date)) %>% 
    mutate(Issue_Day = day(Issue_Date),
           Issue_Month = month(Issue_Date),
           Issue_Year = year(Issue_Date),
           Issue_WDay = wday(Issue_Date, label=TRUE)) %>%
    filter(Issue_Year %in% 2013:2014)  
})
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   `Summons Number` = col_double(),
##   `Violation Code` = col_integer(),
##   `Street Code1` = col_integer(),
##   `Street Code2` = col_integer(),
##   `Street Code3` = col_integer(),
##   `Vehicle Expiration Date` = col_integer(),
##   `Violation Precinct` = col_integer(),
##   `Issuer Precinct` = col_integer(),
##   `Issuer Code` = col_integer(),
##   `Date First Observed` = col_integer(),
##   `Law Section` = col_integer(),
##   `Violation Legal Code` = col_integer(),
##   `Unregistered Vehicle?` = col_integer(),
##   `Vehicle Year` = col_integer(),
##   `Feet From Curb` = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 654437 parsing failures.
## row # A tibble: 5 x 5 col     row col                  expected   actual file                      expected   <int> <chr>                <chr>      <chr>  <chr>                     actual 1 53024 Violation Legal Code an integer T      'NYParkingViolations.csv' file 2 53385 Violation Legal Code an integer T      'NYParkingViolations.csv' row 3 53386 Violation Legal Code an integer T      'NYParkingViolations.csv' col 4 53387 Violation Legal Code an integer T      'NYParkingViolations.csv' expected 5 53388 Violation Legal Code an integer T      'NYParkingViolations.csv'
## ... ................. ... ........................................................................ ........ ........................................................................ ...... ........................................................................ .... ........................................................................ ... ........................................................................ ... ........................................................................ ........ ........................................................................
## See problems(...) for more details.
##    user  system elapsed 
##  59.294   8.080  68.867
nyc
## # A tibble: 9,095,621 x 18
##    Registration_St… Plate_Type Issue_Date Violation_Code Vehicle_Body_Ty…
##    <chr>            <chr>      <date>              <int> <chr>           
##  1 NY               PAS        2013-08-04             46 SUBN            
##  2 NY               COM        2013-08-04             46 VAN             
##  3 NY               COM        2013-08-05             46 P-U             
##  4 NY               COM        2013-08-05             46 VAN             
##  5 NY               COM        2013-08-08             41 TRLR            
##  6 NJ               PAS        2013-08-11             14 P-U             
##  7 NY               PAS        2013-08-11             14 VAN             
##  8 NJ               PAS        2013-08-07             24 DELV            
##  9 NY               PAS        2013-08-07             24 SDN             
## 10 NJ               PAS        2013-07-18             24 SDN             
## # ... with 9,095,611 more rows, and 13 more variables: Vehicle_Make <chr>,
## #   Issuing_Agency <chr>, Violation_Location <chr>,
## #   Violation_Precinct <int>, Violation_Time <chr>, House_Number <chr>,
## #   Street_Name <chr>, Intersecting_Street <chr>, Vehicle_Color <chr>,
## #   Issue_Day <int>, Issue_Month <dbl>, Issue_Year <dbl>, Issue_WDay <ord>

Ticket frequency

nyc %>% 
  group_by(Issue_Date) %>% 
  summarize(n = n()) %>%
  ggplot(aes(x = Issue_Date, y = n)) + 
    geom_line() + 
    xlim(mdy("7/1/2013"), mdy("6/30/2014"))
## Warning: Removed 365 rows containing missing values (geom_path).

Create an SQLite database

Import the nyc data set into the SQLite database:

library("DBI")
library("RSQLite")
library("tidyverse")
if (Sys.info()[["sysname"]] == "Linux") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "/home/stat326_621a/data/nyc_parking/NYParkingViolations.sqlite")
} else if (Sys.info()[["sysname"]] == "Darwin") {
  db <- dbConnect(RSQLite::SQLite(), 
                  dbname = "./NYParkingViolations.sqlite")
}
dbWriteTable(db, "nyc", nyc, overwrite = TRUE)
dbListTables(db)
## [1] "nyc"
dbDisconnect(db)

You can create a symbolic link to the database file by Bash command

ln -sf /home/stat326_621a/data/nyc_parking/NYParkingViolations.sqlite NYParkingViolations.sqlite

Size of the database file

ls -l NYParkingViolations.sqlite
## -rw-r--r--@ 1 jhwon  staff  731770880 Nov  1 11:06 NYParkingViolations.sqlite

Read data from database

Connect to table in database:

db <- dbConnect(RSQLite::SQLite(), dbname = "./NYParkingViolations.sqlite")
dbListTables(db)
## [1] "nyc"
nyc_sql <- dplyr::tbl(db, "nyc")
str(nyc_sql)
## List of 2
##  $ src:List of 2
##   ..$ con  :Formal class 'SQLiteConnection' [package "RSQLite"] with 7 slots
##   .. .. ..@ ptr                :<externalptr> 
##   .. .. ..@ dbname             : chr "/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite"
##   .. .. ..@ loadable.extensions: logi TRUE
##   .. .. ..@ flags              : int 70
##   .. .. ..@ vfs                : chr ""
##   .. .. ..@ ref                :<environment: 0x7fbeb1eee9a8> 
##   .. .. ..@ bigint             : chr "integer64"
##   ..$ disco: NULL
##   ..- attr(*, "class")= chr [1:3] "src_dbi" "src_sql" "src"
##  $ ops:List of 2
##   ..$ x   : 'ident' chr "nyc"
##   ..$ vars: chr [1:18] "Registration_State" "Plate_Type" "Issue_Date" "Violation_Code" ...
##   ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
##  - attr(*, "class")= chr [1:4] "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
nyc_sql %>% print(width = Inf)
## # Source:   table<nyc> [?? x 18]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
##    Registration_State Plate_Type Issue_Date Violation_Code
##    <chr>              <chr>           <dbl>          <int>
##  1 NY                 PAS             15921             46
##  2 NY                 COM             15921             46
##  3 NY                 COM             15922             46
##  4 NY                 COM             15922             46
##  5 NY                 COM             15925             41
##  6 NJ                 PAS             15928             14
##  7 NY                 PAS             15928             14
##  8 NJ                 PAS             15924             24
##  9 NY                 PAS             15924             24
## 10 NJ                 PAS             15904             24
##    Vehicle_Body_Type Vehicle_Make Issuing_Agency Violation_Location
##    <chr>             <chr>        <chr>          <chr>             
##  1 SUBN              AUDI         P              0033              
##  2 VAN               FORD         P              0033              
##  3 P-U               CHEVR        P              0033              
##  4 VAN               FORD         P              0033              
##  5 TRLR              GMC          P              0033              
##  6 P-U               DODGE        P              0033              
##  7 VAN               <NA>         P              0033              
##  8 DELV              FORD         X              0088              
##  9 SDN               TOYOT        X              0088              
## 10 SDN               SUBAR        H              0079              
##    Violation_Precinct Violation_Time House_Number Street_Name         
##                 <int> <chr>          <chr>        <chr>               
##  1                 33 0752A          712          W 175 ST            
##  2                 33 1240P          201          W 177 ST            
##  3                 33 1243P          520          W 163 ST            
##  4                 33 0232P          517          W 176 ST            
##  5                 33 1239P          590          W 174 ST            
##  6                 33 0617P          525          W 175 ST            
##  7                 33 0741P          551          W 175 ST            
##  8                 88 0425A          100          N.PORTLAND AVE      
##  9                 88 0437A          100          N.PORTLAND AVE      
## 10                 79 0839A          <NA>         760 BROADWAY ,BROOKL
##    Intersecting_Street Vehicle_Color Issue_Day Issue_Month Issue_Year
##    <chr>               <chr>             <int>       <dbl>      <dbl>
##  1 <NA>                GY                    4           8       2013
##  2 <NA>                WH                    4           8       2013
##  3 <NA>                <NA>                  5           8       2013
##  4 <NA>                WH                    5           8       2013
##  5 <NA>                BR                    8           8       2013
##  6 <NA>                RD                   11           8       2013
##  7 <NA>                GN                   11           8       2013
##  8 BROOKLYN NY 11206   WHITE                 7           8       2013
##  9 BROOKLYN NY 11206   WHITE                 7           8       2013
## 10 WOODHULL HOSPITAL   BLACK                18           7       2013
##    Issue_WDay
##    <chr>     
##  1 Sun       
##  2 Sun       
##  3 Mon       
##  4 Mon       
##  5 Thu       
##  6 Sun       
##  7 Sun       
##  8 Wed       
##  9 Wed       
## 10 Thu       
## # ... with more rows

Use dplyr with SQLite

addr <- nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
  filter(Violation_Precinct >= 1, Violation_Precinct <= 34)
addr
## # Source:   lazy query [?? x 5]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Precin… House_Number Street_Name   
##         <dbl> <chr>                      <int> <chr>        <chr>         
##  1      15921 P                             33 712          W 175 ST      
##  2      15921 P                             33 201          W 177 ST      
##  3      15922 P                             33 520          W 163 ST      
##  4      15922 P                             33 517          W 176 ST      
##  5      15925 P                             33 590          W 174 ST      
##  6      15928 P                             33 525          W 175 ST      
##  7      15928 P                             33 551          W 175 ST      
##  8      15891 P                             13 59           W 21 STREET   
##  9      15906 T                             28 <NA>         N/S WILLOUGHB…
## 10      15904 T                              1 <NA>         <NA>          
## # ... with more rows

SQL query

class(addr)
## [1] "tbl_dbi"  "tbl_sql"  "tbl_lazy" "tbl"
show_query(addr)
## <SQL>
## SELECT *
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `House_Number`, `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))

Mutate and summarise

addr %>% mutate(address = paste(House_Number, Street_Name))
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
##    Issue_Date Issuing_Agency Violation_Preci… House_Number Street_Name
##         <dbl> <chr>                     <int> <chr>        <chr>      
##  1      15921 P                            33 712          W 175 ST   
##  2      15921 P                            33 201          W 177 ST   
##  3      15922 P                            33 520          W 163 ST   
##  4      15922 P                            33 517          W 176 ST   
##  5      15925 P                            33 590          W 174 ST   
##  6      15928 P                            33 525          W 175 ST   
##  7      15928 P                            33 551          W 175 ST   
##  8      15891 P                            13 59           W 21 STREET
##  9      15906 T                            28 <NA>         N/S WILLOU…
## 10      15904 T                             1 <NA>         <NA>       
## # ... with more rows, and 1 more variable: address <chr>
addr %>% summarize(mean = mean(Violation_Precinct, na.rm = TRUE))
## # Source:   lazy query [?? x 1]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
##    mean
##   <dbl>
## 1  16.1

SQL grouping

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n())
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <int> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # ... with more rows

SQL query

addr %>% 
  group_by(Issuing_Agency, Violation_Precinct) %>% 
  summarize(n = n()) %>% 
  show_query()
## <SQL>
## SELECT `Issuing_Agency`, `Violation_Precinct`, COUNT() AS `n`
## FROM (SELECT `Issue_Date`, `Issuing_Agency`, `Violation_Precinct`, `House_Number`, `Street_Name`
## FROM `nyc`)
## WHERE ((`Violation_Precinct` >= 1.0) AND (`Violation_Precinct` <= 34.0))
## GROUP BY `Issuing_Agency`, `Violation_Precinct`

SQL translation

dbplyr package (a dplyr backend for databases) has a function, translate_sql, that lets you experiment with how R functions are translated to SQL:

library("dbplyr")
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
translate_sql(x == 1 & (y < 2 | z > 3))
## <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
## <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
## <SQL> "x" % 2.0 = 10.0
translate_sql(paste(x, y))
## <SQL> PASTE("x", "y")
translate_sql(mean(x))
## Warning: Missing values are always removed in SQL.
## Use `avg(x, na.rm = TRUE)` to silence this warning
## <SQL> avg("x") OVER ()
translate_sql(mean(x, na.rm = TRUE))
## <SQL> avg("x") OVER ()

Timings

dplyr using tibble:

system.time(
  nyc %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)
##    user  system elapsed 
##   0.340   0.090   0.436

dplyr using SQLite:

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n())
)
##    user  system elapsed 
##   0.005   0.000   0.004

nyc_sql was 30x times faster than nyc, but the former is disk based while the latter is in memory. Why is the discrepancy?

Laziness

dplyr/dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

nyc_sql %>%
  select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
  filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
  group_by(Issuing_Agency, Violation_Precinct) %>%
  summarize(n = n())
## # Source:   lazy query [?? x 3]
## # Database: sqlite 3.22.0
## #   [/Users/jhwon/Dropbox/class/326.621A/2018/datasci/lectures/11-dbplyr/NYParkingViolations.sqlite]
## # Groups:   Issuing_Agency
##    Issuing_Agency Violation_Precinct     n
##    <chr>                       <int> <int>
##  1 A                               1    13
##  2 A                               7     1
##  3 A                              10    24
##  4 A                              11     1
##  5 A                              14    47
##  6 A                              33    11
##  7 B                              25     2
##  8 C                               5    73
##  9 C                              13     7
## 10 D                               1     1
## # ... with more rows

Full query

To force a full query and return a complete table it is necessary to use the collect function.

system.time(
  nyc_sql %>%
    select(Issue_Date, Issuing_Agency, Violation_Precinct, House_Number, Street_Name) %>%
    filter(Violation_Precinct >=1, Violation_Precinct <= 34) %>%
    group_by(Issuing_Agency, Violation_Precinct) %>%
    summarize(n = n()) %>%
    collect()
)
##    user  system elapsed 
##   3.453   0.341   3.830

Plotting

Bar plot

Suppose we want the bar plot of number of tickets on each weekday. We encounter error with the usual ggplot2 command:

nyc_sql %>%
  ggplot() + 
    geom_bar(aes(x = Issue_WDay))

This is because ggplot2 needs to compute the count per category by going through all the rows. But here nyc_sql is just a pointer to the SQLite table. We have to use the transform in database, plot in R strategy.

nyc_sql %>%
  group_by(Issue_WDay) %>%
  count() %>%
  collect() %>%
  ggplot() + 
    geom_col(aes(x = Issue_WDay, y = n))

Other plots

For example of making histogram and raster plot, read tutorial https://db.rstudio.com/best-practices/visualization/.

Close connection to database

dbDisconnect(db)