Combine tables

nycflights13

library(nycflights13)

We already know a lot about flights:

flights

nycflights13 has >1 tables:

airlines

airports

planes

weather

Relational data

Multiple tables of data are called relational data.

Keys

A key is a variable (or set of variables) that uniquely identifies an observation.

A primary key uniquely identifies an observation in its own table, e.g., planes$tailnum
A foreign key uniquely identifies an observation in another table, e.g., flights$tailnum

Mutating joins

Combine variables (colums) from two tables.

Demo tables

    x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         3, "x3"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2",
         4, "y3"
    )

Inner join

An inner join matches pairs of observations whenever their keys are equal:

    inner_join(x, y, by = "key")

Same as

    x %>% inner_join(y, by = "key")

Outer join

An outer join keeps observations that appear in at least one of the tables.
Three types of outer joins:
- A left join keeps all observations in x.
- A right join keeps all observations in y.
- A full join keeps all observations in x or y.

Duplicate keys

What if the keys are not unique?

One table has duplicate keys.

    x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         2, "x3",
         1, "x4"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2"
    )
    left_join(x, y, by = "key")

Both tables have duplicate keys. You get all possible combinations, the Cartesian product:

    x <- tribble(
      ~key, ~val_x,
         1, "x1",
         2, "x2",
         2, "x3",
         3, "x4"
    )
    y <- tribble(
      ~key, ~val_y,
         1, "y1",
         2, "y2",
         2, "y3",
         3, "y4"
    )
    left_join(x, y, by = "key")

Let’s create a narrower table from the flights data:

    flights2 <- flights %>% 
      select(year:day, hour, origin, dest, tailnum, carrier)
    flights2

Defining the key columns

by = NULL (natural join): use all variables that appear in both tables:

    # same as: flights2 %>% left_join(weather)
    left_join(flights2, weather)

## Joining, by = c("year", "month", "day", "hour", "origin")

by = "x": use the common variable x:

    # same as left_join(flights2, planes, by = "tailnum")
    flights2 %>%  left_join(planes, by = "tailnum")

by = c("a" = "b"): match variable a in table x to the variable b in table y.

    flights2 %>%  left_join(airports, c("dest" = "faa")) # match destination, not origin

    flights2 %>%  left_join(airports, c("origin" = "faa"))

Filtering joins

Combine observations (rows) from two tables.

Top 10 most popular destinations:

    top_dest <- flights %>%
      count(dest, sort = TRUE) %>%
      head(10)
    top_dest

How to filter the flights to these destinations?

Semi-join

semi_join(x, y) only keeps the rows in x that have a match in y.

Useful to see what will be joined.

    semi_join(flights, top_dest)

## Joining, by = "dest"

Anti-join

anti_join(x, y) keeps the rows that don’t have a match.

Useful for diagnosing join mismatches.

    flights %>% 
      anti_join(planes, by = "tailnum") %>%
      count(tailnum, sort = TRUE)

Set operations

Generate two tables:

    df1 <- tribble(
      ~x, ~y,
       1,  1,
       2,  1
    )
    df2 <- tribble(
      ~x, ~y,
       1,  1,
       1,  2
    )

intersect(x, y) returns rows that appear in both x and y.

    intersect(df1, df2)

union(x, y) returns unique observations in x and y.

    union(df1, df2)

setdiff(x, y) returns rows that appear in x but not in y.

    setdiff(df1, df2)

    setdiff(df2, df1)

Lecture 6: Relational Data

Joong-Ho Won @ SNU