Lecture 4: Exploratory Data Analysis

Introduction

Goal of EDA is to develop an understanding of your data.

EDA is an iterative cycle:

Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.

Data cleaning: an application of EDA

ask questions about whether your data meets your expectations or not.
need to deploy all the tools of EDA: visualisation, transformation, and modelling.

Questions

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

Some definitions

A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation (data point) is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.
Tabular data is a set of values, each associated with a variable and an observation.

Variation

Tendency of the values of a variable to change from measurement to measurement.

Examples

If you measure any continuous variable twice, you will get two different results.
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across
- different subjects (e.g. the eye colors of different people)
- different times (e.g. the energy levels of an electron at different moments)

Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.

Visualising variations

Categorical variables – bar chart:

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

Manual computation:

    diamonds %>% dplyr::count(cut)

Continuous variables (e.g., numbers, date-times) – histogram:

ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Manual computation:

  diamonds %>% count(ggplot2::cut_width(carat, 0.5))

The variable is on the \(x\)-axis. The numbers of observation that falls into each of the equally spaced bins is on the \(y\)-axis.
Set binwidth:

  smaller <- diamonds %>% 
    filter(carat < 3)  # zoom into just the diamonds with a size of less than three carats
  
  ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.1)

Overlaying multiple histograms in the same plot:

  ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
    geom_freqpoly(binwidth = 0.1)     # use geom_freqploy() instead of geom_histogram()

Typical values

Turning bar charts or histograms into questions:

Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)

Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?

Subgroups

Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
1. How are the observations within each cluster similar to each other?
2. How are the observations in separate clusters different from each other?
3. How can you explain or describe the clusters?
4. Why might the appearance of clusters be misleading?
Length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park:

    ggplot(data = faithful, mapping = aes(x = eruptions)) + 
        geom_histogram(binwidth = 0.25)

Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.

Unusual values

Why are the limits on the x-axis are so wide?

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

There are so many observations in the common bins that the rare bins are so short that you can’t see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values,

Let’s zoom to small values of the y-axis with coord_cartesian():

ggplot(diamonds) + 
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))

This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:

unusual <- diamonds %>% 
  filter(y < 3 | y > 20) %>% 
  select(price, x, y, z) %>%
  arrange(y)
unusual

The y variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible (why?)

Sometimes outliers are data entry errors; other times outliers suggest important new science.
EDA practice
1. Repeat your analysis with and without the outliers.
2. If they have minimal effect on the results, and you can’t figure out why they’re there, it’s reasonable to replace them with missing values, and move on.
3. If they have a substantial effect on your results, you shouldn’t drop them without justification. You’ll need to figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.

Missing values

If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, replace the unusual values with missing values.

diamonds2 <- diamonds %>% 
      mutate(y = ifelse(y < 3 | y > 20, NA, y))

`ifelse()`

Has three arguments.
The first argument test should be a logical vector.
The result will contain the value of the second argument, yes, when test is TRUE, and the value of the third argument, no, when it is false.

Missing values in ggplot2

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point()

## Warning: Removed 9 rows containing missing values (geom_point).

To suppress that warning, set na.rm = TRUE:

ggplot(data = diamonds2, mapping = aes(x = x, y = y)) + 
  geom_point(na.rm = TRUE)

When missing values have some meaning

In nycflights13::flights, missing values in the dep_time variable indicate that the flight was cancelled.

Compare the scheduled departure times for cancelled and non-cancelled times:

nycflights13::flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

This plot isn’t great because there are many more non-cancelled flights than cancelled flights. More on this later.

Covariation

Tendency for the values of two or more variables to vary together in a related way.

cf. Variation describes the behavior within a variable, covariation describes the behavior between variables.

The best way to spot covariation is to visualise the relationship between two or more variables.

A categorical and a continuous variable

The default appearance of geom_freqpoly() is not that useful:

ggplot(data = diamonds, mapping = aes(x = price)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

Instead of displaying count, we’ll display density, or the count standardised so that the area under each frequency polygon is one:

ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) + 
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

It appears that fair diamonds (the lowest quality) have the highest average price! We need a better visualization of the distribution to investigate this observation.

Boxplot

Let’s take a look at the distribution of price by cut using geom_boxplot():

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

It supports the counterintuitive finding that better quality diamonds are cheaper on average! To figure out why is your homework.

`reorder()`

Many categorical variables don’t have an intrinsic order as cut:

# reorder `class` based on the median value of `hwy`
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

This plot makes the trend easier to see than:

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

Two categorical variables

Count the number of observations for each combination.

One way to do that is to rely on the built-in geom_count():

ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

Another approach is to compute the count with dplyr:

diamonds %>% 
  count(color, cut)

Then visualise with geom_tile() and the fill aesthetic:

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

Two continuous variables

Scatterplot:

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

Large-size data, use alpha:

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

Very large dataset, use bin:

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
# library(hexbin)
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

Bin one continuous variable so it acts like a categorical variable:

ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, width=0.1)))

# approximately the same number of points in each bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_number(carat, n=20)))

Patterns and models

Patterns provide clues about relationships

If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:

Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?

A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.

ggplot(data = faithful) + 
  geom_point(mapping = aes(x = eruptions, y = waiting))

Patterns reveal covariation.
Variation creates uncertainty; covariation reduces it.
If two variables covary, you can use the values of one variable to make better predictions about the values of the second.
If the covariation is due to a causal relationship (a special case), then you can use the value of one variable to control the value of the second.

Models extract patterns out of data

In the diamonds data, it is hard to understand the relationship between cut and price, because cut and carat, and carat and price are tightly related.
It is possible to use a model to remove the very strong relationship between price and carat so we can explore the subtleties that remain.

library(modelr)
# model: assume exponential relation between `price` and `carat`
mod <- lm(log(price) ~ log(carat), data = diamonds)

diamonds2 <- diamonds %>% 
  add_residuals(mod) %>% 
  mutate(resid = exp(resid))  # residuals of the model

ggplot(data = diamonds2) + 
  geom_point(mapping = aes(x = carat, y = resid))

Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.

ggplot(data = diamonds2) + 
  geom_boxplot(mapping = aes(x = cut, y = resid))