Goal of EDA is to develop an understanding of your data.
EDA is an iterative cycle:
Generate questions about your data.
Search for answers by visualising, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.
During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
Each new question that you ask will expose you to a new aspect of your data and increase your chance of making a discovery.
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
A variable is a quantity, quality, or property that you can measure.
A value is the state of a variable when you measure it. The value of a variable may change from measurement to measurement.
An observation (data point) is a set of measurements made under similar conditions (you usually make all of the measurements in an observation at the same time and on the same object). An observation will contain several values, each associated with a different variable.
Tabular data is a set of values, each associated with a variable and an observation.
Tendency of the values of a variable to change from measurement to measurement.
If you measure any continuous variable twice, you will get two different results.
Each of your measurements will include a small amount of error that varies from measurement to measurement.
Categorical variables can also vary if you measure across
Every variable has its own pattern of variation, which can reveal interesting information. The best way to understand that pattern is to visualise the distribution of the variable’s values.
Categorical variables – bar chart:
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
diamonds %>% dplyr::count(cut)
Continuous variables (e.g., numbers, date-times) – histogram:
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
diamonds %>% count(ggplot2::cut_width(carat, 0.5))
The variable is on the \(x\)-axis. The numbers of observation that falls into each of the equally spaced bins is on the \(y\)-axis.
Set binwidth:
smaller <- diamonds %>%
filter(carat < 3) # zoom into just the diamonds with a size of less than three carats
ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.1)
ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
geom_freqpoly(binwidth = 0.1) # use geom_freqploy() instead of geom_histogram()
Turning bar charts or histograms into questions:
Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?
ggplot(data = smaller, mapping = aes(x = carat)) +
geom_histogram(binwidth = 0.01)
Why are there more diamonds at whole carats and common fractions of carats?
Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?
Why are there no diamonds bigger than 3 carats?
Clusters of similar values suggest that subgroups exist in your data. To understand the subgroups, ask:
How are the observations within each cluster similar to each other?
How are the observations in separate clusters different from each other?
How can you explain or describe the clusters?
Why might the appearance of clusters be misleading?
Length (in minutes) of 272 eruptions of the Old Faithful Geyser in Yellowstone National Park:
ggplot(data = faithful, mapping = aes(x = eruptions)) +
geom_histogram(binwidth = 0.25)
Eruption times appear to be clustered into two groups: there are short eruptions (of around 2 minutes) and long eruptions (4-5 minutes), but little in between.
Why are the limits on the x-axis are so wide?
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5)
There are so many observations in the common bins that the rare bins are so short that you can’t see them (although maybe if you stare intently at 0 you’ll spot something). To make it easy to see the unusual values,
Let’s zoom to small values of the y-axis with coord_cartesian()
:
ggplot(diamonds) +
geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
coord_cartesian(ylim = c(0, 50))
This allows us to see that there are three unusual values: 0, ~30, and ~60. We pluck them out with dplyr:
unusual <- diamonds %>%
filter(y < 3 | y > 20) %>%
select(price, x, y, z) %>%
arrange(y)
unusual
The y
variable measures one of the three dimensions of these diamonds, in mm. We know that diamonds can’t have a width of 0mm, so these values must be incorrect. We might also suspect that measurements of 32mm and 59mm are implausible (why?)
Sometimes outliers are data entry errors; other times outliers suggest important new science.
EDA practice
If you’ve encountered unusual values in your dataset, and simply want to move on to the rest of your analysis, replace the unusual values with missing values.
diamonds2 <- diamonds %>%
mutate(y = ifelse(y < 3 | y > 20, NA, y))
ifelse()
test
should be a logical vector.yes
, when test
is TRUE
, and the value of the third argument, no
, when it is false.ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point()
## Warning: Removed 9 rows containing missing values (geom_point).
To suppress that warning, set na.rm = TRUE
:
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
geom_point(na.rm = TRUE)
In nycflights13::flights
, missing values in the dep_time
variable indicate that the flight was cancelled.
Compare the scheduled departure times for cancelled and non-cancelled times:
nycflights13::flights %>%
mutate(
cancelled = is.na(dep_time),
sched_hour = sched_dep_time %/% 100,
sched_min = sched_dep_time %% 100,
sched_dep_time = sched_hour + sched_min / 60
) %>%
ggplot(mapping = aes(sched_dep_time)) +
geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)
This plot isn’t great because there are many more non-cancelled flights than cancelled flights. More on this later.
Tendency for the values of two or more variables to vary together in a related way.
cf. Variation describes the behavior within a variable, covariation describes the behavior between variables.
The best way to spot covariation is to visualise the relationship between two or more variables.
The default appearance of geom_freqpoly()
is not that useful:
ggplot(data = diamonds, mapping = aes(x = price)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
Instead of displaying count, we’ll display density, or the count standardised so that the area under each frequency polygon is one:
ggplot(data = diamonds, mapping = aes(x = price, y = ..density..)) +
geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
It appears that fair diamonds (the lowest quality) have the highest average price! We need a better visualization of the distribution to investigate this observation.
Let’s take a look at the distribution of price by cut using geom_boxplot()
:
ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
geom_boxplot()
It supports the counterintuitive finding that better quality diamonds are cheaper on average! To figure out why is your homework.
reorder()
Many categorical variables don’t have an intrinsic order as cut
:
# reorder `class` based on the median value of `hwy`
ggplot(data = mpg) +
geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
This plot makes the trend easier to see than:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
Count the number of observations for each combination.
One way to do that is to rely on the built-in geom_count()
:
ggplot(data = diamonds) +
geom_count(mapping = aes(x = cut, y = color))
Another approach is to compute the count with dplyr:
diamonds %>%
count(color, cut)
Then visualise with geom_tile()
and the fill aesthetic:
diamonds %>%
count(color, cut) %>%
ggplot(mapping = aes(x = color, y = cut)) +
geom_tile(mapping = aes(fill = n))
Scatterplot:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price))
Large-size data, use alpha
:
ggplot(data = diamonds) +
geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
Very large dataset, use bin:
ggplot(data = smaller) +
geom_bin2d(mapping = aes(x = carat, y = price))
# install.packages("hexbin")
# library(hexbin)
ggplot(data = smaller) +
geom_hex(mapping = aes(x = carat, y = price))
Bin one continuous variable so it acts like a categorical variable:
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_width(carat, width=0.1)))
# approximately the same number of points in each bin
ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
geom_boxplot(mapping = aes(group = cut_number(carat, n=20)))
If a systematic relationship exists between two variables it will appear as a pattern in the data. If you spot a pattern, ask yourself:
Could this pattern be due to coincidence (i.e. random chance)?
How can you describe the relationship implied by the pattern?
How strong is the relationship implied by the pattern?
What other variables might affect the relationship?
Does the relationship change if you look at individual subgroups of the data?
A scatterplot of Old Faithful eruption lengths versus the wait time between eruptions shows a pattern: longer wait times are associated with longer eruptions. The scatterplot also displays the two clusters that we noticed above.
ggplot(data = faithful) +
geom_point(mapping = aes(x = eruptions, y = waiting))
library(modelr)
# model: assume exponential relation between `price` and `carat`
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
add_residuals(mod) %>%
mutate(resid = exp(resid)) # residuals of the model
ggplot(data = diamonds2) +
geom_point(mapping = aes(x = carat, y = resid))
Once you’ve removed the strong relationship between carat and price, you can see what you expect in the relationship between cut and price: relative to their size, better quality diamonds are more expensive.
ggplot(data = diamonds2) +
geom_boxplot(mapping = aes(x = cut, y = resid))
More on models in Part IV.