This course covers how to collect, process, analyze, and present data and the subsequent results by means of statistical programming.
These goals are shared with so-called “data science”, whose precise definition is still unclear. Its typical workflow consists of the following steps:
Import: take data stored in a file, database, or web API, and load it into a data frame in R.
Tidy: store data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.
Transform:
Communication: It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
Programming
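As a rough illustration of the import and tidy steps listed above, the following sketch uses only base R. The file contents, the column names, and the object names `wide` and `tidy` are hypothetical and chosen purely for illustration; the course itself relies on tidyverse tools for these steps.

```r
# Import: write a small, made-up CSV to a temporary file, then load it
# into a data frame (stand-in for a real file, database, or web API).
path <- tempfile(fileext = ".csv")
writeLines(c("country,1999,2000", "A,10,15", "B,20,25"), path)
wide <- read.csv(path, check.names = FALSE, stringsAsFactors = FALSE)

# In `wide`, the years are spread across column names, so a single row
# holds two observations. This is not tidy.

# Tidy: one row per (country, year) observation, one column per variable.
tidy <- data.frame(
  country = rep(wide$country, times = 2),
  year    = rep(c(1999, 2000), each = nrow(wide)),
  cases   = c(wide[["1999"]], wide[["2000"]])
)
tidy
```

In the tidy version, `country`, `year`, and `cases` are each a variable in their own column, and every row is one observation.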
We will closely follow the textbook:
Following the order of the data science pipeline is not the best way to learn the tools, according to the authors.
Exercises at the end of sections will be assigned as homework, or covered in the labs.
There’s no better way to learn than practicing on real problems.
We do not cover:
Big data
Python, Julia, and friends
Non-rectangular data
We focus exclusively on rectangular data, because rectangular data frames are extremely common in science and industry, and they are a great place to start your data science journey.
Hypothesis confirmation
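To make the distinction between rectangular and non-rectangular data concrete, here is a small base R sketch. The names `rect` and `nonrect` and their contents are made up for illustration only.

```r
# Rectangular: every row has the same set of columns, so the data fits
# naturally into a data frame.
rect <- data.frame(
  name = c("Ann", "Bob"),
  age  = c(31, 45)
)
dim(rect)  # number of rows, number of columns

# Non-rectangular: records hold varying numbers of values, so they do not
# fit a fixed grid of rows and columns without further reshaping.
nonrect <- list(
  list(name = "Ann", pets = c("cat", "dog")),
  list(name = "Bob", pets = character(0))
)
pets_per_person <- lengths(lapply(nonrect, `[[`, "pets"))
pets_per_person
```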
Download the latest version of R from the Comprehensive R Archive Network (CRAN). Use the cloud mirror, https://cloud.r-project.org, which automatically picks the closest mirror server for you.
RStudio is an integrated development environment (IDE) for R programming. Download and install it from http://www.rstudio.com/download. Make sure you have RStudio 1.0.0 or later.
For now, all you need to know is that you type R code in the console pane, and press enter to run it.
The tidyverse is an R package (a collection of functions, data, and documentation that extends the capabilities of base R) whose components share a common philosophy of data and R programming and are designed to work together naturally. Install it with a single line of code:
install.packages("tidyverse")
On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them onto your computer.
You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis. The Conflicts section reports that dplyr::filter() and dplyr::lag() mask the functions of the same name in the stats package.
Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().
In this course we’ll use three data packages from outside the tidyverse:
install.packages(c("nycflights13", "gapminder", "Lahman"))
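Before loading anything, it can help to check which of these packages are already installed. The sketch below is one possible way to do that with base R; the object names `needed`, `installed`, and `missing_pkgs` are illustrative, and the actual installation call is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this course
needed <- c("tidyverse", "nycflights13", "gapminder", "Lahman")

# requireNamespace() returns TRUE if a package is available, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Report whatever is still missing
missing_pkgs <- needed[!installed]
if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
} else {
  message("All course packages are installed.")
}
```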
A very simple piece of R code:
> 1 + 2
[1] 3
Notation:
Functions are in a code font and followed by parentheses, like sum(), or mean().
Other R objects (like data or function arguments) are in a code font, without parentheses, like flights or x.
If we want to make it clear what package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate(), or nycflights13::flights. This is also valid R code.
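The package::object form can be tried out with packages that ship with R, so no installation is needed. The following is a minimal sketch; the object names `first_rows` and `moving_avg` are made up for illustration.

```r
# A function referred to by package: stats ships with every R installation
stats::median(c(1, 5, 9))

# A data object referred to by package: mtcars lives in the datasets package
first_rows <- head(datasets::mtcars, 3)
first_rows

# The explicit form still reaches stats::filter() even when another
# package (such as dplyr) masks the short name filter().
# Here it computes a centered 3-point moving average.
moving_avg <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(moving_avg)
```

Writing the package name explicitly is mainly useful when two loaded packages define a function with the same name, as in the Conflicts message printed by library(tidyverse).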
If an error message is not in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re more likely to find help for English error messages.
Include [R] in your search to restrict it to questions and answers that use R.
Use utils::sessionInfo() or devtools::session_info() to reveal the version of R and the platform.
Packages should be loaded at the top of the script.
Include data by using base::dput(). Find the smallest subset of your data that still reveals the problem.
Spend a little bit of time ensuring that your code is easy for others to read. Do your best to remove everything that is not related to the problem.
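The points above about sharing data and session details can be sketched in base R as follows. The built-in mtcars dataset stands in for "your data", and the names `small`, `dput_text`, `recreated`, and `info` are hypothetical.

```r
# Smallest subset of the data that still shows the problem
# (mtcars is only a stand-in for your own data)
small <- head(mtcars[, c("mpg", "cyl")], 3)

# dput() prints R code that recreates the object; paste that code
# into your question so others can rebuild the data
dput(small)

# Sanity check: evaluating the dput() text gives back the same data
dput_text <- paste(capture.output(dput(small)), collapse = "\n")
recreated <- eval(parse(text = dput_text))
all.equal(small, recreated)

# Version of R and platform, to include alongside the example
info <- utils::sessionInfo()
info$R.version$version.string
info$platform
```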
Follow the #rstats hashtag.