About this course

This course covers how to collect, process, analyze, and present data and the subsequent results by means of statistical programming.
Above goals are common with so-called “data science”, whose precise definition is still unclear:

Import: take data stored in a file, database, or web API, and load it into a data frame in R.
Tidy: store data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.
Transform:
- Narrowing in on observations of interest (e.g., all people in one city, or all data from the last year)
- Creating new variables that are functions of existing variables (e.g., computing velocity from speed and time)
- Calculating a set of summary statistics (e.g., counts or means).

Together, tidying and transforming are called wrangling.

Visualisation: good visualisation will show you things that you did not expect, or raise new questions about the data.
- Don’t scale particularly well because they require a human to interpret them.
Models: once you have made your questions sufficiently precise, you can use a model to answer them.
- Fundamentally mathematical or computational tool – generally scale well.

Any real analysis will iterate between them many times.

Communication: It doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
Programming

Surrounding all these tools. Cross-cutting tool that you use in every part of the project.
You don’t need to be an expert programmer to be a data scientist, but learning more about programming pays off.

Organisation of the course

We will closely follow the textbook:

Visualisation and transformation
Wrangling
Programming
Modelling

Following the order of the data science pipeline is not the best way to learn the tools, according to the authors.

Exercises at the end of sections will be assigned as homework, or covered in the labs.
There’s no better way to learn than practicing on real problems.

What you won’t learn

Big data
- We will focus on small, in-memory datasets. You can’t tackle big data unless you have experience with small data.
Python, Julia, and friends
- We will stick to R.
Non-rectangular data
- We focus exclusively on rectangular data, because rectangular data frames are extremely common in science and industry, and they are a great place to start your data science journey.
- We do not cover:
  - images, sounds, trees, and text.
Hypothesis confirmation
- Two data analysis camps: hypothesis generation (exploratory data analysis) vs. hypothesis confirmation (confirmatory analysis, or inference).
- The focus of this course is on hypothesis generation, or data exploration.
- You will learn a lot about hypothesis generation over the course of your statistics major.

Prerequisites

Numerical literacy
Minimal exposure to R (from courses such as 033.019, 033.020, 033.021)

R

Download the latest version of R from Comprehensive R Archive Network, or CRAN. Use the cloud mirror, https://cloud.r-project.org, which automatically figures the closest mirror server out for you.

RStudio

Integrated development environment (IDE) for R programming. Download and install it from http://www.rstudio.com/download. Make sure you have RStudio 1.0.0 or later.

For now, all you need to know is that you type R code in the console pane, and press enter to run it.

Tidyverse

An R package (=collection of functions, data, and documentation that extends the capabilities of base R) whose components share a common philosophy of data and R programming, and are designed to work together naturally.

install.packages("tidyverse")

On your own computer, type that line of code in the console, and then press enter to run it. R will download the packages from CRAN and install them on to your computer.

You will not be able to use the functions, objects, and help files in a package until you load it with library(). Once you have installed a package, you can load it with the library() function:

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.0
## ✓ tidyr   1.1.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

This tells you that tidyverse is loading the ggplot2, tibble, tidyr, readr, purrr, and dplyr packages. These are considered to be the core of the tidyverse because you’ll use them in almost every analysis.

Packages in the tidyverse change fairly frequently. You can see if updates are available, and optionally install them, by running tidyverse_update().

Other packages

In this course we’ll use three data packages from outside the tidyverse:

install.packages(c("nycflights13", "gapminder", "Lahman"))

Running R code

A very simple R code:

> 1 + 2
[1] 3

Notation:

Functions are in a code font and followed by parentheses, like sum(), or mean().
Other R objects (like data or function arguments) are in a code font, without parentheses, like flights or x.
If we want to make it clear what package an object comes from, we’ll use the package name followed by two colons, like dplyr::mutate(), or
nycflights13::flights. This is also valid R code.

Getting help and learning more

Google

Add “R” to a query to restrict the query to relevant results.
Particularly useful for error messages: if you get an error message and you have no idea what it means, chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.
If the error message isn’t in English, run Sys.setenv(LANGUAGE = "en") and re-run the code; you’re more likely to find help for English error messages.

Stackoverflow

Start by spending a little time searching for an existing answer, including [R] to restrict your search to questions and answers that use R.
If you don’t find anything useful, prepare a minimal reproducible example or reprex.
4 things to make your example reproducible: required packages, data, and code.
1. Use utils::sessionInfo() or devtools::session_info() to reveal to version of R and the platform.
2. Packages should be loaded at the top of the script.
3. Include data by using base::dput(). Find the smallest subset of your data that still reveals the problem.
4. Spend a little bit of time ensuring that your code is easy for others to read. Do your best to remove everything that is not related to the problem.

Blogs

RStudio blog. This is where we post announcements about new packages, new IDE features, and in-person courses.
http://www.r-bloggers.com aggregates over 500 blogs about R from around the world.

Twitter

Follow the #rstats hashtag.
Follow the authors: Hadley Wickham (@hadleywickham), Garrett Grolemund (@statgarrett)

Lecture 1: Introduction

Joong-Ho Won @ SNU