M1399.000200: Advanced Statistical Computing

What is statistics?

  • Statistics, the science of data analysis, is the applied mathematics of the 21st century.

  • People (scientists, government, health professionals, companies) collect data in order to answer certain questions. The statistician's job is to help them extract knowledge and insights from data.

  • Must-read for students of statistics:

  • If existing software tools readily solve the problem, use them.

  • Often statisticians need to implement their own methods, test new algorithms, or tailor classical methods to new types of data (big, streaming).

  • This entails at least two essential skills: programming and fundamental knowledge of algorithms.

What is this course about?

  • Not a course on statistical packages. It does not answer questions such as "How do I fit a linear mixed model in R, Julia, SAS, SPSS, or Stata?"

  • Not a pure programming course, although programming is important and we do homework in Julia.
    The undergraduate course 326.312 (Statistical Computing and Labs), taught concurrently this semester, focuses on programming in R.

  • Not a course on data science. My previous course 326.621a-2018 (Introduction to Data Science) focused on some software tools for data scientists.

  • This course focuses on algorithms, mostly those in numerical linear algebra and numerical optimization.

  • To quote James Gentle

    The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

  • For a common numerical task in statistics, say solving the least squares problem $$ \widehat \beta = ({\bf X}^T {\bf X})^{-1} {\bf X}^T {\bf y}, $$ we need to know which methods/algorithms are out there and what their advantages and disadvantages are. You will fail this course if you use

    inv(X' * X) * X' * y
    

    Using X \ y in Julia/Matlab (or qr.solve(X, y) in R) is correct but not the purpose of this course. We want to understand what the computer is doing when we call X \ y.
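
To make the least squares point concrete, here is a minimal Julia sketch that computes the same estimate four ways. The simulated data, the problem size (n = 100, p = 5), the seed, and all variable names are assumptions made for illustration only; they do not come from the course material.

```julia
using LinearAlgebra, Random

Random.seed!(123)          # arbitrary seed, for reproducibility only
n, p = 100, 5              # hypothetical problem size
X = randn(n, p)            # simulated design matrix (assumption)
y = randn(n)               # simulated response (assumption)

# (1) Textbook formula: explicitly inverts the Gram matrix X'X.
#     This is the form the notes warn against.
beta_inv = inv(X' * X) * X' * y

# (2) Backslash: for a rectangular X, Julia returns the least squares solution.
beta_bs = X \ y

# (3) The same idea with the QR factorization written out explicitly.
beta_qr = qr(X) \ y

# (4) Cholesky factorization of the normal equations X'X * beta = X'y.
beta_chol = cholesky(Symmetric(X' * X)) \ (X' * y)

# On a well-conditioned problem all four agree up to rounding error.
println(maximum(abs.(beta_inv .- beta_bs)))
println(maximum(abs.(beta_qr .- beta_bs)))
println(maximum(abs.(beta_chol .- beta_bs)))

# Forming X'X squares the 2-norm condition number, which is one reason
# methods (1) and (4) can lose accuracy when X is ill-conditioned.
println(cond(X)^2, "  ", cond(X' * X))
```

On well-conditioned data like this the four answers are numerically indistinguishable, so the example by itself does not show why the explicit inverse is a poor choice. The differences in cost and accuracy show up for large or nearly collinear X.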

Course logistics

Acknowledgment

These lecture notes have evolved from Dr. Hua Zhou's 2019 Winter Statistical Computing course notes, available at http://hua-zhou.github.io/teaching/biostatm280-2019spring/index.html.