
Acknowledgment

These lecture notes are based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes, available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.

What is this course about?

Statistics and data science

  • This course introduces some computing skills and software tools for handling potentially big data.

  • Statistics, the science of data analysis, is the applied mathematics of the 21st century.

  • Data is increasing in volume, velocity, and variety.

Classification of data sets by Huber (1994, 1996)

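| Data Size | Bytes             | Storage Mode               |
|-----------|-------------------|----------------------------|
| tiny      | \(10^2\)          | piece of paper             |
| small     | \(10^4\)          | a few pieces of paper      |
| medium    | \(10^6\) (MB)     | a floppy disk              |
| large     | \(10^8\)          | hard disk                  |
| huge      | \(10^9\) (GB)     | hard disk(s)               |
| massive   | \(10^{12}\) (TB)  | hard disk(s); RAID storage |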
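
As a rough illustration of the classification above, the following Python sketch maps a byte count to a size class. The names `classify_size` and `HUBER_CLASSES` are hypothetical, and so is the thresholding rule (pick the largest class whose listed size does not exceed the input): the table gives only representative sizes, not cut-offs.

```python
# Representative sizes (in bytes) for each class, copied from the table above.
# ASSUMPTION: each listed size is treated as a lower bound for its class.
# The table itself does not define cut-offs between classes.
HUBER_CLASSES = [
    (10**2, "tiny"),
    (10**4, "small"),
    (10**6, "medium"),
    (10**8, "large"),
    (10**9, "huge"),
    (10**12, "massive"),
]


def classify_size(n_bytes: int) -> str:
    """Return the size class for a data set of n_bytes bytes."""
    if n_bytes < 0:
        raise ValueError("n_bytes must be non-negative")
    # Anything smaller than the first listed size is still labeled "tiny".
    label = HUBER_CLASSES[0][1]
    for bound, name in HUBER_CLASSES:
        if n_bytes >= bound:
            label = name
    return label


if __name__ == "__main__":
    for n in (500, 3 * 10**6, 5 * 10**9, 2 * 10**12):
        print(n, classify_size(n))
```

Under this rule a 3 MB file counts as "medium" and a 5 GB file as "huge". A different convention, such as rounding to the nearest power of ten, would move the boundaries.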

Four V’s of big data

Source: IBM.

Course description


  • Read the syllabus for a tentative list of topics and course logistics.

References

Huber, Peter J. 1994. “Huge Data Sets.” In COMPSTAT 1994 (Vienna), 3–13. Heidelberg: Physica.

Huber, Peter J. 1996. “Massive Data Sets Workshop: The Morning After.” In Massive Data Sets: Proceedings of a Workshop, 169–84. Washington: National Academy Press.