This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.
Using a collection of remote servers for computation, data storage/manipulation, etc.
Pay for clock-cycles, storage, and network flow rather than hardware.
Scalability!
Adapt to fluctuating demand:
Websites with fluctuating traffic
Large corporations use much more computing during business hours than outside them
Efficiency
Pay for what you need
No need for hardware maintenance
Less waiting for fixed compute-time jobs
Our computational demands often fluctuate dramatically.
There are many vendors out there. Good for customers (us). They all work similarly.
Amazon Web Services (AWS).
Google Cloud Platform (GCP).
Microsoft Azure.
IBM Cloud.
We will demonstrate how to start using GCP.
Set up GCP account.
Configure and launch VM instance(s).
Set up connection (SSH key).
Install the software you need.
Run your jobs.
Transfer results.
Terminate instance(s).
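The last two steps, transferring results and terminating the instance, are not spelled out elsewhere in this note. The following is a minimal sketch of how they might look from your own machine, assuming the Google Cloud SDK (`gcloud`) is installed locally. The instance name `stat-vm`, the zone, and the remote `results` directory are placeholders for illustration, not values from the course. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: INSTANCE, ZONE, REMOTE and the "results" folder are placeholders.
INSTANCE="${INSTANCE:-stat-vm}"
ZONE="${ZONE:-asia-northeast1-b}"
REMOTE="${REMOTE:-johann_won@35.187.203.86}"   # user@external-ip, as in this note's examples

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Transfer results: copy a remote directory to the current local directory.
fetch_results() {
  run scp -r "$REMOTE:results" ./results
}

# Terminate: "stop" keeps the disk (storage is still billed); "delete" removes the instance.
stop_instance()   { run gcloud compute instances stop "$INSTANCE" --zone="$ZONE"; }
delete_instance() { run gcloud compute instances delete "$INSTANCE" --zone="$ZONE" --quiet; }

fetch_results
stop_instance
```

Call `delete_instance` only once nothing on the boot disk is needed any more.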
Go to https://cloud.google.com.
Make sure you sign in using your Google account. You may want to sign out of your existing Google accounts and create a new one before the next steps.
If you click Try it free at https://cloud.google.com and fill out the requisite information, you will get a $300 credit which expires in 1 year. You do not need to use it for this course. But it’s better to claim the credit before redeeming the coupon below.
You will be asked for a name and email address.
After email verification and redeeming the coupon, a project My First Project is created in GCP.
GCP free trial:
Some resources are always free: 1 f1-micro VM instance, 30GB of standard persistent disk storage, etc.
General pricing can be found on this page.
Go to Compute Engine, click CREATE INSTANCE.
Give a meaningful name, e.g., snustat-326-621a-2018 (instance names may contain only lowercase letters, digits, and hyphens).
Choose the asia-northeast1-b zone.
Machine type: 2 vCPUs, 7.5GB memory should suffice for this course.
Boot disk: CentOS 7, standard persistent disk (or SSD) 25GB should suffice for this course.
These settings can be changed anytime. Typical paradigm: develop code using an inexpensive machine type and switch to a powerful one when running computation intensive tasks.
Click Create.
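The console steps above can also be scripted. Here is a rough sketch using the `gcloud` command-line tool, assuming it is installed and authenticated on your own machine. The instance name `stat-vm` and the zone are placeholders. `n1-standard-2` is the machine type with 2 vCPUs and 7.5GB memory. The `centos-7` image family in the `centos-cloud` project is an assumption that may no longer hold, since CentOS 7 has reached end of life. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: INSTANCE and ZONE are placeholders, not values from the course.
INSTANCE="${INSTANCE:-stat-vm}"
ZONE="${ZONE:-asia-northeast1-b}"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Create a VM matching the settings chosen in the console walkthrough.
create_instance() {
  run gcloud compute instances create "$INSTANCE" \
    --zone="$ZONE" \
    --machine-type=n1-standard-2 \
    --image-family=centos-7 \
    --image-project=centos-cloud \
    --boot-disk-size=25GB \
    --boot-disk-type=pd-standard
}

# Switch to a different machine type; the instance must be stopped first.
resize_instance() {
  new_type="$1"
  run gcloud compute instances stop "$INSTANCE" --zone="$ZONE"
  run gcloud compute instances set-machine-type "$INSTANCE" \
    --zone="$ZONE" --machine-type="$new_type"
  run gcloud compute instances start "$INSTANCE" --zone="$ZONE"
}

create_instance
```

For example, `resize_instance n1-standard-8` would follow the develop-cheap, compute-big paradigm by moving the same disk to a larger machine for a heavy job.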
At the VM Instances page, you can see a list of all instances in your project and their IP addresses. We use the external IP address, e.g., 35.187.203.86, for SSH connection.
Note that if we stop the instance and start it again, the external IP address may change. To keep a fixed external IP address, go to VPC network, then External IP addresses, and make the desired external IP address static (vs. ephemeral). Note that you will be charged for a static IP address that sits idle, i.e., one that no instance is using.
There are several ways to connect to the VM instance you just created. Most often we want to be able to SSH into the VM instance from other machines, e.g., your own laptop. By default, a VM instance accepts only key authentication, so it’s necessary to set up an SSH key first.
Clicking the SSH button on the VM Instances page opens a terminal in the browser, logged in as the super user, e.g., johann_won (your Gmail account name).
In that browser terminal, you can set up your own SSH keys:
cd
mkdir .ssh
chmod go-rx .ssh/
cd .ssh
vi authorized_keys
Copy your public key into authorized_keys and restrict its permissions:
chmod go-rwx authorized_keys
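The manual steps above can be wrapped in a small helper. This is only a sketch: the function name `install_pubkey` is made up for illustration, and it sets absolute modes (700 for the folder, 600 for the file), which is what the `chmod go-...` commands above produce under the usual default umask.

```shell
#!/bin/sh
# install_pubkey HOME_DIR "PUBLIC_KEY_LINE"   (hypothetical helper, not part of OpenSSH)
install_pubkey() {
  ssh_dir="$1/.ssh"
  key="$2"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"                      # owner-only access to the folder
  touch "$ssh_dir/authorized_keys"
  # Append the key only if that exact line is not already present.
  grep -qxF -- "$key" "$ssh_dir/authorized_keys" ||
    printf '%s\n' "$key" >> "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"      # owner-only access to the key list
}

# Usage on the VM, pasting the contents of your laptop's public key file:
#   install_pubkey "$HOME" 'ssh-rsa <your-public-key> you@laptop'
```

Running it twice with the same key leaves a single entry, so it is safe to repeat.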
If you don’t have a public key, you may want to try password authentication. To do this, edit the /etc/ssh/sshd_config file by
sudo vi /etc/ssh/sshd_config
and uncomment the line
#PasswordAuthentication yes
Don’t forget to comment out the line
PasswordAuthentication no
Then, add password to your account by
sudo passwd YOUR_GOOGLE_ACCOUNT_NAME ## replace with your account name
Finally, restart the ssh daemon by
sudo service sshd restart
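If you prefer not to edit the file by hand in vi, the same change can be sketched with sed. The helper name `enable_password_auth` is hypothetical. It rewrites every `PasswordAuthentication yes|no` line, commented or not, to `PasswordAuthentication yes`, and keeps a backup next to the original. Duplicate lines are harmless because sshd uses the first value it finds for a keyword.

```shell
#!/bin/sh
# enable_password_auth CONFIG_FILE   (hypothetical helper; run as root for /etc/ssh/sshd_config)
enable_password_auth() {
  cfg="$1"
  cp "$cfg" "$cfg.bak"    # keep a backup in case something goes wrong
  # Match "#PasswordAuthentication yes", "PasswordAuthentication no", etc.
  sed -E 's/^[#[:space:]]*PasswordAuthentication[[:space:]]+(yes|no)[[:space:]]*$/PasswordAuthentication yes/' \
    "$cfg.bak" > "$cfg"
}

# Usage on the VM (as root):
#   enable_password_auth /etc/ssh/sshd_config
#   sshd -t && service sshd restart   # "sshd -t" checks the config before restarting
```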
Now you can connect from your own machine:
ssh YOUR_GOOGLE_ACCOUNT_NAME@35.187.203.86 ## replace with your account name and your instance's external IP
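To avoid retyping the user name and IP address, you could register the VM under a short alias in your local SSH client configuration. The helper below is a sketch. The function name `add_ssh_host` and the alias `gcp-vm` are made up for illustration, while `Host`, `HostName`, and `User` are standard OpenSSH client options.

```shell
#!/bin/sh
# add_ssh_host ALIAS IP USER [CONFIG_FILE]   (hypothetical helper)
# Appends a Host entry; it does not check for an existing entry with the same alias.
add_ssh_host() {
  alias_name="$1"
  host_ip="$2"
  user="$3"
  cfg="${4:-$HOME/.ssh/config}"
  mkdir -p "$(dirname "$cfg")"
  cat >> "$cfg" <<EOF
Host $alias_name
    HostName $host_ip
    User $user
EOF
  chmod 600 "$cfg"   # ssh expects the config to be writable only by its owner
}

# Usage on your laptop:
#   add_ssh_host gcp-vm 35.187.203.86 johann_won
#   ssh gcp-vm
```

If the external IP address is ephemeral, the `HostName` entry needs updating whenever the instance restarts with a new address.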
yum is the default package management tool on CentOS. Most software can be installed via sudo yum. sudo executes a command as a superuser (or root).
sudo yum install epel-release -y
sudo yum update -y
sudo yum install R -y
wget is a command-line tool for downloading files from the internet.
sudo yum install wget -y
wget https://download2.rstudio.org/rstudio-server-rhel-1.1.463-x86_64.rpm
sudo yum install rstudio-server-rhel-1.1.463-x86_64.rpm
rm rstudio-server-rhel-1.1.463-x86_64.rpm
sudo systemctl status rstudio-server.service
By default, port 8787, which RStudio Server uses, is blocked by the VM firewall. In the GCP console, go to VPC network, then Firewall rules, create a rule for RStudio Server (tcp: 8787), and apply that rule to your VM instance.
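The same firewall rule can be sketched on the command line, assuming `gcloud` is installed and authenticated on your own machine. The rule name `allow-rstudio`, the network tag `rstudio`, the instance name, and the zone are all placeholders. Commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: rule name, tag, INSTANCE and ZONE are placeholders.
INSTANCE="${INSTANCE:-stat-vm}"
ZONE="${ZONE:-asia-northeast1-b}"

# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

open_rstudio_port() {
  # Allow inbound TCP 8787 to every instance carrying the "rstudio" tag.
  # 0.0.0.0/0 admits any source; narrowing it to your own IP is safer.
  run gcloud compute firewall-rules create allow-rstudio \
    --allow=tcp:8787 \
    --target-tags=rstudio \
    --source-ranges=0.0.0.0/0
  # Apply the rule to the VM by attaching the tag.
  run gcloud compute instances add-tags "$INSTANCE" --zone="$ZONE" --tags=rstudio
}

open_rstudio_port
```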
Now you should be able to access RStudio Server on the VM instance by pointing your browser to http://35.187.203.86:8787.
Key authentication suffices for most applications.
Unfortunately, RStudio Server (open source edition) does not support key authentication. This means that if you want to use RStudio Server on the VM instance, you need to enable username/password authentication.
As super user, e.g., johann_won, you can create a regular user, say wonj:
sudo useradd -m wonj
The -m option creates the home folder /home/wonj. Then set a password for the new user:
sudo passwd wonj
Now you should be able to log in to RStudio Server from a browser at http://35.187.203.86:8787 using the username wonj and the corresponding password.
To SSH into the VM instance as the regular user wonj, you need to set up an SSH key for that user (in the same way as for the super user).
If you want to make the regular user a sudoer, add it to the wheel group:
su - johann_won
sudo usermod -aG wheel wonj
su - wonj
Installing R packages often fails because certain Linux libraries are absent.
Pay attention to the error messages, and install those libraries using yum.
E.g., trying to install tidyverse may yield the following errors:
ERROR: dependencies ‘httr’, ‘rvest’, ‘xml2’ are not available for package ‘tidyverse’
* removing ‘/home/wonj/R/x86_64-redhat-linux-gnu-library/3.4/tidyverse’
You can install the Linux libraries these packages depend on (curl, openssl, and libxml2) by:
sudo yum install curl curl-devel -y
sudo yum install openssl openssl-devel -y
sudo yum install libxml2 libxml2-devel -y
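The step from an R error message to a yum command can be sketched as a small lookup. The function name `yum_pkgs_for` is hypothetical, and the table only covers the packages named in the error above. It relies on httr using the R packages curl and openssl, and on rvest using httr and xml2. Other R packages would need their own entries.

```shell
#!/bin/sh
# yum_pkgs_for R_PACKAGE...   (hypothetical helper; the mapping is illustrative, not exhaustive)
# Prints the development libraries to install, de-duplicated, on one line.
yum_pkgs_for() {
  for r_pkg in "$@"; do
    case "$r_pkg" in
      curl)    echo "curl-devel" ;;
      openssl) echo "openssl-devel" ;;
      xml2)    echo "libxml2-devel" ;;
      httr)    echo "curl-devel"; echo "openssl-devel" ;;
      rvest)   echo "curl-devel"; echo "openssl-devel"; echo "libxml2-devel" ;;
      *)       echo "no mapping for R package: $r_pkg" >&2 ;;
    esac
  done | sort -u | tr '\n' ' ' | sed 's/ $//'
}

# Usage on the VM:
#   sudo yum install -y $(yum_pkgs_for httr rvest xml2)
```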
sudo yum install git -y
For smooth Gitting, put the private key that matches the public key registered in your GitHub account in the ~/.ssh folder on the VM instance.
Now you can git clone any repo to the VM instance to start working on a project. E.g.,
git clone https://github.com/snu-stat/stat326_621a-2018-hw2
Now you have R and RStudio on the VM instance.
The simplest way to synchronize your project files across machines is Git, e.g.,
git clone https://github.com/snu-stat/stat326_621a-2018-hw2
Set up and run your jobs as usual.
You can check CPU usage on the GCP console.
You can set up a notification for when CPU usage falls below a threshold (so you know the job is done).
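Long jobs started over SSH normally stop when the connection drops. One common way around this, sketched below, is to launch the job with nohup in the background and send its output to a log file. The helper name `run_detached` and the file names in the usage lines are placeholders.

```shell
#!/bin/sh
# run_detached LOGFILE COMMAND [ARGS...]   (hypothetical helper)
# Starts COMMAND in the background, immune to hangups, and prints its process ID.
run_detached() {
  log="$1"
  shift
  nohup "$@" > "$log" 2>&1 &
  echo "$!"
}

# Usage on the VM (sim.R and sim.log are placeholder names):
#   run_detached sim.log Rscript sim.R
#   tail -f sim.log        # follow progress; safe to log out and come back later
```

The printed process ID can be passed to `kill` if the job needs to be stopped early.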
Using the cloud (AWS, Azure, GCP, …) is easy.
Easy to launch cluster instances or other heavily customized instances (SQL server, BigQuery, ML engine, Genomics, …).
Massive computing at your fingertips.