Acknowledgment

This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.

Hadoop ecosystem for big data

Hadoop, HDFS, MapReduce, YARN, Spark, Hive, Pig, …

Tutorial

In this tutorial, we learn to:

It is adapted from the tutorial at RStudio on setting up YARN cluster on AWS. GCP Dataproc is the equivalent of EMR in AWS.

Set up a YARN cluster in GCP

In GCP console, click Dataproc -> Clusters to create a YARN cluster.

Settings for the YARN cluster are mostly self-explanatory:

Connect to the YARN manager

Click the SSH button next to the master node brings up a browser terminal that logs in the YARN master node as the (super)user YOUR_GOOGLE_ACCOUNT_NAME:

Create a user

Create a user called rstudio-user that will perform the data analysis. Create a user directory for rstudio-user on HDFS with the hadoop fs command.

# Make User
sudo useradd -m rstudio-user
sudo passwd rstudio-user
# change login shell to bash permanently
sudo chsh -s /bin/bash rstudio-user

# Create new directory in hdfs
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user

See Hadoop FS reference for a list of hadoop fs commands.

Download the flights data

The flights data is a well known data source representing 123 million flights over 22 years. It consumes roughly 12 GiB of storage in uncompressed CSV format in yearly files.

Switch user. For data loading and analysis, make sure you are logged in as regular user.

# switch user
su - rstudio-user
# change to bash shell 
#exec bash -l
# change login shell to bash permanently
#chsh -s /bin/bash rstudio-user

Run the following script to download data from the web onto your master node. This script downloads the yearly flight data and the airlines lookup table.

# Make download directory
mkdir /tmp/flights

# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS

# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

# Download flight data by year
for i in {1987..2008}
  do
    echo "$(date) $i Download"
    fnam=$i.csv.bz2
    wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
    echo "$(date) $i Unzip"
    bunzip2 /tmp/flights/$fnam
  done

Save the script as download.sh and do

chmod +x download.sh
./download.sh

Peek into the data:

head /tmp/flights/1987.csv 
head /tmp/airlines.csv
head /tmp/airports.csv

Distribute into HDFS

Copy data into HDFS using the hadoop fs command.

# Copy flight data to HDFS
hadoop fs -mkdir /user/rstudio-user/flights/
hadoop fs -put /tmp/flights /user/rstudio-user/

# Copy airline data to HDFS
hadoop fs -mkdir /user/rstudio-user/airlines/
hadoop fs -put /tmp/airlines.csv /user/rstudio-user/airlines

# Copy airport data to HDFS
hadoop fs -mkdir /user/rstudio-user/airports/
hadoop fs -put /tmp/airports.csv /user/rstudio-user/airports

Create Hive tables

Launch Hive from the command line.

# Open Hive prompt
hive

Create the metadata that will structure the flights table. Load data into the Hive table.

# Create metadata for flights
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int, 
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string, 
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int, 
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/flights' INTO TABLE flights;

These hive commands will move the data from /user/rstudio-user/flights to /user/hive/warehouse/flights:

hadoop fs -ls /user/rstudio-user/flights
hadoop fs -ls /user/hive/warehouse/flights

Create the metadata that will structure the airlines table. Load data into the Hive table.

# Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar"     = '\"'
)
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airlines' INTO TABLE airlines;

These hive commands will move the data from /user/rstudio-user/airlines to /user/hive/warehouse/airlines.

hadoop fs -ls /user/rstudio-user/airlines
hadoop fs -ls /user/hive/warehouse/airlines

Create the metadata that will structure the airports table. Load data into the Hive table.

# Create metadata for airports
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar"     = '\"'
)
STORED AS TEXTFILE;

# Load data into table
LOAD DATA INPATH '/user/rstudio-user/airports' INTO TABLE airports;

These hive commands will move the data from /user/rstudio-user/airports to /user/hive/warehouse/airports.

hadoop fs -ls /user/rstudio-user/airports
hadoop fs -ls /user/hive/warehouse/airport

See Hive language reference for Hive syntax.

Install R/RStudio Server on master node (Debian)

apt-get is the package manager on Debian system.

Install dependencies:

# Update
sudo apt-get update
# used for devtools
sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev

Update R on master node:

sudo apt-get install r-base r-base-dev

Install commonly used R packages so they’re globally available to all users:

sudo R -e "install.packages(c('tidyverse', 'DBI', 'sparklyr', 'caTools', 'bitops'), repos='https://cran.rstudio.com/')"

Install RStudio Server on Debian 9:

#  simple tool to install deb files
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-stretch-1.1.463-amd64.deb
sudo gdebi rstudio-server-stretch-1.1.463-amd64.deb

RStudio Server will be automatically started upon successful installation.

By default, port 8787 used by R Studio Server is blocked by VM firewall. On GCP console, go to VPC network and then Firewall rules, create a rule for R Studio Server (tcp: 8787), apply that rule to the YARN master node.

Update To increase security, in VPC Networks > Firewall rules, find a rule associated with port 8787, set Source filter as “IP ranges”, and set Source IP ranges as 147.46.0.0/16 and 147.47.0.0/16 in order to allow connections from the SNU network only.

Now you should be able to access R Studio Server on YARN master node by pointing your browser to the IP address of your master node, say http://35.187.203.86:8787.

Establish regular accounts

  • As before, create a regular user, say wonj:
sudo useradd -m wonj

The -m option creates the home folder /home/wonj.

  • You can set password for a user by
sudo passwd wonj
  • If you want to enable the regular user as a sudoer, add it into the sudo group:
sudo usermod -aG sudo wonj
sudo chsh -s /bin/bash wonj
su - wonj

To connect to Rtudio Server, you need password authentication. As before, edit /etc/ssh/sshd_config file by

sudo vi /etc/ssh/sshd_config

and uncomment the line

#PasswordAuthentication yes 

Don’t forget to comment out the line

PasswordAuthentication no 

Restart the ssh daemon by

sudo service ssh restart

Now you should be able to log in using the username rstudio-user and the password you created.

  • Update To increase security, in VPC Networks > Firewall rules, find a rule associated with SSH (port 22), set Source filter as “IP ranges”, and set Source IP ranges as 147.46.0.0/16 and 147.47.0.0/16 in order to allow connections from the SNU network only.

YARN management survival commands

  • We can check the status of YARN cluster at port 8088: e.g., http://35.187.203.86:8088.
  • Remember to unblock the firewall
  • Update To increase security, in VPC Networks > Firewall rules, find a rule associated with port 8088, set Source filter as “IP ranges”, and set Source IP ranges as 147.46.0.0/16 and 147.47.0.0/16 in order to allow connections from the SNU network only.
# list hadoop jobs
yarn application -list
# kill hadoop job
yarn application -kill APPLICATION_ID