This lecture note is based on Dr. Hua Zhou’s 2018 Winter Statistical Computing course notes available at http://hua-zhou.github.io/teaching/biostatm280-2018winter/index.html.
Hadoop, HDFS, MapReduce, YARN, Spark, Hive, Pig, …
Hadoop is the leading open-source software framework developed for scalable, reliable and distributed computing. It implements HDFS (Hadoop’s distributed file system), which facilitates the storage, management and rapid analysis of vast datasets across distributed clusters of servers.
Hadoop MapReduce is a programming model for processing and generating large data sets. YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource manager and job scheduler, and is the successor to the resource-management machinery of classic Hadoop MapReduce.
Hive was designed to automatically translate SQL-like queries into MapReduce jobs on Hadoop.
Spark is a powerful processing engine in the Hadoop ecosystem (SQL queries, streaming data, ML algorithms) that can handle both batch and streaming workloads at lightning-fast speeds.
Pig is a platform for analyzing large data sets with a high-level scripting language (Pig Latin) that compiles to MapReduce jobs.
In this tutorial, we learn to:
set up a YARN cluster (1 master node, 2 worker nodes) on GCP
download data files and deposit them into HDFS
create Hive tables that map to files in HDFS
install R/RStudio on YARN master node
set up regular accounts so students can use RStudio on the master node
It is adapted from the RStudio tutorial on setting up a YARN cluster on AWS. GCP Dataproc is the equivalent of EMR in AWS. On the GCP console, go to Dataproc -> Clusters to create a YARN cluster.
Settings for the YARN cluster are mostly self-explanatory.
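Alternatively, a cluster of the same shape (1 master node, 2 worker nodes) can be created from the Cloud Shell with the gcloud CLI. This is only a sketch; the cluster name, region, and machine types below are placeholders to adjust to your project and quota.
# a sketch; cluster name, region, and machine types are placeholders
gcloud dataproc clusters create yarn-cluster \
    --region us-central1 \
    --num-workers 2 \
    --master-machine-type n1-standard-4 \
    --worker-machine-type n1-standard-4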
The SSH button next to the master node brings up a browser terminal that logs in to the YARN master node as the (super)user YOUR_GOOGLE_ACCOUNT_NAME.
Create a user called rstudio-user that will perform the data analysis. Create a user directory for rstudio-user on HDFS with the hadoop fs command.
# Make User
sudo useradd -m rstudio-user
sudo passwd rstudio-user
# change login shell to bash permanently
sudo chsh -s /bin/bash rstudio-user
# Create new directory in hdfs
hadoop fs -mkdir /user/rstudio-user
hadoop fs -chmod 777 /user/rstudio-user
See the Hadoop FS reference for a list of hadoop fs commands.
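A few hadoop fs commands you will use repeatedly in this tutorial (the paths below are only illustrative):
# list a directory on HDFS
hadoop fs -ls /user
# report HDFS disk usage, human readable
hadoop fs -du -h /user/rstudio-user
# peek into a file stored on HDFS
hadoop fs -cat /user/rstudio-user/flights/1987.csv | head
# remove a directory recursively (use with care)
hadoop fs -rm -r /user/rstudio-user/scratch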
The flights data is a well known data source representing 123 million flights over 22 years. It consumes roughly 12 GiB of storage in uncompressed CSV format in yearly files.
Switch user. For data loading and analysis, make sure you are logged in as a regular user.
# switch user
su - rstudio-user
# change to bash shell
#exec bash -l
# change login shell to bash permanently
#chsh -s /bin/bash rstudio-user
Run the following script to download data from the web onto your master node. This script downloads the yearly flight data and the airlines lookup table.
# Make download directory
mkdir /tmp/flights
# Download airline carrier data
wget -O /tmp/airlines.csv http://www.transtats.bts.gov/Download_Lookup.asp?Lookup=L_UNIQUE_CARRIERS
# Download airports data
wget -O /tmp/airports.csv https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
# Download flight data by year
for i in {1987..2008}
do
echo "$(date) $i Download"
fnam=$i.csv.bz2
wget -O /tmp/flights/$fnam http://stat-computing.org/dataexpo/2009/$fnam
echo "$(date) $i Unzip"
bunzip2 /tmp/flights/$fnam
done
Save the script as download.sh, then make it executable and run it:
chmod +x download.sh
./download.sh
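The download and decompression take a while. Once the script finishes, you can check the files on the local disk of the master node (the yearly files together occupy roughly the 12 GiB mentioned above):
# list the yearly flight files
ls -lh /tmp/flights | head
# total size of the uncompressed data
du -sh /tmp/flights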
Peek into the data:
head /tmp/flights/1987.csv
head /tmp/airlines.csv
head /tmp/airports.csv
Copy data into HDFS using the hadoop fs command.
# Copy flight data to HDFS
hadoop fs -mkdir /user/rstudio-user/flights/
hadoop fs -put /tmp/flights /user/rstudio-user/
# Copy airline data to HDFS
hadoop fs -mkdir /user/rstudio-user/airlines/
hadoop fs -put /tmp/airlines.csv /user/rstudio-user/airlines
# Copy airport data to HDFS
hadoop fs -mkdir /user/rstudio-user/airports/
hadoop fs -put /tmp/airports.csv /user/rstudio-user/airports
Launch Hive from the command line.
# Open Hive prompt
hive
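The hive command opens an interactive prompt. Hive statements can also be run non-interactively from the shell, which is convenient for scripting; the file name below is just a placeholder:
# run a single statement from the shell
hive -e 'SHOW DATABASES;'
# run a file of Hive statements
hive -f create_tables.hql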
Create the metadata that will structure the flights table. Load data into the Hive table.
-- Create metadata for flights
CREATE EXTERNAL TABLE IF NOT EXISTS flights
(
year int,
month int,
dayofmonth int,
dayofweek int,
deptime int,
crsdeptime int,
arrtime int,
crsarrtime int,
uniquecarrier string,
flightnum int,
tailnum string,
actualelapsedtime int,
crselapsedtime int,
airtime string,
arrdelay int,
depdelay int,
origin string,
dest string,
distance int,
taxiin string,
taxiout string,
cancelled int,
cancellationcode string,
diverted int,
carrierdelay string,
weatherdelay string,
nasdelay string,
securitydelay string,
lateaircraftdelay string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");
-- Load data into table
LOAD DATA INPATH '/user/rstudio-user/flights' INTO TABLE flights;
These Hive commands will move the data from /user/rstudio-user/flights to /user/hive/warehouse/flights:
hadoop fs -ls /user/rstudio-user/flights
hadoop fs -ls /user/hive/warehouse/flights
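As an optional sanity check, you can run a small aggregation against the new table from a regular shell (not the Hive prompt); this launches a MapReduce job on YARN and may take a few minutes:
# count flights per year for the first few years
hive -e 'SELECT year, COUNT(*) FROM flights GROUP BY year ORDER BY year LIMIT 5;'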
Create the metadata that will structure the airlines table. Load data into the Hive table.
-- Create metadata for airlines
CREATE EXTERNAL TABLE IF NOT EXISTS airlines
(
Code string,
Description string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
-- Load data into table
LOAD DATA INPATH '/user/rstudio-user/airlines' INTO TABLE airlines;
These Hive commands will move the data from /user/rstudio-user/airlines to /user/hive/warehouse/airlines.
hadoop fs -ls /user/rstudio-user/airlines
hadoop fs -ls /user/hive/warehouse/airlines
Create the metadata that will structure the airports table. Load data into the Hive table.
-- Create metadata for airports
CREATE EXTERNAL TABLE IF NOT EXISTS airports
(
id string,
name string,
city string,
country string,
faa string,
icao string,
lat double,
lon double,
alt int,
tz_offset double,
dst string,
tz_name string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = '\,',
"quoteChar" = '\"'
)
STORED AS TEXTFILE;
-- Load data into table
LOAD DATA INPATH '/user/rstudio-user/airports' INTO TABLE airports;
These Hive commands will move the data from /user/rstudio-user/airports to /user/hive/warehouse/airports.
hadoop fs -ls /user/rstudio-user/airports
hadoop fs -ls /user/hive/warehouse/airports
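At this point all three tables should be registered in the Hive metastore. A quick way to confirm, again from the shell:
# list the Hive tables and peek at the airports table
hive -e 'SHOW TABLES;'
hive -e 'SELECT * FROM airports LIMIT 5;'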
See Hive language reference for Hive syntax.
apt-get is the package manager on Debian systems.
Install dependencies:
# Update
sudo apt-get update
# used for devtools
sudo apt-get install libcurl4-openssl-dev libssl-dev libxml2-dev
Update R on master node:
sudo apt-get install r-base r-base-dev
Install commonly used R packages so they’re globally available to all users:
sudo R -e "install.packages(c('tidyverse', 'DBI', 'sparklyr', 'caTools', 'bitops'), repos='https://cran.rstudio.com/')"
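Package compilation can take a while. To verify that the packages ended up in the site library visible to all users, you can load one of them from a non-root shell:
# confirm sparklyr was installed and report its version
R -e "library(sparklyr); packageVersion('sparklyr')"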
Install RStudio Server on Debian 9:
# simple tool to install deb files
sudo apt-get install gdebi-core
wget https://download2.rstudio.org/rstudio-server-stretch-1.1.463-amd64.deb
sudo gdebi rstudio-server-stretch-1.1.463-amd64.deb
RStudio Server will be automatically started upon successful installation.
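You can check that the service is running, and restart it later after configuration changes, with the standard rstudio-server admin commands:
# check that RStudio Server is running
sudo rstudio-server status
# restart RStudio Server (e.g., after editing its configuration)
sudo rstudio-server restart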
By default, port 8787 used by RStudio Server is blocked by the VM firewall. On the GCP console, go to VPC network and then Firewall rules, create a rule for RStudio Server (tcp: 8787), and apply that rule to the YARN master node.
Update: To increase security, in VPC network > Firewall rules, find the rule associated with port 8787, set Source filter to “IP ranges”, and set Source IP ranges to 147.46.0.0/16 and 147.47.0.0/16 in order to allow connections from the SNU network only.
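The same rule can also be created from the Cloud Shell with the gcloud CLI. In this sketch the rule name and target tag are placeholders; the target tag must match a network tag attached to the master node.
# a sketch; rule name and target tag are placeholders
gcloud compute firewall-rules create allow-rstudio \
    --allow tcp:8787 \
    --source-ranges 147.46.0.0/16,147.47.0.0/16 \
    --target-tags yarn-master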
Now you should be able to access RStudio Server on the YARN master node by pointing your browser to the IP address of your master node, say http://35.187.203.86:8787.
Set up a regular account for each student, say wonj:
sudo useradd -m wonj
The -m option creates the home folder /home/wonj. Set a password for the account:
sudo passwd wonj
Add the user to the sudo group and change the login shell to bash permanently:
sudo usermod -aG sudo wonj
sudo chsh -s /bin/bash wonj
Switch to the new account:
su - wonj
To connect to RStudio Server, you need password authentication. As before, edit the /etc/ssh/sshd_config file by
sudo vi /etc/ssh/sshd_config
and uncomment the line
#PasswordAuthentication yes
Don’t forget to comment out the line
PasswordAuthentication no
Restart the ssh daemon by
sudo service ssh restart
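Equivalently, both edits to /etc/ssh/sshd_config can be made non-interactively with sed. This is just a sketch; inspect the file before restarting sshd.
# uncomment 'PasswordAuthentication yes' and comment out 'PasswordAuthentication no'
sudo sed -i 's/^#PasswordAuthentication yes/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo sed -i 's/^PasswordAuthentication no/#PasswordAuthentication no/' /etc/ssh/sshd_config
sudo service ssh restart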
Now you should be able to log in using the username rstudio-user and the password you created.
The YARN web UI runs on port 8088 of the master node, e.g., http://35.187.203.86:8088. As with RStudio Server, create a firewall rule for tcp: 8088, set Source filter to “IP ranges”, and set Source IP ranges to 147.46.0.0/16 and 147.47.0.0/16 in order to allow connections from the SNU network only.
You can also manage Hadoop jobs from the command line:
# list hadoop jobs
yarn application -list
# kill hadoop job
yarn application -kill APPLICATION_ID
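Two more standard yarn subcommands that are handy for debugging (fetching logs requires log aggregation to be enabled, which Dataproc does by default):
# check the status of a specific application
yarn application -status APPLICATION_ID
# fetch the aggregated logs of a finished application
yarn logs -applicationId APPLICATION_ID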