© 2014 IBM Corporation
Bluemix Hadoop Beginner’s Guide -- Part I
Joseph Chang
Senior IT Specialist
IBM Cloud
• Ambari
• HDFS Explorer
• WebHDFS API
• Connect with R Console
• Machine Learning (lm, k-means)
Reference:
https://www.ng.bluemix.net/docs/services/AnalyticsforHadoop/index.html#analyticsforhadoop_data
Are you the target reader?
Have you heard about Bluemix? If not, learn about Bluemix and sign up: http://www.bluemix.net
Do you know Hadoop? If not, learn about Hadoop: https://hadoop.apache.org/
Do you know the R language? If not, learn about R: https://www.r-project.org/
Are you interested in having the three things work together? If yes, continue to the next page. If not, bye.
The following 2 Bluemix services are used in this tutorial:
This tutorial assumes you already have a Bluemix ID. If you don’t, go to http://www.bluemix.net to get one.
Create Hadoop Service in Bluemix
Create a Java runtime and add a Hadoop service to it yourself.
You can get the AmbariUrl, WebhdfsUrl, ID, password, etc. from “Show Credentials”.
Ambari Hadoop Management
Monitoring Hadoop with Ambari
Launch the Ambari Dashboard with this URL:
https://bi-hadoop-prod-<Cluster ID>.services.dal.bluemix.net:8081
e.g. https://bi-hadoop-prod-2016.services.dal.bluemix.net:8081
Ambari – View the detailed information of each service
Note: The Spark service is available in this environment.
Ambari – Hosts
The server nodes in this Hadoop cluster.
Ambari – Cluster Stack Version
The Big R service will be used in this tutorial.
HDFS Explorer
Launch HDFS Explorer with this URL:
https://bi-hadoop-prod-<Cluster ID>.services.dal.bluemix.net:8443/gateway/default/hdfs/explorer.html
e.g. https://bi-hadoop-prod-2016.services.dal.bluemix.net:8443/gateway/default/hdfs/explorer.html
Use it to view the files on the Hadoop file system. It is read-only.
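The Ambari and HDFS explorer URLs follow the same host pattern and differ only in port and path, so both can be derived from the cluster ID. The sketch below is a minimal illustration: the function names `bluemix_hadoop_host`, `ambari_url`, and `hdfs_explorer_url` are made up for this example, and the host pattern is simply the one shown on the slides.

```shell
#!/bin/sh
# Hypothetical helpers (not part of any Bluemix tool): build the console
# URLs from a cluster ID, using the host pattern shown on the slides.

bluemix_hadoop_host() {
    # $1 = cluster ID, e.g. 2016
    printf 'bi-hadoop-prod-%s.services.dal.bluemix.net' "$1"
}

ambari_url() {
    # Ambari dashboard listens on port 8081
    printf 'https://%s:8081' "$(bluemix_hadoop_host "$1")"
}

hdfs_explorer_url() {
    # The HDFS explorer is served through the gateway on port 8443
    printf 'https://%s:8443/gateway/default/hdfs/explorer.html' "$(bluemix_hadoop_host "$1")"
}
```

For example, `ambari_url 2016` prints the example Ambari address given above.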
HDFS – Healthy
WEBHDFS REST API
Upload Data with curl + webhdfs rest api
Use the WebHDFS API to upload a file in two steps.
Step 1: Create a temporary redirect
curl -i -L -k -s --user biblumix:<your_biblumix_password> --max-time 45 -X PUT "https://bi-hadoop-prod-<your_cluster_number>.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/<path_to_file/file_name>?op=CREATE"
Step 2: Upload the file from local disk
curl -i -L -k -s --user biblumix:<your_biblumix_password> --max-time 45 -X PUT -T <file_name.txt> "<Location URL from step 1 response message>"
If curl is not available on your command line, download and install it first.
The current CREATE API has a defect that causes the uploaded file to have size 0. The two-step approach is a workaround.
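The two steps can be chained in a small script: step 1 returns the redirect target in a `Location` header, and step 2 sends the file there. The following is a sketch under stated assumptions, not a tested client for this service. `extract_location` and `upload_file` are hypothetical names, `-L` is left out so that curl prints the redirect instead of following it, and the password is read from an environment variable `BI_PASSWORD` that you would set yourself.

```shell
#!/bin/sh
# Hypothetical helper: pull the redirect target out of the headers that
# `curl -i` prints. Header names are case-insensitive, and HTTP header
# lines end in CR LF, so the trailing carriage return is stripped.
extract_location() {
    awk 'tolower($1) == "location:" { sub(/\r$/, "", $2); print $2; exit }'
}

# Hypothetical wrapper around the two curl calls from the slide.
# $1 = cluster number, $2 = local file, $3 = target path under /user/biblumix
# Assumes BI_PASSWORD holds the biblumix password.
upload_file() {
    base="https://bi-hadoop-prod-$1.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1"
    # Step 1: ask for the upload location (expect HTTP 307).
    location=$(curl -i -k -s --user "biblumix:$BI_PASSWORD" --max-time 45 \
        -X PUT "$base/user/biblumix/$3?op=CREATE" | extract_location)
    [ -n "$location" ] || { echo "no Location header in step 1 response" >&2; return 1; }
    # Step 2: send the file body to that location (expect HTTP 201).
    curl -i -k -s --user "biblumix:$BI_PASSWORD" --max-time 45 \
        -X PUT -T "$2" "$location"
}
```

A call such as `upload_file 2016 airline.csv data/airline.csv` would then run both steps in sequence; the file names here are placeholders.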
Upload Data with curl + webhdfs rest api (Screen capture)
Step 1: Create temp redirect. You should get response code 307 in step 1.
Step 2: Upload file from local disk. You should get response code 201 in step 2.
The Location in the step 1 response message is used in step 2.
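The expected response codes can also be checked in a script rather than read off the terminal. The sketch below assumes you capture the code with curl's `--write-out '%{http_code}'` option; `check_status` is a hypothetical helper name, and the URL variable in the comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: compare an HTTP status code with the expected one.
# $1 = step label, $2 = expected code, $3 = actual code
check_status() {
    if [ "$3" = "$2" ]; then
        echo "$1: OK ($3)"
    else
        echo "$1: expected $2 but got $3" >&2
        return 1
    fi
}

# Example use (step1_url is a placeholder you would fill in yourself):
#   code=$(curl -k -s -o /dev/null -w '%{http_code}' \
#       --user "biblumix:$BI_PASSWORD" -X PUT "$step1_url")
#   check_status "step 1" 307 "$code"
```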
Upload Data with curl + webhdfs rest api (Result)
The file has been uploaded. Note that the size should not be 0.
More webhdfs rest api
List the contents of a directory:
curl -i -k -s --user biblumix:your_biblumix_password --max-time 45 "https://bi-hadoop-prod-your_cluster_number.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user?op=LISTSTATUS"
List Oozie workflow jobs:
curl -i -s --user biblumix:password "https://hostname:8443/gateway/default/oozie/v1/jobs?jobtype=wf"
Submit and start an Oozie job:
curl -i -s --user biblumix:password -X POST -H "Content-Type: application/xml" -d @oozie-mrjob-config.xml "https://hostname:8443/gateway/default/oozie/v1/jobs?action=start"
Delete a file:
curl -i -s --user biblumix:your_biblumix_password --max-time 45 -X DELETE "https://bi-hadoop-prod-your_cluster_number.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/path_to_file?op=DELETE"
Create a directory:
curl -i -k -s --user biblumix:your_biblumix_password --max-time 45 -X PUT "https://bi-hadoop-prod-your_cluster_number.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/user/biblumix/path_to_directory?op=MKDIRS"
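The WebHDFS calls share the same gateway prefix and differ only in HTTP method, path, and `op` value. A small helper that assembles the URL keeps stray spaces and typos out of the long command lines. This is only a sketch: `webhdfs_url` is a hypothetical name, not part of curl or WebHDFS, and `BI_PASSWORD` is an assumed environment variable holding your password.

```shell
#!/bin/sh
# Hypothetical helper: assemble a WebHDFS URL behind the gateway.
# $1 = cluster number, $2 = HDFS path without leading slash, $3 = operation
webhdfs_url() {
    printf 'https://bi-hadoop-prod-%s.services.dal.bluemix.net:8443/gateway/default/webhdfs/v1/%s?op=%s' "$1" "$2" "$3"
}

# Example use (needs a real cluster number and password):
#   curl -i -k -s --user "biblumix:$BI_PASSWORD" --max-time 45 \
#       "$(webhdfs_url 2016 user LISTSTATUS)"
#   curl -i -k -s --user "biblumix:$BI_PASSWORD" --max-time 45 -X PUT \
#       "$(webhdfs_url 2016 user/biblumix/newdir MKDIRS)"
```

Quoting the `$(...)` result matters because the `?` in the URL is a glob character in most shells.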
Install R Console & Big R
Download Drivers for Big R
The Big R library can be downloaded from this URL:
https://hub.jazz.net/project/kulkarni/a4h/overview#https://hub.jazz.net/git/kulkarni%252Fa4h/list/master/client-libs
Extract the file to /temp
Install R Console
If you don’t have the R console on your PC or notebook, download R from https://cran.r-project.org/
You can use either the R Console or a terminal. To launch R in a terminal, type R on the command line.
Install Big R
> install.packages('rJava')
--- Please select a CRAN mirror for use in this session ---
HTTPS CRAN mirror
1: 0-Cloud [https] 2: Austria [https]
3: Chile [https] 4: China (Beijing 4) [https]
5: China (Hefei) [https] 6: Colombia (Cali) [https]
7: France (Lyon 2) [https] 8: Germany (Münster) [https]
9: Iceland [https] 10: Russia (Moscow) [https]
11: Spain (A Coruña) [https] 12: Switzerland [https]
13: UK (Bristol) [https] 14: UK (Cambridge) [https]
15: USA (CA 1) [https] 16: USA (KS) [https]
17: USA (MI 1) [https] 18: USA (TN) [https]
19: USA (TX) [https] 20: USA (WA) [https]
21: (HTTP mirrors)
Selection: 1
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/rJava_0.9-7.tgz'
Content type 'application/x-gzip' length 604271 bytes (590 KB)
==================================================
downloaded 590 KB
The downloaded binary packages are in
/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages
Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/Library/Frameworks/R.framework/Resources/modules//R_X11.so':
  dlopen(/Library/Frameworks/R.framework/Resources/modules//R_X11.so, 6): Library not loaded: /opt/X11/lib/libSM.6.dylib
  Referenced from: /Library/Frameworks/R.framework/Resources/modules//R_X11.so
  Reason: image not found
>
Before installing the Big R package, you need to install rJava.
> install.packages('base64enc')
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/base64enc_0.1-3.tgz'
Content type 'application/x-gzip' length 26679 bytes (26 KB)
==================================================
downloaded 26 KB
The downloaded binary packages are in
/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages
> install.packages('data.table')
also installing the dependencies ‘stringi’, ‘magrittr’, ‘plyr’, ‘stringr’, ‘Rcpp’, ‘chron’, ‘reshape2’
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/stringi_0.5-5.tgz'
Content type 'application/x-gzip' length 12685069 bytes (12.1 MB)
==================================================
downloaded 12.1 MB
….
trying URL 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.2/data.table_1.9.4.tgz'
Content type 'application/x-gzip' length 1266610 bytes (1.2 MB)
==================================================
downloaded 1.2 MB
The downloaded binary packages are in
/var/folders/g0/jgl74nkx0h97dgpywqv2prrc0000gn/T//Rtmp3ggvb8/downloaded_packages
>
Before installing the Big R package, you also need to install base64enc and data.table.
> install.packages(pkg="/temp/bigr_3.18.tar.gz", type="source", repos=NULL)
* installing *source* package ‘bigr’ ...
** R
** inst
** preparing package for lazy loading
Attaching...
Creating a generic function for ‘toString’ from package ‘base’ in package ‘bigr’
Creating a generic function for ‘nchar’ from package ‘base’ in package ‘bigr’
Creating a generic function for ‘coef’ from package ‘stats’ in package ‘bigr’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (bigr)
>
Now you can install the Big R library. Make sure the path to the package file is correct.
Install Big R (2 issues in Bluemix doc)
1. If you copy the command from the Bluemix doc, you may get an error (as of Aug. 2015). It should be “packages”.
2. The Bluemix instructions don’t mention that the 3 libraries (rJava, base64enc, data.table) need to be installed first.
Machine Learning
-- Linear Regression
-- K-means
Reference:
http://www-01.ibm.com/support/knowledgecenter/SSPT3X_4.0.0/com.ibm.swg.im.infosphere.biginsights.bigr.doc/doc/intro.html?cp=SSPT3X_4.0.0%2F9-1
Recommendation: Learn more about Big R from the URL above.
Machine Learning – Big R Example #1-1
############################
# 1.1 Connect to Bluemix Hadoop
############################
# In order to try out any example, first run the following steps to upload
# the aforementioned dataset to a BigInsights cluster.
library(bigr)
bigr.connect(host="bi-hadoop-prod-2016.services.dal.bluemix.net",
             user="biblumix", password="w9@4f0~HnXLD",
             ssl=TRUE, trustStorePath="/Library/Java/Home/lib/security/cacerts",
             trustStorePassword="changeit", keyManager="SunX509")
is.bigr.connected()
In the host name, replace 2016 with your own cluster ID.
Replace the password with your own password.
Replace the trustStorePath with the Java home path in your environment.
Machine Learning – Big R Example #1-2
#################
# 1.2 Data loading
#################
airfile <- system.file("extdata", "airline.zip", package="bigr")
airfile <- unzip(airfile, exdir = tempdir())
airR <- read.csv(airfile, stringsAsFactors=F)
# Upload the data to the BigInsights server. This may take 15-20 seconds
air <- as.bigr.frame(airR)
air <- bigr.persist(air, dataSource="DEL", dataPath="/user/bigr/examples/airline_demo.csv", header=T, delimiter=",", useMapReduce=F)
dataSource="DEL" marks the file as delimited; this file uses “,” as the delimiter.
Big R Example #1 (Screen capture)
You should get “TRUE” if you connected to Bluemix Hadoop successfully.
You can check whether the file was uploaded successfully with HDFS Explorer.
About the airline.csv sample data
The airline.zip sample can be found in your R installation directory.
Machine Learning – Big R Example #2
###########################
# 2. Accessing data on HDFS
###########################
# Once uploaded, one merely needs to instantiate a bigr.frame object,
# commonly referenced as "air" in the examples, to access the dataset via
# the Big R API.
air <- bigr.frame(dataPath = "/user/bigr/examples/airline_demo.csv",
                  dataSource = "DEL",
                  delimiter = ",", header = T,
                  coltypes = ifelse(1:29 %in% c(9,11,17,18,23), "character", "integer"),
                  useMapReduce = F)
There are 29 columns in the airline_demo.csv file. Columns 9, 11, 17, 18, and 23 are character; the remaining columns are integer.
Big R Example #2 (Screen capture)
Machine Learning – Big R Example #3-1
#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################
# Remove files from previous executions (if any)
invisible(bigr.rmfs("/user/bigr/examples/airline.sample.* /user/bigr/examples/lm.airline*"))
# Project some relevant columns for modeling / statistical analysis
airlineFiltered <- air[, c("Month", "DayofMonth", "DayOfWeek", "CRSDepTime", "Distance", "ArrDelay")]
# Create a bigr.matrix from the data
airlineMatrix <- bigr.transform(airlineFiltered,
                                outData="/user/bigr/examples/airline.sample.matrix",
                                transformPath="/user/bigr/examples/airline.sample.transform")
These 6 variables are chosen for this model.
Machine Learning – Big R Example #3-2
#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################
# Split the data into 70% for training and 30% for testing
samples <- bigr.sample(airlineMatrix, perc=c(0.7, 0.3))
train <- samples[[1]]
test <- samples[[2]]
# Create a linear regression model
lm <- bigr.lm(ArrDelay ~ ., data=train, directory="/user/bigr/examples/lm.airline")
# Get the coefficients of the regression
coef(lm)
We will use "Month", "DayofMonth", "DayOfWeek", "CRSDepTime", and "Distance" to predict ArrDelay.
Big R Example #3 (Screen Capture)
Big R Example #3-2 (Result)
The Arrival Delay prediction model is:
Y = -0.174423*X1 - 0.01547941*X2 - 0.03378236*X3 + 0.006222544*X4 + 0.0003556919*X5
where
Y: ArrDelay
X1: Month
X2: DayofMonth
X3: DayOfWeek
X4: CRSDepTime
X5: Distance
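To see how the fitted coefficients turn into a prediction, the formula can be evaluated for one flight. The sketch below uses awk for the arithmetic. The function name `predict_arr_delay` and the input values are made up for illustration, and the coefficients are the ones printed on the slide, with no intercept term because none is listed.

```shell
#!/bin/sh
# Hypothetical helper: evaluate the slide's regression formula for one flight.
# Arguments: Month DayofMonth DayOfWeek CRSDepTime Distance
predict_arr_delay() {
    # LC_ALL=C keeps "." as the decimal separator regardless of locale.
    LC_ALL=C awk -v x1="$1" -v x2="$2" -v x3="$3" -v x4="$4" -v x5="$5" 'BEGIN {
        y = -0.174423 * x1 - 0.01547941 * x2 - 0.03378236 * x3
        y = y + 0.006222544 * x4 + 0.0003556919 * x5
        printf "%.4f\n", y
    }'
}

# Made-up input: Month=1, DayofMonth=15, DayOfWeek=3, CRSDepTime=1200, Distance=500
predict_arr_delay 1 15 3 1200 500   # prints 7.1369
```

With these inputs the CRSDepTime term (0.006222544 * 1200, about 7.47) dominates the result, which shows how much the scheduled departure time drives this particular model.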
Machine Learning – Big R Example #3-3
#################################################################
# 3. Machine Learning example: building a Linear Regression model
#################################################################
# Calculate predictions for the testing set
pred <- predict(lm, test, "/user/bigr/examples/lm.airline.preds")
Big R Example #3 (Screen Capture)
Predicted arrival delay time for test data.
Big R Example #3 (output)
View the prediction files generated on HDFS.
Machine Learning – Big R Example #4
##################################################################
# 4. Machine Learning example: building a k-means clustering model
##################################################################
# Remove files from previous executions (if any)
invisible(bigr.rmfs("/user/bigr/examples/iris.* /user/bigr/examples/km*"))
# Load the Iris dataset to HDFS
irisbf <- as.bigr.frame(iris[, -5])
# Convert the Iris dataset into a bigr.matrix object
irisBM <- bigr.transform(bf = irisbf,
                         outData = "/user/bigr/examples/iris.mtx",
                         transformPath = "/user/bigr/examples/iris.transform")
# Create a k-means model with 10 clusters
km <- bigr.kmeans(irisBM, centers=10, directory="/user/bigr/examples/km", writeY=T)
# Use the existing model to cluster a different dataset
p <- predict(km, irisBM, "/user/bigr/examples/km.preds")
Iris is a built-in sample data set in R.
About the sample data -- IRIS
Big R Example #4 (Screen Capture)
The 10 clusters of IRIS by k-means.
Big R Example #4 (Screen capture)
Use the model to assign each sample to a cluster.
Appendix 1: Hadoop Cloud Demo with IBM Bluemix
I found this great video on YouTube. You can learn more about Bluemix Hadoop from it.
Big Data Hadoop Cloud Demo – IBM Bluemix
https://www.youtube.com/watch?v=FUDOsBDAahE
Appendix 2: Define Hadoop Cluster by yourself
If you want your application to run faster, you can choose this paid service, which runs on bare-metal servers with multiple nodes.
BigInsights for Hadoop Cluster Topology