Hadoop, being a disruptive data processing framework, has made a large impact on the data ecosystems of today. Enabling business users to translate existing skills to Hadoop is necessary to encourage adoption and to let businesses get value out of their Hadoop investment quickly. R, a prolific and rapidly growing data analysis language, now has a place in the Hadoop ecosystem. With the advent of technologies such as RHadoop, optimizing R workloads for use on Hadoop has become much easier. This session will help you understand how RHadoop packages such as rmr2 and rhdfs work with Hadoop, and will show examples of using these technologies on the Hortonworks Data Platform.
© Hortonworks Inc. 2012
Enabling R on Hadoop July 11, 2013
Your Presenters
• Ravi Mutyala, Systems Architect
• Paul Codding, Solutions Engineer
Agenda
• A Brief History of R
• How R is Typically Used
• How R is Used with Hadoop
• Getting Started
A Brief History of R

History of R
• 1976: S created by John Chambers, implemented in Fortran
• 1988: S V3 rewritten in C; statistical models included
• 1991: R created by Ross Ihaka & Robert Gentleman
• 1997: R Core Group formed
• 1998: S V4
• 2000: R version 1.0 released
How R is Typically Used

Main Uses of R
• Statistical Analysis & Modeling
  – Classification
  – Scoring
  – Ranking
  – Clustering
  – Finding relationships
  – Characterization
• Common Uses
  – Interactive Data Analysis
  – General Purpose Statistics
  – Predictive Modeling
How R is Used with Hadoop

Hadoop Components
[Diagram: the Hortonworks Data Platform (HDP) stack, deployable on OS, cloud, VM, or appliance]
• Operational Services (manage & operate at scale): Oozie, Ambari
• Data Services (store, process and access data): HCatalog, Hive, Pig, HBase, Sqoop, Flume
• Hadoop Core (distributed storage & processing): HDFS, WebHDFS, MapReduce, YARN (in 2.0)
• Platform Services (enterprise readiness): HA, DR, Snapshots, Security, …
Hadoop Components & R
[Same HDP stack diagram, highlighting the components R integrates with]
• Data Service Components: Hive, HBase
• Hadoop Core: MapReduce, HDFS
Options for R on Hadoop
• Options
  – RODBC/RJDBC
  – RHive
  – RHadoop
• Analysis
  – Focus
  – Integration Ease
  – Benefits
  – Limitations
RODBC/RJDBC
• Focus
  – SQL access from R
• Integration Ease
  – Install Hortonworks Hive ODBC Driver
  – Install Hive libraries
• Benefits
  – Low impact on existing R scripts leveraging other DB packages
  – No need to install Hadoop configuration/binaries on client machines
• Limitations
  – Parallelism limited to Hive
  – Result set size
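As a concrete sketch of the JDBC route, the snippet below queries Hive from R with the RJDBC package. The hostname, table, and jar directory are placeholders, not values from the deck; as the limitations above note, the whole result set comes back into local R memory, so queries should aggregate or filter on the Hive side.

```r
# Sketch only: hostname, port, table, and jar paths are placeholders
# for your own cluster.
library(RJDBC)

drv <- JDBC(driverClass = "org.apache.hadoop.hive.jdbc.HiveDriver",
            classPath = list.files("/usr/lib/hive/lib",
                                   pattern = "jar$", full.names = TRUE))

conn <- dbConnect(drv, "jdbc:hive://hiveserver.example.com:10000/default")

# The result set is pulled back into a local R data frame, so keep it small
df <- dbGetQuery(conn, "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")

dbDisconnect(conn)
```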
Deployment Considerations
[Diagram: R clients connect over ODBC/JDBC to the HiveServer (HS); the cluster itself runs the JobTracker (JT), NameNode (NN), and a rack of TaskTracker/DataNode (TT, DN) worker nodes]
RHive
• Focus
  – Broad access to Hive and HDFS
• Integration Ease
  – Requires Hadoop binaries, libraries, and configuration files on client machines
  – Uses the Java DFS client and HiveServer
• Benefits
  – Wide range of features expressed through HQL
  – rhive.apply: a distributed apply function for R, executed via HQL
• Limitations
  – Requires a heavy client deployment
  – Depends on HiveServer and can't be used with HiveServer2
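A minimal RHive session looks like the sketch below. The HiveServer hostname is a placeholder, and the snippet assumes the Hadoop and Hive client libraries mentioned above are already installed on the client machine.

```r
# Sketch only: hostname is a placeholder; assumes HADOOP_HOME and HIVE_HOME
# are set on the client so RHive can find its libraries.
library(RHive)

rhive.init()
rhive.connect(host = "hiveserver.example.com")

# rhive.query returns the query result as a local R data frame
df <- rhive.query("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")

rhive.close()
```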
Deployment Considerations
[Diagram: an R edge node carrying the Hadoop/Hive client stack talks to the HiveServer (HS), NameNode (NN), and JobTracker (JT); TaskTracker/DataNode (TT + DN) worker nodes run the work]
RHadoop
• Focus
  – Tight integration with core Hadoop components
• Benefits
  – Ability to run R on a massively distributed system
  – Ability to work with full data sets instead of sample sets
• Additional Information
  – https://github.com/RevolutionAnalytics/RHadoop/wiki
RHadoop Architecture
[Diagram: the three R packages and the Hadoop services they talk to: rhdfs → HDFS; rhbase → HBase via the Thrift gateway; rmr2 → MapReduce via Hadoop Streaming, which runs R processes on the task nodes]
rhdfs
• Access HDFS from R
• Read from HDFS into an R data frame
• Write from an R data frame to HDFS
• 1.0.6 adds support for Windows (using HDP)
rhdfs
• Hadoop CLI commands and their rhdfs equivalents:
  – hadoop fs -ls /
    hdfs.ls("/")
  – hadoop fs -mkdir /user/rhdfs/ppt
    hdfs.mkdir("/user/rhdfs/ppt")
  – hadoop fs -put 1.txt /user/rhdfs/ppt/
    localData <- system.file(file.path("unitTestData", "1.txt"), package = "rhdfs")
    hdfs.put(localData, "/user/rhdfs/ppt/1.txt")
  – hadoop fs -get /user/rhdfs/ppt/1.txt 1.txt
    hdfs.get("/user/rhdfs/ppt/1.txt", "test")
  – hadoop fs -rm /user/rhdfs/ppt/1.txt
    hdfs.delete("/user/rhdfs/ppt/1.txt")
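Beyond one-to-one CLI equivalents, rhdfs can pull file contents into an R session. The sketch below reads a text file from HDFS into a data frame using rhdfs's line reader; the file path and its comma-separated layout are illustrative assumptions, not from the deck.

```r
# Sketch only: assumes HADOOP_CMD points at the hadoop binary, and that
# /user/rhdfs/ppt/1.txt (a hypothetical file) contains comma-separated rows.
library(rhdfs)
hdfs.init()

reader <- hdfs.line.reader("/user/rhdfs/ppt/1.txt")
lines  <- reader$read()    # read a batch of lines from the HDFS file
reader$close()

# Parse the text lines into a local R data frame
df <- read.csv(textConnection(lines), header = FALSE)
```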
rhbase
• Access and change data within HBase
• Uses the Thrift API
• Command Examples
  – hb.new.table
  – hb.insert
  – hb.scan.ex
  – hb.scan
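Putting those commands together, a small rhbase round trip might look like the sketch below. The Thrift gateway address, table name, and column family are all placeholders; this assumes the HBase Thrift gateway from the architecture diagram is running.

```r
# Sketch only: gateway host/port, table, and column names are placeholders;
# assumes the HBase Thrift gateway is up and reachable.
library(rhbase)
hb.init(host = "localhost", port = 9090)

hb.new.table("pages", "stats")    # create a table with one column family
hb.insert("pages", list(list("row1", "stats:hits", 42)))
hb.get("pages", "row1")           # read the row back

hb.delete.table("pages")          # clean up the example table
```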
rmr2
• Enables writing MapReduce jobs in R
• Ability to parallelize algorithms
• Ability to use big data sets without needing to sample
• mapreduce(input, output, map, reduce, …)
• Reduce takes a key and a collection of values, which can be a vector, list, data frame, or matrix
• 2.2.1 adds support for Windows (using HDP)
Sample code: wordcount

wc.map =
  function(., lines) {
    keyval(
      unlist(
        strsplit(
          x = lines,
          split = pattern)),
      1)}

wc.reduce =
  function(word, counts) {
    keyval(word, sum(counts))}

mapreduce(
  input = input,
  output = output,
  input.format = "text",
  map = wc.map,
  reduce = wc.reduce,
  combine = T)
More Sample Code

groups = rbinom(32, n = 50, prob = 0.4)
tapply(groups, groups, length)

groups = to.dfs(groups)
from.dfs(
  mapreduce(
    input = groups,
    map = function(., v) keyval(v, 1),
    reduce =
      function(k, vv)
        keyval(k, length(vv))))
Deployment Considerations
[Diagram: R is installed on every TaskTracker/DataNode/RegionServer (TT, DN, RS) worker node as well as on an R edge node; the cluster also runs the JobTracker (JT), NameNode (NN), and the HBase Thrift gateway (TG)]
RHadoop
• Limitations
  – Requires installation of R on all TaskTracker nodes
  – Does not automatically parallelize algorithms
  – A different slot/memory configuration is recommended, to leave memory and CPU resources for the R processes running alongside MapReduce on each node
Getting Started
Your Fastest On-ramp to Enterprise Hadoop™!

The Sandbox lets you experience Apache Hadoop from the convenience of your own laptop: no data center, no cloud, and no internet connection needed! The Hortonworks Sandbox is:
• A free download: http://hortonworks.com/products/hortonworks-sandbox/
• A complete, self-contained virtual machine with Apache Hadoop pre-configured
• A personal, portable, and standalone Hadoop environment
• A set of hands-on, step-by-step tutorials that let you learn and explore Hadoop
Installation
• Install R on all nodes
• Install dependent packages:
  – RJSONIO
  – itertools
  – digest
  – Rcpp
  – rJava
  – functional
  – RCurl
  – httr
  – plyr
• Download & install the RHadoop packages:
  – rmr2
  – rhdfs
  – rhbase (requires Thrift)
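The install steps above can be sketched in R as follows. The dependency list comes from the slide; the RHadoop tarball file names are placeholders for whatever versions you download from the RHadoop GitHub releases, since these packages are not on CRAN.

```r
# Sketch only: run on every node (e.g. via your configuration tool) as a
# user with write access to the system library.
install.packages(c("RJSONIO", "itertools", "digest", "Rcpp", "rJava",
                   "functional", "RCurl", "httr", "plyr"),
                 repos = "http://cran.r-project.org")

# RHadoop packages are source tarballs from GitHub, not CRAN; the file
# names below are placeholders for the versions you downloaded.
install.packages("rmr2_2.2.1.tar.gz",  repos = NULL, type = "source")
install.packages("rhdfs_1.0.6.tar.gz", repos = NULL, type = "source")
```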
Questions & Answers
• TRY: Download HDP at hortonworks.com
• LEARN: Applying Data Science using Apache Hadoop training
• FOLLOW: twitter: @hortonworks, facebook: facebook.com/hortonworks
Further questions & comments: [email protected]