Transcript
Page 1: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

What is H2O? Open source in-memory prediction engineMath Platform

• Parallelized and distributed algorithms making the most use out of multithreaded systems

• GLM, Random Forest, GBM, PCA, etc.

Easy to use and adoptAPI• Written in Java – perfect for Java Programmers• REST API (JSON) – drives H2O from R, Python, Excel, Tableau

More data? Or better models? BOTHBig Data• Use all of your data – model without down sampling• Run a simple GLM or a more complex GBM to find the best fit for the data• More Data + Better Models = Better Predictions

Page 2: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

H2O Production Analytics Workflow

HDFS

S3

NFS

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Local

Page 3: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

Ensembles

Deep Neural Networks

Algorithms on H2O

• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes • Distributed Random Forest:

Classification or regression models• Gradient Boosting Machine: Produces

an ensemble of decision trees with increasing refined approximations

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Supervised Learning

Statistical Analysis

Page 4: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

Dimensionality Reduction

Anomaly Detection

Algorithms on H2O

• K-means: Partitions observations into k clusters/groups of the same spatial size

• Principal Component Analysis: Linearly transforms correlated variables to independent components

• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

Unsupervised Learning

Clustering

Page 5: Introduction to Fast Scalable Machine Learning with H2O - Plano

JavaScript R Python Excel/Tableau

Network

Rapids Expression Evaluation Engine Scala Customer

Algorithm

Customer AlgorithmParse

GLMGBMRF

Deep LearningK-Means

PCA

In-H2O Prediction Engine

H2O Software Stack

Fluid Vector Frame

Distributed K/V Store

Non-blocking Hash Map

Job

MRTask

Fork/Join

Flow

Customer Algorithm

Spark Hadoop Standalone H2O

Page 6: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

cientific Advisory CouncilStephen BoydProfessor of EE EngineeringStanford University

Rob TibshiraniProfessor of Health Researchand Policy, and StatisticsStanford University

Trevor HastieProfessor of StatisticsStanford University

Page 7: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O and R

Page 8: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

STEP 1

R user

h2o_df = h2o.importFile(“hdfs://path/to/data.csv”)

Page 9: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

H2OH2O

H2O

data.csv

HTTP REST API request to H2Ohas HDFS path

H2O ClusterInitiate distributed

ingest

HDFSRequest

data from HDFS

STEP 22.2

2.3

2.4

R

h2o.importFile()

2.1R function

call

Page 10: Introduction to Fast Scalable Machine Learning with H2O - Plano

Reading Data from HDFS into H2O with R

H2OH2O

H2O

R

HDFS

STEP 3

Cluster IPCluster Port

Pointer to Data

Return pointer to data in

REST API JSON Response

HDFS provides

data

3.3

3.43.1h2o_df object

created in R

data.csv

h2o_dfH2O

Frame

3.2Distributed

H2OFrame in DKV

H2O Cluster

Page 11: Introduction to Fast Scalable Machine Learning with H2O - Plano

R Script Starting H2O GLM

HTTP

REST/JSON

.h2o.startModelJob()POST /3/ModelBuilders/glm

h2o.glm()

R script

Standard R process

TCP/IP

HTTP

REST/JSON

/3/ModelBuilders/glm endpoint

Job

GLM algorithm

GLM tasks

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

Page 12: Introduction to Fast Scalable Machine Learning with H2O - Plano

R Script Retrieving H2O GLM Result

HTTP

REST/JSON

h2o.getModel()GET /3/Models/glm_model_id

h2o.glm()

R script

Standard R process

TCP/IP

HTTP

REST/JSON

/3/Models endpoint

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

Page 13: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O in Big Data Environments

Page 14: Introduction to Fast Scalable Machine Learning with H2O - Plano

Hadoop (and YARN)• You can launch H2O directly on Hadoop:

$ hadoop jar h2odriver.jar … -nodes 3 –mapperXmx 50g

• H2O uses Hadoop MapReduce to get CPU and Memory on the cluster, not to manage work– H2O mappers stay at 0% progress forever

• Until you shut down the H2O job yourself– All mappers (3 in this case) must be running at the same time– The mappers communicate with each other

• Form an H2O cluster on-the-spot within your Hadoop environment– No Hadoop reducers(!)

• Special YARN memory settings for large mappers– yarn.nodemanager.resource.memory-mb– yarn.scheduler.maximum-allocation-mb

• CPU resources controlled via –nthreads h2o command line argument

Page 15: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O on YARN DeploymentHadoop Gateway Node

hadoop jar h2odriver.jar …

YARN RM

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NM NM NM NMYARN NodeManagers (NM)

H2O H2O

Page 16: Introduction to Fast Scalable Machine Learning with H2O - Plano

Now You Have an H2O ClusterHadoop Gateway Node

hadoop jar h2odriver.jar …

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 17: Introduction to Fast Scalable Machine Learning with H2O - Plano

Read Data from HDFS OnceHadoop Gateway Node

hadoop jar h2odriver.jar …

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 18: Introduction to Fast Scalable Machine Learning with H2O - Plano

Build Models in-MemoryHadoop Gateway Node

hadoop jar h2odriver.jar …Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Page 19: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water

Page 20: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water Application Life Cycle

Sparkling App

jar file

SparkMaster

JVM

spark-submit

SparkWorker

JVM

SparkWorker

JVM

SparkWorker

JVM

(1)

(2)

(3)

Sparkling Water Cluster

Spark Executor JVM

H2O(4)

Spark Executor JVM

H2O

Spark Executor JVM

H2O

Page 21: Introduction to Fast Scalable Machine Learning with H2O - Plano

Sparkling Water Data Distribution

H2O

H2O

H2O

Sparkling Water Cluster

Spark Executor JVMData

Source(e.g.

HDFS) (1)

(2)

(3)

H2ORDD

Spark Executor JVM

Spark Executor JVM

SparkRDD

Page 22: Introduction to Fast Scalable Machine Learning with H2O - Plano

How H2O Processes Data

Page 23: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector

H2O.aiMachine Intelligence

Page 24: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector The vector may be very large

(billions of rows)

- Stored as a compressed column (often 4x)

- Access as Java primitives with on-the-fly decompression

- Support fast Random access

- Modifiable with Java memory semanticsH2O.aiMachine Intelligence

Page 25: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data TaxonomyVector Large vectors must be distributed

over multiple JVMs

- Vector is split into chunks- Chunk is a unit of parallel access- Each chunk ~ 1000 elements- Per-chunk compression- Homed to a single node- Can be spilled to disk- GC very cheap

H2O.aiMachine Intelligence

Page 26: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Data Taxonomy age gender zip_code ID

A row of data is always stored in a single JVM

Distributed data frame

- Similar to R data frame

- Adding and removingcolumns is cheap

H2O.aiMachine Intelligence

Page 27: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Fork/Join

JVMtask

JVMtask

JVMtask

JVMtask

JVMtask

Task is distributed in a tree pattern

- Results are reduced at each inner node

- Returns with a singleresult when all subtasks done

H2O.aiMachine Intelligence

Page 28: Introduction to Fast Scalable Machine Learning with H2O - Plano

Distributed Fork/Join

JVMtask

JVM

task

task task

tasktaskchunk

chunk chunk

On each node the task is parallelized using Fork/Join

H2O.aiMachine Intelligence

Page 29: Introduction to Fast Scalable Machine Learning with H2O - Plano

Deploying H2OPOJO Models

Page 30: Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O on Storm

(Spout)

(Bolt)

H2OScoringPOJO

Real-time data

Predictions

DataSource

(e.g. HDFS)

H2O

Emit POJO Model

Real-time StreamModeling workflow


Recommended