Introduction to Fast Scalable Machine Learning with H2O - Plano

H2O.aiMachine Intelligence

What is H2O? Open source in-memory prediction engineMath Platform

• Parallelized and distributed algorithms making the most use out of multithreaded systems

• GLM, Random Forest, GBM, PCA, etc.

Easy to use and adoptAPI• Written in Java – perfect for Java Programmers• REST API (JSON) – drives H2O from R, Python, Excel, Tableau

More data? Or better models? BOTHBig Data• Use all of your data – model without down sampling• Run a simple GLM or a more complex GBM to find the best fit for the data• More Data + Better Models = Better Predictions

H2O Production Analytics Workflow

DistributedIn-Memory

Load Data

Loss-lessCompression

H2O Compute Engine

Production Scoring Environment

Exploratory &Descriptive

Analysis

Feature Engineering &

Selection

Supervised &Unsupervised

Modeling

ModelEvaluation &

Selection

Predict

Data & ModelStorage

Model Export:Plain Old Java Object

YourImagination

Data Prep Export:Plain Old Java Object

Ensembles

Deep Neural Networks

Algorithms on H2O

• Generalized Linear Models with Regularization: Binomial, Gaussian, Gamma, Poisson and Tweedie

• Naïve Bayes • Distributed Random Forest:

Classification or regression models• Gradient Boosting Machine: Produces

an ensemble of decision trees with increasing refined approximations

• Deep learning: Create multi-layer feed forward neural networks starting with an input layer followed by multiple layers of nonlinear transformations

Supervised Learning

Statistical Analysis

Dimensionality Reduction

Anomaly Detection

Algorithms on H2O

• K-means: Partitions observations into k clusters/groups of the same spatial size

• Principal Component Analysis: Linearly transforms correlated variables to independent components

• Generalized Low Rank Models*: extend the idea of PCA to handle arbitrary data consisting of numerical, Boolean, categorical, and missing data

• Autoencoders: Find outliers using a nonlinear dimensionality reduction using deep learning

Unsupervised Learning

Clustering

JavaScript R Python Excel/Tableau

Network

Rapids Expression Evaluation Engine Scala Customer

Algorithm

Customer AlgorithmParse

GLMGBMRF

Deep LearningK-Means

In-H2O Prediction Engine

H2O Software Stack

Fluid Vector Frame

Distributed K/V Store

Non-blocking Hash Map

MRTask

Fork/Join

Customer Algorithm

Spark Hadoop Standalone H2O

cientific Advisory CouncilStephen BoydProfessor of EE EngineeringStanford University

Rob TibshiraniProfessor of Health Researchand Policy, and StatisticsStanford University

Trevor HastieProfessor of StatisticsStanford University

H2O and R

Reading Data from HDFS into H2O with R

STEP 1

R user

h2o_df = h2o.importFile(“hdfs://path/to/data.csv”)

H2OH2O

data.csv

HTTP REST API request to H2Ohas HDFS path

H2O ClusterInitiate distributed

ingest

HDFSRequest

data from HDFS

STEP 22.2

h2o.importFile()

2.1R function

H2OH2O

STEP 3

Cluster IPCluster Port

Pointer to Data

Return pointer to data in

REST API JSON Response

HDFS provides

3.43.1h2o_df object

created in R

data.csv

h2o_dfH2O

3.2Distributed

H2OFrame in DKV

H2O Cluster

R Script Starting H2O GLM

REST/JSON

.h2o.startModelJob()POST /3/ModelBuilders/glm

h2o.glm()

R script

Standard R process

TCP/IP

REST/JSON

/3/ModelBuilders/glm endpoint

GLM algorithm

GLM tasks

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

R Script Retrieving H2O GLM Result

REST/JSON

h2o.getModel()GET /3/Models/glm_model_id

h2o.glm()

R script

Standard R process

TCP/IP

REST/JSON

/3/Models endpoint

Fork/Join framework

K/V store framework

H2O process

Network layer

REST layer

H2O - algos

H2O - core

User process

H2O process

Legend

H2O in Big Data Environments

Hadoop (and YARN)• You can launch H2O directly on Hadoop:

$ hadoop jar h2odriver.jar … -nodes 3 –mapperXmx 50g

• H2O uses Hadoop MapReduce to get CPU and Memory on the cluster, not to manage work– H2O mappers stay at 0% progress forever

• Until you shut down the H2O job yourself– All mappers (3 in this case) must be running at the same time– The mappers communicate with each other

• Form an H2O cluster on-the-spot within your Hadoop environment– No Hadoop reducers(!)

• Special YARN memory settings for large mappers– yarn.nodemanager.resource.memory-mb– yarn.scheduler.maximum-allocation-mb

• CPU resources controlled via –nthreads h2o command line argument

H2O on YARN DeploymentHadoop Gateway Node

hadoop jar h2odriver.jar …

YARN RM

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NM NM NM NMYARN NodeManagers (NM)

H2O H2O

Now You Have an H2O ClusterHadoop Gateway Node

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

NM NM NMYARN NodeManagers (NM)

H2O H2O

YARN RM

Read Data from HDFS OnceHadoop Gateway Node

HDFSHDFSData Nodes

Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

H2O H2O

YARN RM

Build Models in-MemoryHadoop Gateway Node

hadoop jar h2odriver.jar …Hadoop Cluster

H2OH2O Mappers

(YARN Containers)

YARN Worker Nodes

H2O H2O

YARN RM

Sparkling Water

Sparkling Water Application Life Cycle

Sparkling App

jar file

SparkMaster

spark-submit

SparkWorker

Sparkling Water Cluster

Spark Executor JVM

H2O(4)

Spark Executor JVM

Sparkling Water Data Distribution

Sparkling Water Cluster

Spark Executor JVMData

Source(e.g.

HDFS) (1)

H2ORDD

Spark Executor JVM

SparkRDD

How H2O Processes Data

Distributed Data TaxonomyVector

Distributed Data TaxonomyVector The vector may be very large

(billions of rows)

- Stored as a compressed column (often 4x)

- Access as Java primitives with on-the-fly decompression

- Support fast Random access

- Modifiable with Java memory semanticsH2O.aiMachine Intelligence

Distributed Data TaxonomyVector Large vectors must be distributed

over multiple JVMs

- Vector is split into chunks- Chunk is a unit of parallel access- Each chunk ~ 1000 elements- Per-chunk compression- Homed to a single node- Can be spilled to disk- GC very cheap

Distributed Data Taxonomy age gender zip_code ID

A row of data is always stored in a single JVM

Distributed data frame

- Similar to R data frame

- Adding and removingcolumns is cheap

Distributed Fork/Join

JVMtask

Task is distributed in a tree pattern

- Results are reduced at each inner node

- Returns with a singleresult when all subtasks done

Distributed Fork/Join

JVMtask

task task

tasktaskchunk

chunk chunk

On each node the task is parallelized using Fork/Join

Deploying H2OPOJO Models

H2O on Storm

(Spout)

(Bolt)

H2OScoringPOJO

Real-time data

Predictions

DataSource

(e.g. HDFS)

Emit POJO Model

Real-time StreamModeling workflow

Introduction to Fast Scalable Machine Learning with H2O - Plano

Technology

Fast and Scalable Knowledge Bundle Replication in Splunk

Fast Scalable R with H2O - usermanual.wiki2What is H2O? H2O is fast, scalable, open-source machine learning and deep learning for smarter applications. With H2O, enterprises like PayPal,

DBSCAN++: Towards fast and scalable density clusteringproceedings.mlr.press/v97/jang19a/jang19a.pdf · 2020. 8. 18. · DBSCAN++: Towards fast and scalable density clustering with

Fast, Scalable Demand-Shaping with ColorPower

Realizing Fast, Scalable and Reliable Scientific …iraicu/research/publications/2008_NOVA08...Realizing Fast, Scalable and Reliable Scientific Computations in Grid Environments Yong

Realizing Fast, Scalable and Reliable Scientific Computations in

Fast and Scalable Bayesian Deep Learning by Weight

ReactJS.NET - Fast and Scalable Single Page Applications

Member Fast Facts Scalable Data Systems Australia

A Fast, Scalable Mutual Exclusion Algorithm* - Department of

resources July 2016: First Edition · 2Sparkling Water Introduction Sparkling Water allows users to combine the fast, scalable machine learning algorithms of H2O with the capabilities

Building fast, scalable game server in node.js

A Fast, Scalable, and Reliable Deghosting Method for Extreme Exposure ...val.serc.iisc.ernet.in › ICCP19 › files › EF_iccp19.pdf · A Fast, Scalable, and Reliable Deghosting

H2O AutoML: Scalable Automatic Machine Learning · H2O AutoML (H2O.ai, 2017) is an automated machine learning algorithm included in the H2O framework (H2O.ai, 2013) that is simple

Fast, Scalable Graph Processing: Apache Giraph on YARN

RotNet: Fast and Scalable Estimation of Stellar Rotation

Scalable and Fast Machine Learning - Rice University · Scalable and Fast Machine Learning By Niketan Pansare (np6@rice.edu) Thursday, March 20, 14

PushApps - Smart, Fast and Scalable Push Notifications solution

Package ‘h2o’ · Package ‘h2o’ April 9, 2020 Version 3.30.0.1 Type Package Title R Interface for the 'H2O' Scalable Machine Learning Platform Date 2020-04-03 Description R

The Open Analytics Platform - KNIME...What is H2O? •Open-source machine learning framework •Fast, scalable and in-memory implementations of state-of-the-art machine learning algorithms