30
Scalable Machine Learning in R with H2O Erin LeDell @ledell DSC July 2016

Erin LeDell @ledell DSC July 2016

Embed Size (px)

Citation preview

Page 1: Erin LeDell @ledell DSC July 2016

Scalable Machine Learning in R

with H2O

Erin LeDell @ledell

DSC July 2016

Page 2: Erin LeDell @ledell DSC July 2016

Introduction

• Statistician & Machine Learning Scientist at H2O.ai in Mountain View, California, USA

• Ph.D. in Biostatistics with Designated Emphasis in Computational Science and Engineering from UC Berkeley (focus on Machine Learning)

• Written a handful of machine learning R packages

Page 3: Erin LeDell @ledell DSC July 2016

Agenda

• Who/What is H2O? • H2O Platform

• H2O Distributed Computing • H2O Machine Learning

• H2O in R

Page 4: Erin LeDell @ledell DSC July 2016

H2O.ai

H2O.ai, the Company

H2O, the Platform

• Team: 60; Founded in 2012 • Mountain View, CA • Stanford & Purdue Math & Systems Engineers

• Open Source Software (Apache 2.0 Licensed) • R, Python, Scala, Java and Web Interfaces • Distributed Algorithms that Scale to Big Data

Page 5: Erin LeDell @ledell DSC July 2016

Scienti f ic Advisory Council

• John A. Overdeck Professor of Mathematics, Stanford University • PhD in Statistics, Stanford University • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Co-author with John Chambers, Statistical Models in S • Co-author, Generalized Additive Models

Dr. Trevor Hastie

• Professor of Statistics and Health Research and Policy, Stanford University • PhD in Statistics, Stanford University • Co-author, The Elements of Statistical Learning: Prediction, Inference and Data Mining • Author, Regression Shrinkage and Selection via the Lasso • Co-author, An Introduction to the Bootstrap

Dr. Robert Tibshirani

• Professor of Electrical Engineering and Computer Science, Stanford University • PhD in Electrical Engineering and Computer Science, UC Berkeley • Co-author, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers • Co-author, Linear Matrix Inequalities in System and Control Theory • Co-author, Convex Optimization

Dr. Steven Boyd

Page 6: Erin LeDell @ledell DSC July 2016

H2O Platform

Page 7: Erin LeDell @ledell DSC July 2016

H2O Platform Overview

• Distributed implementations of cutting edge ML algorithms. • Core algorithms written in high performance Java. • APIs available in R, Python, Scala, REST/JSON. • Interactive Web GUI.

Page 8: Erin LeDell @ledell DSC July 2016

H2O Platform Overview

• Write code in high-level language like R (or use the web GUI) and output production-ready models in Java.

• To scale, just add nodes to your H2O cluster. • Works with Hadoop, Spark and your laptop.

Page 9: Erin LeDell @ledell DSC July 2016

H2O Distr ibuted Computing

H2O Cluster

H2O Frame

• Multi-node cluster with shared memory model. • All computations in memory. • Each node sees only some rows of the data. • No limit on cluster size.

• Distributed data frames (collection of distributed arrays). • Columns are distributed across the cluster • Single row is on a single machine. • Syntax is the same as R’s data.frame or Python’s

pandas.DataFrame

Page 10: Erin LeDell @ledell DSC July 2016

H2O Communication

Reliable RPC

• H2O requires network communication to JVMs in unrelated process or machine memory spaces.

• Performance is network dependent.

Optimizations

• H2O implements a reliable RPC which retries failed communications at the RPC level.

• We can pull cables from a running cluster, and plug them back in, and the cluster will recover.

Network Communication

• Message data is compressed in a variety of ways (because CPU is cheaper than network).

• Short messages are sent via 1 or 2 UDP packets; larger message use TCP for congestion control.

Page 11: Erin LeDell @ledell DSC July 2016

Data Processing in H2O

Group By

• Map/Reduce is a nice way to write blatantly parallel code; we support a particularly fast and efficient flavor.

• Distributed fork/join and parallel map: within each node, classic fork/join.

Ease of Use

• We have a GroupBy operator running at scale. • GroupBy can handle millions of groups on billions of

rows, and runs Map/Reduce tasks on the group members.

Map Reduce

• H2O has overloaded all the basic data frame manipulation functions in R and Python.

• Tasks such as imputation and one-hot encoding of categoricals is performed inside the algorithms.

Page 12: Erin LeDell @ledell DSC July 2016

H2O on Spark

Sparkling Water

• Sparkling Water is transparent integration of H2O into the Spark ecosystem.

• H2O runs inside the Spark Executor JVM.

Features

• Provides access to high performance, distributed machine learning algorithms to Spark workflows.

• Alternative to the default MLlib library in Spark.

Page 13: Erin LeDell @ledell DSC July 2016

SparkR Implementation Details

• Central controller: • Explicitly “broadcast” auxiliary objects to worker nodes

• Distributed workers: • Scala code spans Rscript processes • Scala communicates with worker processes via stdin/stout using custom protocol • Serializes data via R serialization, simple binary serialization of integers,

strings, raw byes • Hides distributed operations

• Same function names for local and distributed computation • Allows same code for simple case, distributed case

Page 14: Erin LeDell @ledell DSC July 2016

H2O vs SparkR

• Although SparkML / MLlib (in Scala) supports a good number of algorithms, SparkR still only supports GLMs.

• Major differences between H2O and Spark: • In SparkR, R each worker has to be able to access local

R interpreter. • In H2O, there is only a (potentially local) instance of R

driving the distributed computation in Java.

Page 15: Erin LeDell @ledell DSC July 2016

H2O Machine Learning

Page 16: Erin LeDell @ledell DSC July 2016

Current Algor ithm Overview

Statistical Analysis

• Linear Models (GLM) • Naïve Bayes

Ensembles

• Random Forest • Distributed Trees • Gradient Boosting Machine • R Package - Stacking / Super

Learner

Deep Neural Networks

• Multi-layer Feed-Forward Neural Network

• Auto-encoder • Anomaly Detection • Deep Features

Clustering

• K-Means

Dimension Reduction

• Principal Component Analysis • Generalized Low Rank Models

Solvers & Optimization

• Generalized ADMM Solver • L-BFGS (Quasi Newton Method) • Ordinary Least-Square Solver • Stochastic Gradient Descent

Data Munging

• Scalable Data Frames • Sort, Slice, Log Transform

Page 17: Erin LeDell @ledell DSC July 2016

H2O in R

Page 18: Erin LeDell @ledell DSC July 2016

h2o R Package

Installation

• Java 7 or later; R 3.1 and above; Linux, Mac, Windows • The easiest way to install the h2o R package is CRAN. • Latest version: http://www.h2o.ai/download/h2o/r

Design

All computations are performed in highly optimized Java code in the H2O cluster, initiated by REST calls from R.

Page 19: Erin LeDell @ledell DSC July 2016

h2o R Package

Page 20: Erin LeDell @ledell DSC July 2016

Load Data into R

Page 21: Erin LeDell @ledell DSC July 2016

Train a Model & Predict

Page 22: Erin LeDell @ledell DSC July 2016

Grid Search

Page 23: Erin LeDell @ledell DSC July 2016

H2O Ensemble

Page 24: Erin LeDell @ledell DSC July 2016

Plott ing Results

plot(fit) plots scoring history over time.

Page 25: Erin LeDell @ledell DSC July 2016

H2O R Code

https://github.com/h2oai/h2o-3/blob/master/h2o-r/h2o-package/R/gbm.R

https://github.com/h2oai/h2o-3/blob/26017bd1f5e0f025f6735172a195df4e794f31

1a/h2o-r/h2o-package/R/models.R#L103

Page 26: Erin LeDell @ledell DSC July 2016

H2O Resources

• H2O Online Training: http://learn.h2o.ai • H2O Tutorials: https://github.com/h2oai/h2o-tutorials • H2O Slidedecks: http://www.slideshare.net/0xdata • H2O Video Presentations: https://www.youtube.com/user/0xdata • H2O Community Events & Meetups: http://h2o.ai/events

Page 27: Erin LeDell @ledell DSC July 2016

Tutor ial: Intro to H2O Algor ithms

The “Intro to H2O” tutorial introduces five popular supervised machine

learning algorithms in the context of a binary classification problem.

The training module demonstrates how to train models and evaluating model performance on a test set.

• Generalized Linear Model (GLM)

• Random Forest (RF)

• Gradient Boosting Machine (GBM)

• Deep Learning (DL)

• Naive Bayes (NB)

Page 28: Erin LeDell @ledell DSC July 2016

Tutor ial: Gr id Search for Model Select ion

The second training module demonstrates

how to find the best set of model parameters for each model using

Grid Search.

Page 29: Erin LeDell @ledell DSC July 2016

H2O Booklets

http://www.h2o.ai/docs

Page 30: Erin LeDell @ledell DSC July 2016

Thank you!

@ledell on Github, Twitter [email protected]

http://www.stat.berkeley.edu/~ledell