
H2O - Thirst for Machine Learning


Page 1: H20 - Thirst for Machine Learning

Copyright 2015 CATENATE Group – All rights reserved

H2O - Thirst for Machine Learning
Meetup Machine Learning/Data Science, Rome, 15 March 2017
Gabriele Nocco, Senior Data Scientist
[email protected]

Catenate s.r.l.

Page 2: H20 - Thirst for Machine Learning


AGENDA

● H2O Introduction
● GBM
● Demo


Page 4: H20 - Thirst for Machine Learning


H2O INTRODUCTION

H2O is an open-source, in-memory Machine Learning engine. Java-based, it exposes convenient APIs in Java, Scala, Python and R. It also has a notebook-like user interface called Flow.

This breadth of language support opens the framework to many different professional roles, from analysts to programmers up to more “academic” data scientists. H2O can therefore serve as a complete infrastructure, from the prototype model to the engineered production solution.
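For instance, a minimal sketch of starting a local H2O cluster from the Python API and loading some data (the file path is illustrative); once the cluster is up, the Flow UI is normally reachable at http://localhost:54321:

import h2o

h2o.init()                                  # start (or connect to) a local H2O cluster
frame = h2o.import_file("data/train.csv")   # parsed into a distributed, in-memory H2OFrame
frame.describe()                            # summary statistics, computed in the cluster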

Page 5: H20 - Thirst for Machine Learning


H2O INTRODUCTION - GARTNER

In 2017, H2O.ai was named a Visionary in the Gartner Magic Quadrant for Data Science Platforms:

STRENGTHS
● Market awareness
● Customer satisfaction
● Flexibility and scalability

CAUTIONS
● Data access and preparation
● High technical bar for use
● Visualization and data exploration
● Sales execution

https://www.gartner.com/doc/reprints?id=1-3TKPVG1&ct=170215&st=sb

Page 6: H20 - Thirst for Machine Learning


H2O INTRODUCTION - FEATURES

● H2O Eco-System Benefits:
  ○ Scalable to massive datasets on large clusters, fully parallelized
  ○ Low-latency Java (“POJO”) scoring code is auto-generated
  ○ Easy to deploy on a laptop, a server, a Hadoop cluster, a Spark cluster or HPC
  ○ APIs include R, Python, Flow, Scala, Java, JavaScript, REST

● Regularization techniques: Dropout, L1/L2
● Early stopping, N-fold cross-validation, grid search
● Handling of categorical, missing and sparse data
● Gaussian/Laplace/Poisson/Gamma/Tweedie regression with offsets, observation weights, various loss functions
● Unsupervised mode for nonlinear dimensionality reduction, outlier detection
● Supported file types: CSV, ORC, SVMLight, ARFF, XLS, XLSX, Avro, Parquet
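To make a couple of these features concrete, here is a minimal sketch of a grid search over GBM hyperparameters with 5-fold cross-validation through the Python API; the file path and column names are purely illustrative:

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()
train = h2o.import_file("data/train.csv")        # CSV chosen here; Parquet, ORC, Avro, ... also work
train["target"] = train["target"].asfactor()     # treat the response as categorical

# Every model in the grid is evaluated with 5-fold cross-validation
grid = H2OGridSearch(model=H2OGradientBoostingEstimator(nfolds=5, seed=42),
                     hyper_params={"max_depth": [3, 5, 7],
                                   "learn_rate": [0.05, 0.1]})
grid.train(x=["x1", "x2", "x3"], y="target", training_frame=train)
print(grid.get_grid(sort_by="auc", decreasing=True))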

Page 7: H20 - Thirst for Machine Learning


H2O INTRODUCTION - ALGORITHMS

Page 8: H20 - Thirst for Machine Learning


H2O INTRODUCTION - ARCHITECTURE

Page 9: H20 - Thirst for Machine Learning


H2O INTRODUCTION - ARCHITECTURE

Page 10: H20 - Thirst for Machine Learning


H2O INTRODUCTION - H2O + TENSORFLOW

H2O can train Deep Neural Networks natively, or through its integration with TensorFlow. It is now possible to build very deep networks (from 5 to 1000 layers!) and to handle huge amounts of data, on the order of gigabytes or terabytes.

Another great advantage is the ability to exploit GPUs for computation.
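As a hedged sketch of the native path (an h2o session is assumed, and “train” with its “target” column is an illustrative H2OFrame):

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

# A native H2O feed-forward network with dropout and L1 regularization
dl = H2ODeepLearningEstimator(hidden=[200, 200, 200],
                              epochs=10,
                              activation="RectifierWithDropout",
                              input_dropout_ratio=0.1,
                              l1=1e-5)
dl.train(x=[c for c in train.columns if c != "target"], y="target", training_frame=train)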

Page 11: H20 - Thirst for Machine Learning


H2O INTRODUCTION - H2O + TENSORFLOW

With the release of TensorFlow, H2O embraced the wave of enthusiasm around the growth of Deep Learning.

Thanks to Deep Water, H2O lets us interact in a direct and simple way with Deep Learning backends such as TensorFlow, MXNet and Caffe.

Page 12: H20 - Thirst for Machine Learning


H2O INTRODUCTION - ARCHITECTURE

Page 13: H20 - Thirst for Machine Learning


H2O INTRODUCTION - H2O + SPARK

One of the first plugins developed for H2O was the one for Apache Spark, named Sparkling Water.

Binding to a rising open-source project such as Spark, with the computational power that distributed computing provides, has been a great driving force for the growth of H2O.

Page 14: H20 - Thirst for Machine Learning


H2O INTRODUCTION - H2O + SPARK

A Sparkling Water application runs as a regular job that can be started with spark-submit.

The Spark master then produces the DAG and distributes the execution across the workers, each of which loads the H2O libraries inside its Java process.
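A hedged sketch of what the driver program might look like (submitted with spark-submit together with the Sparkling Water package; the class and method names come from the PySparkling API, whose exact signatures depend on the Sparkling Water version):

from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("sparkling-water-demo").getOrCreate()
hc = H2OContext.getOrCreate(spark)    # starts the H2O services inside the Spark executors

df = spark.read.csv("data/train.csv", header=True, inferSchema=True)  # a regular Spark DataFrame
h2o_frame = hc.asH2OFrame(df)         # hand the data to H2O without leaving the cluster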

Page 15: H20 - Thirst for Machine Learning


H2O INTRODUCTION - H2O + SPARK

The Sparkling Water solution is, of course, certified for all the major Spark distributions: Hortonworks, Cloudera, MapR.

Databricks provides a Spark cluster in the cloud, and H2O works perfectly in this environment. H2O Rains with Databricks Cloud!

Page 16: H20 - Thirst for Machine Learning


AGENDA

● H2O Introduction
● GBM
● Demo

Page 17: H20 - Thirst for Machine Learning


GBM
Gradient Boosting Machine

Gradient Boosting Machine (GBM) is one of the most powerful techniques for building predictive models. It can be applied to classification or regression, so it is a supervised algorithm.

It is one of the most widely used algorithms in the Kaggle community, performing better than SVMs, Decision Trees and Neural Networks in a large number of cases.

https://www.quora.com/Why-does-Gradient-boosting-work-so-well-for-so-many-Kaggle-problems

GBM can be an optimal solution when the size of the dataset or the available computing power does not allow training a Deep Neural Network.
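A minimal, hypothetical sketch of training a GBM classifier with the H2O Python API (the dataset, frame and column names are made up for illustration):

import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()
frame = h2o.import_file("data/churn.csv")
frame["churn"] = frame["churn"].asfactor()              # classification needs a categorical response
train, test = frame.split_frame(ratios=[0.8], seed=42)

gbm = H2OGradientBoostingEstimator(ntrees=100, max_depth=5, learn_rate=0.1, seed=42)
gbm.train(x=[c for c in frame.columns if c != "churn"], y="churn", training_frame=train)

print(gbm.model_performance(test).auc())                # evaluate on the held-out split
predictions = gbm.predict(test)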

Page 18: H20 - Thirst for Machine Learning


GBM - KAGGLE

Kaggle is the biggest platform for Machine Learning competitions in the world.

https://www.kaggle.com/

At the beginning of March 2017, Google announced the acquisition of the Kaggle community.

Page 19: H20 - Thirst for Machine Learning


GBM - ORIGIN OF BOOSTING IDEA

A weak learner is an algorithm whose performance is only marginally better than random chance. Boosting was developed in the 1980s as the answer to the following question: “can we combine many weak learners to create a very strong one?”

Boosting revolves around filtering observations, focusing new learners on the samples that previous weak learners found difficult to classify.

Using this idea, we can train a succession of weak learning methods, each one focused on patterns that were misclassified previously.


Page 20: H20 - Thirst for Machine Learning


GBM - ADABOOST
The First Boosting Algorithm

The first algorithm in the boosting family to gain large popularity was Adaptive Boosting, or AdaBoost for short. In the original formulation, the weak learners are decision trees with a single split, called decision stumps.

AdaBoost works by weighting the observations, and sampling the dataset at each iteration with more emphasis on the instances that are difficult to classify. Stumps are added sequentially, each one trying to classify those instances better.

At every step, predictions are made by taking a majority vote of the weak learners’ outputs, weighted by a measure of their individual accuracy.
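For reference, the standard (discrete) AdaBoost update just described, with labels y_i ∈ {−1, +1}, weak learners h_m and uniform initial weights w_i^{(1)} = 1/N, can be written as:

\epsilon_m = \sum_{i=1}^{N} w_i^{(m)} \,\mathbf{1}\!\left[h_m(x_i) \neq y_i\right], \qquad \alpha_m = \tfrac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}

w_i^{(m+1)} \propto w_i^{(m)} \exp\!\left(\alpha_m \,\mathbf{1}\!\left[h_m(x_i) \neq y_i\right]\right), \qquad H(x) = \operatorname{sign}\!\left(\sum_{m=1}^{M} \alpha_m h_m(x)\right)

The weights of misclassified points grow, so the next stump concentrates on them, and the final prediction H(x) is exactly the accuracy-weighted majority vote mentioned above.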

Page 21: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Generalization of AdaBoost as Gradient Boosting

In later years, it was realized that AdaBoost can be derived formally as the minimization of a specific cost function with an exponential loss. This allowed the algorithm to be recast in a statistical framework.

Gradient Boosting Machines, later called just gradient boosting (or gradient tree boosting when using trees), are the natural generalization of AdaBoost to boosting with any loss function, following a gradient descent procedure:

GBM = Boosting + Gradient descent

This class of algorithms remains stage-wise additive, since new learners are added iteratively while the old ones are kept fixed. The generalization allows arbitrary differentiable loss functions to be used, providing more flexible algorithms that handle regression, multi-class classification and more.
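In the standard notation (added here for reference): at stage m the pseudo-residuals are the negative gradient of the loss evaluated at the current model, a new weak learner h_m is fitted to them, and the model is updated stage-wise:

r_{im} = -\left[\frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)}\right]_{F = F_{m-1}}, \qquad F_m(x) = F_{m-1}(x) + \nu\,\gamma_m\,h_m(x)

where \gamma_m is a step size chosen by line search and \nu is the learning rate (shrinkage).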

Page 22: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
How Gradient Boosting Works

Summarizing, a GBM requires specifying three different components:

● The loss function to be optimized with respect to the new weak learners.
● The specific form of the weak learner (e.g., stumps).
● A technique for adding the weak learners together so as to minimize the loss function.

Page 23: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Loss Function

The loss function determines the behavior of the algorithm.

The only requirement is differentiability, in order to allow gradient descent on it. Although you can define arbitrary losses, in practice only a handful are used. For example, regression may use a squared error and classification may use logarithmic loss.
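For concreteness (standard definitions, not taken from the slides): squared error for regression, and logarithmic loss for binary classification with p = \sigma(F):

L(y, F) = \tfrac{1}{2}\,(y - F)^2, \qquad L(y, F) = -\big[\,y\log p + (1-y)\log(1-p)\,\big], \quad p = \frac{1}{1 + e^{-F}}, \; y \in \{0, 1\}

Both are differentiable with respect to the model output F, as required.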

Page 24: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Weak Learner

In H2O, the weak learners are implemented as decision trees, making this an instance of decision tree boosting. In order to allow the addition of their outputs, regression trees (which output real values) are used.

When building each decision tree, the algorithm iteratively selects a split point in a greedy fashion, based on a measure of “purity” of the dataset, in order to minimize the loss. It is possible to increase the depth of the trees to obtain more flexible decision boundaries.

Conversely, to limit overfitting we can constrain the topology of the trees by, e.g., limiting their depth, the number of splits, or the number of leaf nodes.
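For example, a hedged sketch of constraining the weak learners in H2O’s GBM (an existing H2OFrame named “train” with a “target” response column is assumed):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Shallow trees with at least 20 observations per leaf keep each weak learner simple
gbm = H2OGradientBoostingEstimator(ntrees=200, max_depth=3, min_rows=20, seed=42)
gbm.train(x=[c for c in train.columns if c != "target"], y="target", training_frame=train)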

Page 25: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Additive Model

Gradient descent is a generic iterative technique to minimize objective functions. At each iteration, the gradient of the loss function (e.g., the error on the training set) is computed, and it is used to choose a set of parameters that decreases its value.

In a GBM, the optimization problem is formulated in terms of functions such as trees (functional optimization), making it relatively hard in general. The basic idea is to approximate this gradient using only its values on our training points.

In a GBM with squared loss, the resulting algorithm is extremely simple: at each step we train a new tree on the “residual errors” with respect to the previous weak learners. This can be seen as a gradient descent step with respect to our loss, where all previous weak learners are kept fixed and the gradient is approximated. This generalizes easily to different losses.
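A toy illustration of this residual-fitting view with squared loss (using scikit-learn regression trees purely for readability; this is not H2O’s distributed implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbm_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Toy gradient boosting with squared loss: each tree is fitted to the current residuals."""
    f0 = np.mean(y)                       # initial constant prediction
    prediction = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction        # negative gradient of 1/2*(y - F)^2 w.r.t. F
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gbm_predict(X, f0, trees, learning_rate=0.1):
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)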

Page 26: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Output and Stop Condition

The output of the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model. In particular, we associate a different weighting parameter with each decision region of the newly constructed tree. This is done by solving a new optimization problem with respect to these weights.

A fixed number of trees is added, or training stops once the loss reaches an acceptable level or no longer improves on an external validation dataset.
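In H2O this stop condition can be expressed as early stopping on a validation frame, roughly as follows (a sketch; “train” and “valid” are assumed H2OFrames with a categorical “target” column):

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Stop adding trees when logloss on the validation frame has not improved by at least
# 0.1% over 5 consecutive scoring rounds (the model is scored every 10 trees)
gbm = H2OGradientBoostingEstimator(ntrees=1000, learn_rate=0.05,
                                   score_tree_interval=10,
                                   stopping_rounds=5,
                                   stopping_metric="logloss",
                                   stopping_tolerance=1e-3,
                                   seed=42)
gbm.train(x=[c for c in train.columns if c != "target"], y="target",
          training_frame=train, validation_frame=valid)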

Page 27: H20 - Thirst for Machine Learning


GBM - GRADIENT BOOSTING
Improvements to Basic Gradient Boosting

Gradient boosting is a greedy algorithm and can overfit a training dataset quickly.

It can benefit from regularization methods that penalize various parts of the algorithm and generally improve its performance by reducing overfitting.

There are four enhancements to basic gradient boosting (the first three are sketched below):
● Tree Constraints
● Learning Rate
● Stochastic Gradient Boosting
● Penalized Learning (L1 or L2 regularization of the regression tree outputs)
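A hypothetical sketch of the first three knobs as H2O GBM parameters; the fourth (penalized leaf outputs) is typically found in XGBoost-style implementations. “train” and its “target” column are illustrative:

from h2o.estimators.gbm import H2OGradientBoostingEstimator

# Tree constraints, shrinkage and stochastic gradient boosting combined in one model
gbm = H2OGradientBoostingEstimator(ntrees=500,
                                   max_depth=4,          # tree constraint: shallow trees
                                   learn_rate=0.05,      # learning rate (shrinkage)
                                   sample_rate=0.8,      # row subsampling for each tree
                                   col_sample_rate=0.8,  # column subsampling at each split
                                   seed=42)
gbm.train(x=[c for c in train.columns if c != "target"], y="target", training_frame=train)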

Page 28: H20 - Thirst for Machine Learning


AGENDA

● H2O Introduction
● GBM
● Demo

Page 29: H20 - Thirst for Machine Learning


Q&A

Page 30: H20 - Thirst for Machine Learning

Copyright 2015 CATENATE Group – All rights reserved