H2O Rains with Databricks Cloud
Michal Malohlava @mmalohlava
Meetup 2016/02/04, SF
Who Am I?
Background
• PhD in CS from Charles University in Prague, Czech Republic
• Postdoc at Purdue University experimenting with algos for large-scale computation
• Now SW engineer at H2O.ai
Experience with domain-specific languages, distributed systems, software engineering,
and big data.
H2O.ai
H2O team
Co-Founders: Sri Ambati, Cliff Click
Scientific Advisory Council: Stephen Boyd, Rob Tibshirani, Trevor Hastie
H2O
Open-Source In-Memory Data Science Platform
• Highly optimized Java code (in-house)
• Distributed in-memory K-V store and map/reduce computation framework
• Data parser (HDFS, S3, NFS, HTTP, local drives, etc.)
• Read/write access to distributed data frames (R/Pandas-style)
• ML algos - Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles
• REST API with clients: interactive UI, R, Python
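As a rough illustration of the map/reduce idea mentioned in the list above, the sketch below computes a column mean by summarising fixed-size chunks independently and then combining the partial results. This is plain Python and not H2O's API: the function names (`split_into_chunks`, `map_chunk`, `combine`, `chunked_mean`) and the chunk size are hypothetical, and everything runs in one process, whereas in a distributed setting the chunks would live on different nodes.

```python
from functools import reduce


def split_into_chunks(values, chunk_size):
    """Cut a column into consecutive chunks (stand-ins for distributed chunks)."""
    return [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]


def map_chunk(chunk):
    """Map step: summarise one chunk as a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(left, right):
    """Reduce step: merge two partial (sum, count) pairs."""
    return (left[0] + right[0], left[1] + right[1])


def chunked_mean(values, chunk_size=3):
    """Mean of a column computed from per-chunk partial results."""
    partials = [map_chunk(c) for c in split_into_chunks(values, chunk_size)]
    total, count = reduce(combine, partials, (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of an empty column")
    return total / count


# Toy column, made up for illustration only.
column = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
mean = chunked_mean(column)
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what allows the reduce step to run as a tree across nodes.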
H2O + Spark = Sparkling Water
Spark
• Open-source distributed execution platform
• User-friendly API for data transformation based on RDDs, DataFrames (from 1.4) and DataSets (from 1.6)
• Platform components: SQL, MLlib, text mining, Avro, Redshift, Kinesis
• Easily extendable by 3rd party packages
• Interactive shell
Current release: 1.6
Supported releases: 1.3, 1.4, 1.5
Databricks
Databricks
• founded by the creators of Apache Spark
• still contributes 75% of the code to the Spark project
• cloud platform for running Spark in your AWS account
Databricks Platform
• integrated collaborative data science workspace
• notebook interface inspired by IPython and Zeppelin but purpose-built for Spark
• self-service cluster manager and job scheduler for production Spark workloads
Sparkling Water
Provides
Transparent integration of H2O with Spark ecosystem
Transparent use of H2O data structures and algorithms with Spark API
Platform for building Smarter Applications
Excels in existing Spark workflows requiring advanced Machine Learning algorithms
Functionality missing in H2O can be replaced by Spark and vice versa
How to use Sparkling Water?
Model Building
[Diagram: Data Source -> Data munging -> Modelling (Deep Learning, GBM, DRF, GLM, GLRM, K-Means, PCA, CoxPH, Ensembles) -> Prediction processing]
Data Munging
[Diagram: Data Source -> Data load/munging/exploration -> Modelling]
Stream processing
[Diagram with two parts. Off-line model training: Data Source, Data munging, Modelling, Export model in a binary format or as code. Stream processing: Data Stream, Spark Streaming/Storm, Model prediction. The two parts are linked by "Deploy the model".]
What is inside?
[Diagram: Sparkling Water on Databricks. The driver node runs the Scala/Py main program with SparkContext and H2OContext; each worker node runs a Spark executor with H2O services.]
[Diagram: Spark cluster of Spark executors with H2O services. A data source can be loaded as a Spark DataFrame or as an H2OFrame; h2oContext.asH2OFrame and h2oContext.asDataFrame convert between the two.]
DEMO Time!
What do we need?
Databricks account (14 day free trial at www.databricks.com)
AWS account
Sparkling Water coordinates: ai.h2o:sparkling-water-examples_2.10:1.5.10
And some cool machine learning idea!
OR
Detect spam text messages
Data sample
Goal
For a given text message,
identify whether it is spam or not
Machine Learning Workflow
1. Extract data
2. Transform, tokenize messages
3. Build TF-IDF model
4. Create and evaluate Deep Learning model
5. Use the model to detect spam
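Steps 2 and 3 of the workflow can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used in the demo: the function names (`tokenize`, `build_idf`, `tf_idf`) are hypothetical, the three messages are made-up toy data rather than the demo's data set, and the smoothed IDF formula log((N + 1) / (df + 1)) is one common choice among several.

```python
import math
import re
from collections import Counter


def tokenize(message):
    """Step 2: lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z']+", message.lower())


def build_idf(tokenized_docs):
    """Step 3a: inverse document frequency per term, smoothed to avoid division by zero."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(tokens, idf):
    """Step 3b: sparse TF-IDF vector (term -> weight) for one tokenized message."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


# Toy messages, made up for illustration only.
messages = [
    "Free entry in a weekly prize draw, text WIN now",
    "Are we still meeting for lunch today",
    "WIN a free phone, reply now",
]
tokenized = [tokenize(m) for m in messages]
idf = build_idf(tokenized)
vectors = [tf_idf(tokens, idf) for tokens in tokenized]
```

Terms that appear in fewer messages (such as "lunch" here) receive a higher IDF weight than terms shared across messages (such as "free"). The resulting vectors are the kind of numeric features a classifier in step 4 would be trained on.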
Check out H2O.ai Training Books
http://learn.h2o.ai/
Check out H2O.ai Blog
http://h2o.ai/blog/
Check out H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata
Check out GitHub
https://github.com/h2oai/sparkling-water
Meetups
https://meetup.com/
More info
Learn more at h2o.ai Follow us at @h2oai
Thank you!
Sparkling Water is an
open-source ML application platform
combining the power of Spark and H2O