View
7.196
Download
2
Category
Preview:
Citation preview
OCT 29th, 2014 WEBINAR
H2O
Fast. Scalable. Machine LearningFor Smarter Applications
“Fluids are In, Animals are Out.”
~ Svetlana Sicular, Gartner
SpeakersJoel Horwitz
Joel is a caffeine, data, and laughter driven product strategist. He is an active community member having founded Bay Area Analytics, Tweets regularly @JSHorwitz, blogs regularly joelshorwitz.com and speaks regularly at industry events. Always eager to learn and lend a helping hand makes him an invaluable asset to 0xdata.
Michal Malohlava
Michal is a geek, developer, Java, Linux, programming languages enthusiast developing software for over 10 years. He obtained PhD from the Charles University in Prague in 2012 and post-doc at Purdue University.
H2O World Register at http://www.0xdata.com/h2o-world
Time is the only non-renewable resource.
Speed Matters!
Today
• Why Are We Here
• Who We Are
• How Do We Do It
• Who We Work With
• What We Believe
• Demo and Q&A
A New Interpretation of Moore’s Law
“Like the physical universe, the digital universe is large - by 2020 containing nearly as many digital bits as there are stars in the universe. It is doubling in size every two years, and by 2020 the digital universe -
the data we create and copy annually - will reach 44 zettabytes, or 44 trillion gigabytes.”
- IDC 2014
An Evolving Market to Meet the Demand
RDBMS MPP
BusinessIntelligence
Data Science
MachineLearning
H2O
DistributedFile System
Decreasing Cost of Data is Driving Demand
$ / GB (-50% every 18 months)
Algorithm Sophistication
1970 1980 1990 2000 2010 2020
H2O
H2O is the First DedicatedMachine Learning Open Source PlatformH2O is for application developers and analysts who
need scalable and fast machine learning. H2O is an
open source predictive analytics platform. Unlike
traditional analytics tools, H2O provides a combination
of extraordinary math, a high performance parallel
architecture, and unrivaled ease of use.
Who Are We?
H2O
H2O Awards and Accolades
• Top R Project of UserR Conference 2014
• Fortune Big Data All-Stars 2014, Arno Candel
• 100+ Meetups
• 6000+ Users
H2O is Built for Speed and Scale
• OpenSource
• REST API
• Native R Support
• NanoFastTM Scoring Engine
• Sophisticated Algorithms
H2O Seamlessly Integrates with Your Workflow
• 20X Faster Imports and 3X Compression w/ .hex format.
• 4 Billion Row Regression in Seconds.
• Deploy in POJO or with our REST API
Code is incomplete without Community!
Open Source Drives Innovation.
Law of Large Numbers Triumphs!
Every Generation Needs to Invent Its Own Math.
ML is the new SQL!
What do our customers say about us?"The platform can generate Jar files to deploy models into production. This alone is a milestone!" - Hassan Namarvar, ShareThis
“I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.” – Nachum Shacham, Paypal
“I analyzed 1 million rows training set, fitting a logistic regression with elastic penalty, and doing a grid search on parameters with 10-fold cross validation for each parameter combination… doing this analysis was a breeze… orders of magnitude faster than R.” - Antonio Molins, Netflix
“Never have we had such a quick, simple, scalable and cost effective deployment solution for predictive modeling.” – Lou Carvalheira, Cisco
AdvertisingBetter Conversions
Conversion ReachBrand ROI
Overall, I would say that the H2O platform is the most elegant open source in-memory ML engine.
~ Hassan Namarvar, Principal Data Scientist
FraudBetter Detections
Purchase
I have to give credit to H2O. They have a very complete way of showing which algorithm is the best.
~ Nachum Shacham, Principal Data Scientist
Shopping Theft Passwords
MarketingBetter Spend
ROI
H2O has established a new equilibrium point for performance,accuracy and cost for statistics and machine learning.
~ Lou Carvalheira, Principal Data Scientist
Network Segments Measure
Select Customers Powered by H2O
Sparkling Water“Killer App for Spark”
@hexadata & @mmalohlavapresents
Memory efficient
Performance of computation
Machine learning algorithms
Parser, GUI, R-interface
User-friendly API
Large and active community
Platform components - SQL
Multitenancy
Sparkling Water
+RDD
immutableworld
DataFramemutableworld
Sparkling Water
RDD DataFrame
Sparkling Water
Provides
Transparent integration into Spark ecosystem
Pure H2ORDD encapsulating H2O DataFrame
Transparent use of H2O data structures and algorithms with Spark API
Excels in Spark workflows requiring advanced Machine Learning algorithms
Sparkling Water Design
spark-submitSpark
MasterJVM
SparkWorker
JVM
SparkWorker
JVM
SparkWorker
JVM
Sparkling Water Cluster
Spark Executor JVM
H2O
Spark Executor JVM
H2O
Spark Executor JVM
H2O
Contains applicationand Sparkling Water
classes
Sparkling Appjar file
implements
Data Distribution
H2O
H2O
H2O
Sparkling Water Cluster
Spark Executor JVMDataSource
(e.g. HDFS)
H2ORDD
Spark Executor JVM
Spark Executor JVM
SparkRDD
RDDs and DataFramesshare same memory
space
Demo Time
Flight delay prediction
“Build a model using weather and flight data to predict
delays of flights arriving to Chicago O’Hare International
Airport”
Example Outline
Load & Parse CSV data from 2 data sources
Use Spark API to filter data, do SQL query for join
Create a regression model
Use model for delay prediction
Plot residual plot from R
Sparkling Water Requirements
Linux or Mac OS X
Oracle Java 1.7
Downloadhttp://0xdata.com/download/
Install and Launch
Unpack zip file
and
Point SPARK_HOME to your Spark installation
and
Launch h2o-examples/sparkling-shell
What is Sparkling Shell?
Standard spark-shell
With additional Sparkling Water classes
export MASTER=“local-cluster[3,2,1024]”
spark-shell \ —-jars sparkling-water.jar JAR containing
Sparkling Water
Spark Masteraddress
Lets play with Sparkling shell…
Create H2O Client
import org.apache.spark.h2o._import org.apache.spark.examples.h2o._
val h2oContext = new H2OContext(sc).start(3)import h2oContext._
Regular Spark contextprovided by Spark shell
Size of demandedH2O cloud
Contains implicit utility functions Demo specificclasses
Is Spark Running?Go to http://localhost:4040
Is H2O running?http://localhost:54321/steam/index.html
Load Data #1Load weather data into RDD
val weatherDataFile = “examples/smalldata/weather.csv"
val wrawdata = sc.textFile(weatherDataFile,3).cache()
val weatherTable = wrawdata.map(_.split(“,")).map(row => WeatherParse(row)).filter(!_.isWrongRow())
Regular Spark API
Ad-hoc Parser
Weather Data
case class Weather( val Year : Option[Int], val Month : Option[Int], val Day : Option[Int], val TmaxF : Option[Int], // Max temperatur in F val TminF : Option[Int], // Min temperatur in F val TmeanF : Option[Float], // Mean temperatur in F val PrcpIn : Option[Float], // Precipitation (inches) val SnowIn : Option[Float], // Snow (inches) val CDD : Option[Float], // Cooling Degree Day val HDD : Option[Float], // Heating Degree Day val GDD : Option[Float]) // Growing Degree Day
Simple POSO to hold one row of weather data
Load Data #2Load flights data into H2O frame
import java.io.File
val dataFile = “examples/smalldata/allyears2k_headers.csv.gz"
val airlinesData = new DataFrame(new File(dataFile))
Where is the data?Go to http://localhost:54321/steam/index.html
Use Spark API for Data Filtering
// Create RDD wrapper around DataFrameval airlinesTable : RDD[Airlines]
= toRDD[Airlines](airlinesData)
// And use Spark RDD API directlyval flightsToORD = airlinesTable
.filter( f => f.Dest == Some(“ORD") )
Regular SparkRDD call
Create a cheap wrapperaround H2O DataFrame
Use Spark SQL to Data Join
import org.apache.spark.sql.SQLContext// We need to create SQL context val sqlContext = new SQLContext(sc)import sqlContext._
flightsToORD.registerTempTable("FlightsToORD")weatherTable.registerTempTable("WeatherORD")
Join Data based on Flight Date
val bigTable = sql( """SELECT | f.Year,f.Month,f.DayofMonth, | f.CRSDepTime,f.CRSArrTime,f.CRSElapsedTime, | f.UniqueCarrier,f.FlightNum,f.TailNum, | f.Origin,f.Distance, | w.TmaxF,w.TminF,w.TmeanF, | w.PrcpIn,w.SnowIn,w.CDD,w.HDD,w.GDD, | f.ArrDelay | FROM FlightsToORD f | JOIN WeatherORD w | ON f.Year=w.Year AND f.Month=w.Month | AND f.DayofMonth=w.Day""".stripMargin)
Launch H2O Algorithmsimport hex.deeplearning._import hex.deeplearning.DeepLearningModel
.DeepLearningParameters
// Setup deep learning parametersval dlParams = new DeepLearningParameters()dlParams._train = bigTabledlParams._response_column = 'ArrDelaydlParams._classification = false
// Create a new model builderval dl = new DeepLearning(dlParams)
val dlModel = dl.train.get
Result of SQL query
Blocking call
Make a prediction
// Use model to score dataval prediction = dlModel.score(result)(‘predict)
// Collect predicted values via RDD APIval predictionValues = toRDD[DoubleHolder](prediction) .collect .map ( _.result.getOrElse("NaN") )
Generate Residuals Plot# Import H2O library and initialize H2O clientlibrary(h2o)
h = h2o.init()
# Fetch prediction and actual data, use remembered keyspred = h2o.getFrame(h, "dframe_b5f449d0c04ee75fda1b9bc865b14a69")act = h2o.getFrame (h, "frame_rdd_14_b429e8b43d2d8c02899ccb61b72c4e57")
# Select right columnspredDelay = pred$predictactDelay = act$ArrDelay
# Make sure that number of rows is same nrow(actDelay) == nrow(predDelay)
# Compute residuals residuals = predDelay - actDelay
# Plot residuals compare = cbind(
as.data.frame(actDelay$ArrDelay), as.data.frame(residuals$predict))
plot( compare[,1:2] )
Referencesof data
More infoCheckout 0xdata Blog for Sparkling Water tutorials
http://0xdata.com/blog/
Checkout 0xdata Youtube Channel
https://www.youtube.com/user/0xdata
Checkout github
https://github.com/0xdata/sparkling-water
Learn more about H2O at0xdata.com
or
Thank you!
Follow us at @hexadata
neo> for r in sparkling-water; do git clone “git@github.com:0xdata/$r.git”done
Recommended