Sparkling Water, ASK CRAIG

ML + H2O

Alex Tellez & Michal Malohlava

www.h2o.ai


THE RED PILL (SPARK + ML)

Finally, ONE TO RULE THEM ALL!

1. Scrape & Collect Data

2. Cleanse Data + Feature Extraction / Engineering

3. Build Machine Learning Models + Iterate

4. Throw More Data to Improve Model

5. Deploy Model(s) in Real-Time

THE BLUE PILL (H2O.AI)

What is H2O? (water, duh!)

It is ALSO an open-source, distributed and parallel predictive engine for machine learning.

What makes H2O different?

Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts

WHY NOT BOTH PILLS?!

Build smarter applications USING BOTH in harmony within the Spark Ecosystem !!!

Convert Spark RDDs → H2O RDDs for Machine Learning

LET’S BUILD AN APP!

Task: Predict the job category from a Craigslist Ad Title

ML WORKFLOW

1. Perform Feature Extraction on Words + Munging

2. Run Word2Vec algo (MLlib) on Job Title words

3. Create “title vectors” from individual word vectors for each job title

4. Pass the Spark RDD → H2O RDD for ML in Flow

5. Run H2O GBM algorithm on H2O RDD

6. Create Spark Streaming Application + Score on new data

1. TEXT MUNGING

Example: “Site Supervisor and Pre K Teachers Needed Now!!!”

Post Tokenization: Seq(site, supervisor, pre, teachers, needed)

val tokens = jobTitles.map(line => token(line))
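The slides do not show the body of `token`. A minimal sketch that reproduces the example above, assuming the function lowercases the title, strips punctuation and digits, and drops one-letter tokens and stop words (the stop-word list here is hypothetical, not the one used in the talk):

```scala
object TitleTokenizer {
  // Hypothetical stop-word list; the talk does not show the real one.
  val stopWords: Set[String] =
    Set("a", "an", "and", "for", "in", "of", "or", "the", "to", "now")

  // Lowercase, replace anything that is not a letter or whitespace with a space,
  // split on whitespace, then drop one-letter tokens and stop words.
  def token(line: String): Seq[String] =
    line.toLowerCase
      .replaceAll("[^a-z\\s]", " ")
      .split("\\s+")
      .filter(w => w.length >= 2 && !stopWords.contains(w))
      .toSeq
}
```

On a Spark RDD of titles this would be applied exactly as in the line above: `jobTitles.map(line => TitleTokenizer.token(line))`.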

Next: Apply Spark’s Word2Vec model to each word

2. WORD2VEC

Simply: A mathematical way to represent a single word as a vector of numbers. These vector ‘representations’ encode information about a given word (i.e. its meaning)

Post Tokenization: Seq(site, supervisor, pre, teachers, needed)

Post Word2Vec Results:

site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

needed, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

BUT THAT’S ON WORDS!

Post Word2Vec Results:

site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

WE NEED TITLE VECTORS BASED ON ALL THE WORDS!

Averaging word vectors to make ‘Title Vectors’

v(King) - v(Man) + v(Woman) ~ v(Queen)
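The analogy above is plain vector arithmetic followed by a nearest-neighbour lookup. A toy sketch, assuming invented 2-dimensional vectors (axes loosely "royalty" and "gender") rather than real Word2Vec output, and using cosine similarity to pick the closest word:

```scala
object WordAnalogy {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def sub(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x - y }

  // Cosine similarity; returns 0.0 if either vector has zero length.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  // Toy vectors invented for illustration only.
  val toy: Map[String, Vec] = Map(
    "king"  -> Array(1.0, 1.0),
    "man"   -> Array(0.0, 1.0),
    "woman" -> Array(0.0, -1.0),
    "queen" -> Array(1.0, -1.0)
  )

  // Word whose vector has the highest cosine similarity to the target.
  def nearest(target: Vec, vectors: Map[String, Vec]): String =
    vectors.maxBy { case (_, v) => cosine(target, v) }._1

  // v(king) - v(man) + v(woman), then look up the closest word.
  val answer: String =
    nearest(add(sub(toy("king"), toy("man")), toy("woman")), toy)
}
```

With these toy vectors the arithmetic lands exactly on the vector for "queen"; with real embeddings the match is only approximate, hence the "~" on the slide.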

3. TITLE VECTORS

In Steps:

1. Sum the word2vec vectors in a given title

2. Divide this sum by # of words in a given title

Result: ~ Average vector for a given title of N words

site, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

+

supervisor, mllib.linalg.vector[0.456, 0.123, 0.678…….0.987]

Divide by Total Words (post tokenization)

~ (site supervisor….needed), [0.998, 0.349, 0.621…….0.915]
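The two steps above (sum, then divide by the token count) can be sketched in plain Scala. This is an illustration under stated assumptions, not the talk's code: word vectors are held in an ordinary `Map` instead of an MLlib model, and a word missing from the map is treated as a zero vector. The names `titleVector`, `wordVectors` and `dim` are hypothetical.

```scala
object TitleVectors {
  type Vec = Array[Double]

  def add(a: Vec, b: Vec): Vec = a.zip(b).map { case (x, y) => x + y }
  def scale(a: Vec, k: Double): Vec = a.map(_ * k)

  // Average the word vectors of a tokenized title.
  // Assumption: unknown words contribute a zero vector, and the divisor is
  // the total number of tokens in the title (post tokenization).
  def titleVector(tokens: Seq[String], wordVectors: Map[String, Vec], dim: Int): Vec =
    if (tokens.isEmpty) Array.fill(dim)(0.0)
    else {
      val sum = tokens
        .map(w => wordVectors.getOrElse(w, Array.fill(dim)(0.0)))
        .foldLeft(Array.fill(dim)(0.0))(add)
      scale(sum, 1.0 / tokens.size)
    }
}
```

For example, with `site -> [1.0, 2.0]` and `supervisor -> [3.0, 4.0]`, the title vector for `Seq("site", "supervisor")` is `[2.0, 3.0]`.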

4. PASS SPARK RDD TO H2O

OPEN H2O FLOW!

5. BUILD A MODEL!

80% ACCURACY - DEFAULT!

Algo: Gradient Boosting Machine

# Trees: 50

# Bins: 20

Depth: 5

(ALL DEFAULT VALUES)

~ 20% Error Rate

6. SPARK STREAMING + DEPLOYMENT

Create Spark Streaming App to read in new Job Titles

a) Create a Spark Streaming Producer - Reads data from a file & generates events in real time, for which we will predict the category.

APP ARCHITECTURE

[Architecture diagram; labels recovered from the slide:]

Training: Craigslist jobs, Word2Vec, Train a model, Word2Vec Model, GBM Model, Re-train the model

Scoring: Posting job title (“HIRING Painting CONTRACTORS NOW!!!”), Stream, Categorize a job title, Prediction = “Labor”
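The scoring path for one incoming title can be sketched as a chain of three functions: tokenize, average the word vectors, then ask the trained model for a category. The sketch below is plain Scala with no Spark or H2O dependency. The `Classifier` trait is a hypothetical stand-in for the trained GBM, and the tokenizer is a simplified assumption.

```scala
object AskCraigScoring {
  type Vec = Array[Double]

  // Hypothetical stand-in for the trained GBM: maps a title vector to a category.
  trait Classifier { def predict(features: Vec): String }

  // Simplified tokenizer (assumption): lowercase, letters only, drop 1-letter tokens.
  def tokenize(title: String): Seq[String] =
    title.toLowerCase.split("[^a-z]+").filter(_.length >= 2).toSeq

  // Average of the word vectors; words missing from the map count as zero vectors.
  def titleVector(tokens: Seq[String], wordVectors: Map[String, Vec], dim: Int): Vec = {
    val sum = Array.fill(dim)(0.0)
    for (w <- tokens; v <- wordVectors.get(w); i <- 0 until dim) sum(i) += v(i)
    if (tokens.isEmpty) sum else sum.map(_ / tokens.size)
  }

  // Full scoring path for one job title arriving on the stream.
  def categorize(title: String,
                 wordVectors: Map[String, Vec],
                 dim: Int,
                 model: Classifier): String =
    model.predict(titleVector(tokenize(title), wordVectors, dim))
}
```

In a streaming application, `categorize` would be the function mapped over each incoming event. Keeping it free of framework types makes it easy to test in isolation.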

“ASK CRAIG” LIVE DEMO!

END-TO-END

In JUST 25 minutes…we:

1. Performed sophisticated feature extraction + engineering

2. Passed a Spark RDD → H2O RDD for ML

3. Created a Spark Stream to read in new data

4. Built a GBM to classify titles w/ 80% accuracy

5. “Productionalized” H2O + Spark MLlib model to score on new data

So happy I took both pills!

TRY SPARKLING WATER!!

Download @ h2o.ai

Coming Soon: Release 1.4 for Spark 1.4!

NEW GUI! H2O FLOW

Meetup: Silicon Valley Big Data Science
