Paco Nathan
Concurrent, Inc.
San Francisco, CA
@pacoid

“Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”

Tuesday, 25 June 13

Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Hadoop Summit 2013 talk: “Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop”

[Flow diagram labels: employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

Cascading: background
The Workflow Abstraction
PMML: Predictive Model Markup
Pattern: PMML in Cascading
PMML for Customer Experiments
Ensemble Models with Pattern
Workflow Design Pattern

Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.

Wensel was following the Nutch open source project – where Hadoop started.

Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging new open source technology.

Cascading – functional programming

Key insight: MapReduce is based on functional programming, with roots going back to LISP (which dates from the late 1950s). Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages JVM and Java-based tools without any need to create new languages

• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters
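To make the functional framing concrete, here is a minimal sketch in plain Java (standard library only, not the Cascading API) of a word count written as a pipeline of map, group, and reduce steps. The class name `FunctionalWordCount` and the sample input are illustrative assumptions, not part of the talk.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration: plain Java streams, not Cascading or Hadoop.
final class FunctionalWordCount {

    // "map" step: split each line into lower-cased tokens;
    // "reduce" step: group identical tokens and count them.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("[ \\[\\]\\(\\),.]+")))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(token -> token, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "A rain shadow is a dry area",
            "a dry area on the lee side");
        System.out.println(count(lines));
    }
}
```

The pipeline has no shared mutable state between steps, which is the property that lets a planner split the same logic into parallel map and reduce tasks.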

[Diagram labels: Hadoop Cluster, Data Workflow, source tap, sink tap, trap tap, customer profile DBs, Customer Prefs, Logs, Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling, PMML]

Cascading – definitions

• a pattern language for Enterprise Data Workflows

• simple to build, easy to test, robust in production

• design principles ⟹ ensure best practices at scale

Cascading – usage

• Java API, DSLs in Scala, Clojure, Jython, JRuby, Groovy, ANSI SQL

• ASL 2 license, GitHub src, http://conjars.org

• 5+ yrs production use, multiple Enterprise verticals

Cascading – integrations

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera

• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.

• serialization: Avro, Thrift, Kryo, JSON, etc.

• topologies: Apache Hadoop, tuple spaces, local mode

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

workflow abstraction addresses:

• staffing bottleneck

• system integration

• operational complexity

• test-driven development

The Workflow Abstraction

Enterprise Data Workflows

Let’s consider a “strawman” architecture for an example app… at the front end

LOB use cases drive demand for apps

Enterprise Data Workflows

Same example… in the back office

Organizations have substantial investments in people, infrastructure, process

Enterprise Data Workflows

Same example… the heavy lifting!

“Main Street” firms are migrating workflows to Hadoop, for cost savings and scale-out

Cascading workflows – taps

• taps integrate other data frameworks, as tuple streams

• these are “plumbing” endpoints in the pattern language

• sources (inputs), sinks (outputs), traps (exceptions)

• text delimited, JDBC, Memcached, HBase, Cassandra, MongoDB, etc.

• data serialization: Avro, Thrift, Kryo, JSON, etc.

• extend a new kind of tap in just a few lines of Java

schema and provenance get derived from analysis of the taps

Cascading workflows – taps

    String docPath = args[ 0 ];
    String wcPath = args[ 1 ];
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create source and sink taps
    Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
    Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

    // specify a regex to split "document" text lines into token stream
    Fields token = new Fields( "token" );
    Fields text = new Fields( "text" );
    RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
    // only returns "token"
    Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

    // determine the word counts
    Pipe wcPipe = new Pipe( "wc", docPipe );
    wcPipe = new GroupBy( wcPipe, token );
    wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
      .addSource( docPipe, docTap )
      .addTailSink( wcPipe, wcTap );

    // write a DOT file and run the flow
    Flow wcFlow = flowConnector.connect( flowDef );
    wcFlow.writeDOT( "dot/wc.dot" );
    wcFlow.complete();

source and sink taps for TSV data in HDFS

Cascading workflows – topologies

• topologies execute workflows on clusters

• flow planner is like a compiler for queries

- Hadoop (MapReduce jobs)

- local mode (dev/test or special config)

- in-memory data grids (real-time)

• flow planner can be extended to support other topologies

blend flows in different topologies into the same app – for example, batch (Hadoop) + transactions (IMDG)

Cascading workflows – topologies

flow planner for Apache Hadoop topology

Cascading workflows – test-driven development

• assert patterns (regex) on the tuple streams

• adjust assert levels, like log4j levels

• trap edge cases as “data exceptions”

• TDD at scale:

1. start from raw inputs in the flow graph

2. define stream assertions for each stage of transforms

3. verify exceptions, code to remove them

4. when impl is complete, app has full test coverage

redirect traps in production to Ops, QA, Support, Audit, etc.
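The idea of asserting patterns on a tuple stream and trapping failures can be sketched in plain Java. This is a hedged illustration only: it does not use Cascading's own assertion or trap classes, and the class name `TrapSketch` and the record layout (numeric id, tab, label) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration in plain Java; not the Cascading API.
final class TrapSketch {

    final List<String> passed = new ArrayList<>();   // tuples that satisfy the assertion
    final List<String> trapped = new ArrayList<>();  // "data exceptions" diverted to the trap

    // Assert a regex pattern on each tuple; divert failures instead of aborting the run.
    void run(List<String> tuples, Pattern assertion) {
        for (String tuple : tuples) {
            if (assertion.matcher(tuple).matches()) {
                passed.add(tuple);
            } else {
                trapped.add(tuple);
            }
        }
    }

    public static void main(String[] args) {
        TrapSketch flow = new TrapSketch();
        // assumed record layout: numeric id, a tab, then a non-empty label
        flow.run(Arrays.asList("1\tgood", "oops", "2\talso good"), Pattern.compile("\\d+\\t.+"));
        System.out.println("passed=" + flow.passed + " trapped=" + flow.trapped);
    }
}
```

Keeping the trapped tuples in their own output is what allows them to be routed to a different team than the main results.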

Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[Flow diagram labels: Document Collection, Tokenize, Regex token, Scrub token, Stop Word List, HashJoin Left (RHS), GroupBy token, Count, Word Count; M (map) and R (reduce) phases]

Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java

In formal terms, this provides a pattern language

Pattern Language

structured method for solving large, complex design problems, where the syntax of the language ensures the use of best practices – i.e., conveying expertise

A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps – great for cross-team collaboration

Literate Programming

by Don Knuth

Literate Programming
Univ of Chicago Press, 1992

literateprogramming.com/

“Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.”

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

Business Process

by Edgar Codd

“A relational model of data for large shared data banks”
Communications of the ACM, 1970
dl.acm.org/citation.cfm?id=362685

rather than arguing between SQL vs. NoSQL… structured vs. unstructured data frameworks… this approach focuses on what apps do:

the process of structuring data

Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010)
Scalding in Scala (2012)

github.com/nathanmarz/cascalog/wiki
github.com/twitter/scalding/wiki

Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17 Forbes

forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Functional Programming for Big Data

WordCount with token scrubbing…

Apache Hive: 52 lines HQL + 8 lines Python (UDF)

compared to

Scalding: 18 lines Scala/Cascading

functional programming languages help reduce software engineering costs at scale, over time

Two Avenues to the App Layer…

[Chart axes: scale ➞, complexity]

Enterprise: must contend with complexity at scale every day…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

PMML: Predictive Model Markup

PMML – standard

• established XML standard for predictive model markup

• organized by Data Mining Group (DMG), since 1997 http://dmg.org/

• members: IBM, SAS, Visa, NASA, Equifax, MicroStrategy, Microsoft, etc.

• PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”

wikipedia.org/wiki/Predictive_Model_Markup_Language

PMML – model coverage

• Association Rules: AssociationModel element

• Cluster Models: ClusteringModel element

• Decision Trees: TreeModel element

• Naïve Bayes Classifiers: NaiveBayesModel element

• Neural Networks: NeuralNetwork element

• Regression: RegressionModel and GeneralRegressionModel elements

• Rulesets: RuleSetModel element

• Sequences: SequenceModel element

• Support Vector Machines: SupportVectorMachineModel element

• Text Models: TextModel element

• Time Series: TimeSeriesModel element

ibm.com/developerworks/industry/library/ind-PMML2/

PMML – vendor coverage

Pattern: PMML in Cascading

Pattern – model scoring

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML

• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.

• integrate with other libraries – Matrix API, etc.

• leverage PMML as another kind of DSL

cascading.org/pattern

Pattern – create a model in R

    ## train a RandomForest model
    f <- as.formula("as.factor(label) ~ .")
    fit <- randomForest(f, data_train, ntree=50)

    ## test the model on the holdout test set
    print(fit$importance)
    print(fit)

    predicted <- predict(fit, data)
    data$predicted <- predicted
    confuse <- table(pred = predicted, true = data[,1])
    print(confuse)

    ## export predicted labels to TSV
    write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

    ## export RF model to PMML
    saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

Pattern – capture model parameters as PMML

    <?xml version="1.0"?>
    <PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
     <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
      <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
      <Application name="Rattle/PMML" version="1.2.30"/>
      <Timestamp>2012-10-22 19:39:28</Timestamp>
     </Header>
     <DataDictionary numberOfFields="4">
      <DataField name="label" optype="categorical" dataType="string">
       <Value value="0"/>
       <Value value="1"/>
      </DataField>
      <DataField name="var0" optype="continuous" dataType="double"/>
      <DataField name="var1" optype="continuous" dataType="double"/>
      <DataField name="var2" optype="continuous" dataType="double"/>
     </DataDictionary>
     <MiningModel modelName="randomForest_Model" functionName="classification">
      <MiningSchema>
       <MiningField name="label" usageType="predicted"/>
       <MiningField name="var0" usageType="active"/>
       <MiningField name="var1" usageType="active"/>
       <MiningField name="var2" usageType="active"/>
      </MiningSchema>
      <Segmentation multipleModelMethod="majorityVote">
       <Segment id="1">
        <True/>
        <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
         <MiningSchema>
          <MiningField name="label" usageType="predicted"/>
          <MiningField name="var0" usageType="active"/>
          <MiningField name="var1" usageType="active"/>
          <MiningField name="var2" usageType="active"/>
         </MiningSchema>
    ...

Pattern – score a model, within an app

    public static void main( String[] args ) throws RuntimeException {
      String inputPath = args[ 0 ];
      String classifyPath = args[ 1 ];

      // set up the config properties
      Properties properties = new Properties();
      AppProps.setApplicationJarClass( properties, Main.class );
      HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

      // create source and sink taps
      Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
      Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

      // handle command line options
      OptionParser optParser = new OptionParser();
      optParser.accepts( "pmml" ).withRequiredArg();

      OptionSet options = optParser.parse( args );

      // connect the taps, pipes, etc., into a flow
      FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

      if( options.hasArgument( "pmml" ) ) {
        String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );

        PMMLPlanner pmmlPlanner = new PMMLPlanner()
          .setPMMLInput( new File( pmmlPath ) )
          .retainOnlyActiveIncomingFields()
          .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model

        flowDef.addAssemblyPlanner( pmmlPlanner );
      }

      // write a DOT file and run the flow
      Flow classifyFlow = flowConnector.connect( flowDef );
      classifyFlow.writeDOT( "dot/classify.dot" );
      classifyFlow.complete();
    }

Pattern – score a model, using pre-defined Cascading app

[Flow diagram labels: Customer Orders, PMML Model, Classify, Scored Orders, Assert, Failure Traps, GroupBy token, Count, Confusion Matrix; M (map) and R (reduce) phases]

cascading.org/pattern

Pattern – score a model, using pre-defined Cascading app

    ## run an RF classifier at scale
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml

    ## run an RF classifier at scale, assert regression test, measure confusion matrix
    hadoop jar build/libs/pattern.jar data/sample.tsv out/classify out/trap \
      --pmml data/sample.rf.xml --assert --measure out/measure

    ## run a predictive model at scale, measure RMSE
    hadoop jar build/libs/pattern.jar data/iris.lm_p.tsv out/classify out/trap \
      --pmml data/iris.lm_p.xml --rmse out/measure
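For readers unfamiliar with the RMSE measure named in the last command, here is a minimal sketch of the standard root-mean-square error formula in plain Java. It is not taken from the Pattern source, so how the `--rmse` option computes its result internally is not claimed here; the class name `Rmse` and the sample values are illustrative assumptions.

```java
// Hypothetical helper: the textbook RMSE definition, not Pattern's implementation.
final class Rmse {

    // sqrt( mean( (actual - predicted)^2 ) ) over paired observations
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sumSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sumSquares += diff * diff;
        }
        return Math.sqrt(sumSquares / actual.length);
    }

    public static void main(String[] args) {
        double[] actual = { 1.0, 2.0, 3.0 };
        double[] predicted = { 1.0, 2.0, 5.0 };
        System.out.println(rmse(actual, predicted));
    }
}
```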

Roadmap – existing algorithms for scoring

• Random Forest

• Decision Trees

• Linear Regression

• GLM

• Logistic Regression

• K-Means Clustering

• Hierarchical Clustering

• Multinomial

• Support Vector Machines (prepared for release)

also, model chaining and general support for ensembles

cascading.org/pattern

Roadmap – next priorities for scoring

• Time Series (ARIMA forecast)

• Association Rules (basket analysis)

• Naïve Bayes

• Neural Networks

algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user

cascading.org/pattern

Roadmap – top priorities for creating models at scale

• Random Forest

• Logistic Regression

• K-Means Clustering

• Association Rules

…plus all models which can be trained via sparse matrix factorization (TSQR => PCA, SVD least squares, etc.)

a wealth of recent research indicates many opportunities to parallelize popular algorithms for training models at scale on Apache Hadoop…

cascading.org/pattern

PMML for Customer Experiments

Experiments – comparing models

• much customer interest in leveraging Cascading and Apache Hadoop to run customer experiments at scale

• run multiple variants, then measure relative “lift”

• Concurrent runtime – tag and track models

the following example compares two models trained with different machine learning algorithms

this is exaggerated: one model has an important variable intentionally omitted to help illustrate the experiment

Experiments – Random Forest model

    ## train a Random Forest model
    ## example: http://mkseo.pe.kr/stats/?p=220

    f <- as.formula("as.factor(label) ~ var0 + var1 + var2")
    fit <- randomForest(f, data=data, proximity=TRUE, ntree=25)
    print(fit)
    saveXML(pmml(fit), file=paste(out_folder, "sample.rf.xml", sep="/"))

    OOB estimate of error rate: 14%
    Confusion matrix:
        0   1 class.error
    0  69  16   0.1882353
    1  12 103   0.1043478
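As a check on how the figures in that output relate, the sketch below recomputes the per-class error and the overall error from the four counts. It assumes the usual randomForest layout in which rows are true labels and columns are predicted labels; the class name `ConfusionRates` is hypothetical.

```java
// Hypothetical helper; the counts are copied from the R output above.
final class ConfusionRates {

    // m[trueLabel][predictedLabel] for a two-class problem with labels 0 and 1
    static double classError(long[][] m, int trueLabel) {
        long wrong = m[trueLabel][1 - trueLabel];
        long total = m[trueLabel][0] + m[trueLabel][1];
        return (double) wrong / total;
    }

    // share of all cases that were assigned the wrong label
    static double overallError(long[][] m) {
        long wrong = m[0][1] + m[1][0];
        long total = m[0][0] + m[0][1] + m[1][0] + m[1][1];
        return (double) wrong / total;
    }

    public static void main(String[] args) {
        long[][] m = { { 69, 16 }, { 12, 103 } };
        System.out.printf("class 0 error: %.7f%n", classError(m, 0));
        System.out.printf("class 1 error: %.7f%n", classError(m, 1));
        System.out.printf("overall error: %.2f%n", overallError(m));
    }
}
```

The overall error works out to 28 misclassified cases out of 200, which agrees with the 14% estimate reported by R.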

Experiments – Logistic Regression model

    ## train a Logistic Regression model (special case of GLM)
    ## example: http://www.stat.cmu.edu/~cshalizi/490/clustering/clustering01.r

    f <- as.formula("as.factor(label) ~ var0 + var2")
    fit <- glm(f, family=binomial, data=data)
    print(summary(fit))
    saveXML(pmml(fit), file=paste(out_folder, "sample.lr.xml", sep="/"))

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)   1.8524     0.3803   4.871 1.11e-06 ***
    var0         -1.3755     0.4355  -3.159  0.00159 **
    var2         -3.7742     0.5794  -6.514 7.30e-11 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

NB: this model has “var1” intentionally omitted
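To show what scoring with these coefficients amounts to, here is a minimal sketch of the logistic function applied to the fitted estimates. It illustrates the arithmetic only and is not Pattern's implementation; it assumes glm's usual convention that the fitted probability refers to the second factor level (label 1). The class name `LogisticScore` and the input values are hypothetical.

```java
// Hypothetical illustration of logistic scoring; coefficients copied from the R summary above.
final class LogisticScore {

    static final double INTERCEPT = 1.8524;
    static final double B_VAR0 = -1.3755;
    static final double B_VAR2 = -3.7742;

    // p = 1 / (1 + exp(-z)), where z is the linear predictor
    static double probability(double var0, double var2) {
        double z = INTERCEPT + B_VAR0 * var0 + B_VAR2 * var2;
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // assumed example inputs, not taken from the sample data set
        System.out.println(probability(0.0, 0.0));
        System.out.println(probability(1.0, 1.0));
    }
}
```

Both coefficients are negative, so larger values of var0 or var2 lower the fitted probability.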

Experiments – comparing results

• use a confusion matrix to compare results for the classifiers

• Logistic Regression has a lower “false negative” rate (5% vs. 11%); however, it has a much higher “false positive” rate (52% vs. 14%)

• assign a cost model to select a winner – for example, in an ecommerce anti-fraud classifier:

FN ∼ chargeback risk
FP ∼ customer support costs
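A cost model of this kind can be sketched as a weighted sum of the two error rates. The rates below are the ones quoted on the slide; the unit costs are made-up values for illustration, and weighting the rates directly assumes comparable numbers of positive and negative cases. The class name `CostModel` is hypothetical.

```java
// Hypothetical cost model; unit costs are illustrative assumptions, not figures from the talk.
final class CostModel {

    // weighted error cost: each error rate is multiplied by the cost of that kind of error
    static double cost(double fnRate, double fpRate, double fnCost, double fpCost) {
        return fnRate * fnCost + fpRate * fpCost;
    }

    // returns the name of the cheaper model under the given unit costs
    static String winner(double fnCost, double fpCost) {
        double lr = cost(0.05, 0.52, fnCost, fpCost); // Logistic Regression rates from the slide
        double rf = cost(0.11, 0.14, fnCost, fpCost); // Random Forest rates from the slide
        return lr < rf ? "Logistic Regression" : "Random Forest";
    }

    public static void main(String[] args) {
        System.out.println(winner(100.0, 5.0));  // chargebacks assumed far costlier than support
        System.out.println(winner(10.0, 10.0));  // both kinds of error assumed equally costly
    }
}
```

With these assumed costs the winner flips between the two scenarios, which is the point of choosing the cost model before choosing the classifier.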

Ensemble Models with Pattern

Two Cultures

“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”

Statistical Modeling: The Two Cultures
Leo Breiman, 2001
bit.ly/eUTh9L

in other words, seeing the forest for the trees…

this paper chronicled a sea change from data modeling practices (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization)

Why Do Ensembles Matter?

The World…per Data Modeling

The World…

Algorithmic Modeling

“The trick to being a scientist is to be open to using a wide variety of tools.” – Breiman

circa 2001: Random Forest, bootstrap aggregation, etc., yield dramatic increases in predictive power over earlier modeling such as Logistic Regression

major learnings from the Netflix Prize: the power of ensembles, model chaining, etc.

the problems at hand have become simply too big and too complex for ONE distribution, ONE model, ONE team…

Page 51: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Ensemble Models

Breiman: “a multiplicity of data models”

BellKor team: 100+ individual models in 2007 Progress Prize

while the process of combining models adds complexity (making it more difficult to anticipate or explain predictions), accuracy may increase substantially
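As a rough illustration of what "combining models" can mean, here is a minimal sketch that assumes the simplest possible scheme: unweighted majority voting over several classifiers. The class name `MajorityVote`, the three threshold rules, and the "yes"/"no" labels are all hypothetical; this is not how Pattern or the BellKor team combined their models.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration only: none of these names come from Pattern or Cascading.
final class MajorityVote {

    /**
     * Asks every model for a label and returns the label with the most votes.
     * On a tie, the label that reached the top count first wins.
     */
    static <X> String predict(List<Function<X, String>> models, X input) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Function<X, String> model : models) {
            String label = model.apply(input);
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                best = label;
                bestCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Three deliberately weak "models", each looking at a simple threshold.
        List<Function<double[], String>> models = List.of(
            x -> x[0] > 0.5 ? "yes" : "no",
            x -> x[1] > 10.0 ? "yes" : "no",
            x -> x[0] + x[1] > 8.0 ? "yes" : "no"
        );
        System.out.println(predict(models, new double[] {0.9, 3.0}));
        System.out.println(predict(models, new double[] {0.7, 12.0}));
    }
}
```

No single rule here is reliable on its own; the vote only reports what most of them agree on, which is also why the combined prediction is harder to explain than any one rule.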

Ensemble Learning: Better Predictions Through Diversity

Todd Holloway

ETech (2008)

abeautifulwww.com/EnsembleLearningETech.pdf

The Story of the Netflix Prize: An Ensembler's Tale

Lester Mackey

National Academies Seminar, Washington, DC (2011)

stanford.edu/~lmackey/papers/

Page 52: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

KDD 2013 PMML Workshop

Pattern: PMML for Cascading and Hadoop

Paco Nathan, Girish Kathalagiri

Chicago, 2013-08-11 (accepted)

19th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

kdd13pmml.wordpress.com

Page 53: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

[Diagram: example workflow with nodes labeled employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

Cascading: background

The Workflow Abstraction

PMML: Predictive Model Markup

Pattern: PMML in Cascading

PMML for Customer Experiments

Ensemble Models with Pattern

Workflow Design Pattern

Page 54: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Page 55: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

ANSI SQL for ETL

Page 56: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

J2EE for business logic

Page 57: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

SAS for predictive models

Page 58: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

SAS for predictive models

ANSI SQL for ETL

most of the licensing costs…

Page 59: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Anatomy of an Enterprise app

Definition: a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

J2EE for business logic

most of the project costs…

Page 60: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Lingual: DW → ANSI SQL

Pattern: SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC, Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

a compiler sees it all…

cascading.org

Page 61: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

a compiler sees it all…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Lingual: DW → ANSI SQL

Pattern: SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC, Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

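    // emplTap, salesTap, resultsTap and sqlStatement are defined elsewhere in the app
    FlowDef flowDef = FlowDef.flowDef()
        .setName( "etl" )
        .addSource( "example.employee", emplTap )
        .addSource( "example.sales", salesTap )
        .addSink( "results", resultsTap );

    // Lingual: let the SQL statement define the pipe assembly for this flow
    SQLPlanner sqlPlanner = new SQLPlanner()
        .setSql( sqlStatement );

    flowDef.addAssemblyPlanner( sqlPlanner );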

cascading.org

Page 62: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

a compiler sees it all…

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Lingual: DW → ANSI SQL

Pattern: SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC, Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

    // inputTap, classifyTap and pmmlModel are defined elsewhere in the app
    FlowDef flowDef = FlowDef.flowDef()
        .setName( "classifier" )
        .addSource( "input", inputTap )
        .addSink( "classify", classifyTap );

    // Pattern: let the PMML file define the pipe assembly for this flow
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
        .setPMMLInput( new File( pmmlModel ) )
        .retainOnlyActiveIncomingFields();

    flowDef.addAssemblyPlanner( pmmlPlanner );

Page 63: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

cascading.org

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Lingual: DW → ANSI SQL

Pattern: SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC, Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

visual collaboration for the business logic is a great way to improve how teams work together

[Diagram: example workflow with nodes labeled employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

Page 64: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

[Diagram labels: data sources, ETL, data prep, predictive model, end uses]

Lingual: DW → ANSI SQL

Pattern: SAS, R, etc. → PMML

business logic in Java, Clojure, Scala, etc.

sink taps for Memcached, HBase, MongoDB, etc.

source taps for Cassandra, JDBC, Splunk, etc.

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

[Diagram: example workflow with nodes labeled employee, quarterly sales, Join, Count leads, PMML classifier, bonus allocation, Failure Traps]

multiple departments, working in their respective frameworks, integrate results into a combined app, which runs at scale on a cluster… business process combined in a common space (DAG) for flow planners, compiler, optimization, troubleshooting, exception handling, notifications, security audit, performance monitoring, etc.
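To make the "common space (DAG)" idea concrete, here is a minimal plain-Java sketch of what a planner can do once every step lives in one graph: derive an execution order in which each step runs after its inputs. It uses Kahn's algorithm for topological sorting and is not Cascading's actual flow planner. The class name `WorkflowDag` is hypothetical, and the edges between the diagram's nodes are assumed for illustration, since the wiring of the original diagram is not recoverable from the slide text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: not part of the Cascading API.
final class WorkflowDag {
    private final Map<String, List<String>> downstream = new LinkedHashMap<>();
    private final Map<String, Integer> inDegree = new LinkedHashMap<>();

    void addStep(String name) {
        downstream.putIfAbsent(name, new ArrayList<>());
        inDegree.putIfAbsent(name, 0);
    }

    /** Declares that the output of `from` feeds into `to`. */
    void addEdge(String from, String to) {
        addStep(from);
        addStep(to);
        downstream.get(from).add(to);
        inDegree.merge(to, 1, Integer::sum);
    }

    /** Kahn's algorithm: returns the steps so that each one follows all of its inputs. */
    List<String> executionOrder() {
        Map<String, Integer> remaining = new LinkedHashMap<>(inDegree);
        Deque<String> ready = new ArrayDeque<>();
        remaining.forEach((step, count) -> { if (count == 0) ready.add(step); });

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String step = ready.remove();
            order.add(step);
            for (String next : downstream.get(step)) {
                // A step becomes ready once all of its inputs have been scheduled.
                if (remaining.merge(next, -1, Integer::sum) == 0) ready.add(next);
            }
        }
        if (order.size() != remaining.size()) {
            throw new IllegalStateException("workflow contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowDag dag = new WorkflowDag();
        // Assumed wiring between the nodes named in the diagram.
        dag.addEdge("employee", "Join");
        dag.addEdge("quarterly sales", "Join");
        dag.addEdge("Join", "PMML classifier");
        dag.addEdge("PMML classifier", "Count leads");
        dag.addEdge("PMML classifier", "bonus allocation");
        System.out.println(dag.executionOrder());
    }
}
```

Because the whole workflow is visible as one graph, a problem such as a cyclic dependency can be rejected before anything runs on the cluster, which is the kind of up-front checking the slide attributes to having planners and a compiler see the entire app.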

cascading.org

Page 65: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Enterprise Data Workflows with Cascading

O’Reilly, 2013

amazon.com/dp/1449358721

references…

newsletter updates:

liber118.com/pxn/

Page 66: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

Many thanks to others who have contributed code, ideas, suggestions, etc., to Pattern:

• Chris Wensel @ Concurrent

• Girish Kathalagiri @ AgilOne

• Vijay Srinivas Agneeswaran @ Impetus

• Chris Severs @ eBay

• Ofer Mendelevitch @ Hortonworks

• Sergey Boldyrev @ Nokia

• Quinton Anderson @ IZAZI Solutions

• Chris Gutierrez @ Airbnb

• Villu Ruusmann @ JPMML project

acknowledgements…

Page 67: Hadoop Summit: Pattern – an open source project for migrating predictive models from SAS, etc., onto Hadoop

blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:

cascading.org

zest.to/group11

github.com/Cascading

conjars.org

goo.gl/KQtUL

concurrentinc.com

drill-down…
