Ted Dunning, Chief Application Architect, MapR at MLconf ATL - 9/18/15

Preview:

Citation preview

© 2014 MapR Technologies 1© 2014 MapR Technologies

Cheap Learning Complements Deep Learning

Ted Dunning

© 2014 MapR Technologies 2

Me, Us• Ted Dunning, Chief Application Architect, MapR

– Committer PMC member Zookeeper, Drill– VP Incubator– Bought the beer at the first HUG

• MapR– Distributes more open source components for Hadoop– Adds major technology for performance, HA, industry standard API’s

• Info– Hash tag - #mapr #mlconfatl– See also - @ApacheDrill

@ted_dunning and @mapR

© 2014 MapR Technologies 3

Agenda• Rationale• Why cheap isn't the same as simple-minded• Some techniques• Examples

© 2014 MapR Technologies 4

Why is cheap better than deep (sometimes)Greenfield problems can be

– Easy (large number of these)– Impossible (large number of these)– Hard but possible (right on the boundary)

Mature problems can be– Easy (these are already done)– Impossible (still a large number of these)– Hard but possible (now the majority of the effort)

© 2014 MapR Technologies 5

Most data isn’t worth much in isolation

First data is valuable

Later data is dregs

© 2014 MapR Technologies 6

Suddenly worth processing

First data is valuable

Later data is dregs

But has high aggregate value

© 2014 MapR Technologies 7

If we can handle the scale

It’s really big

© 2014 MapR Technologies 8

With great scale comes great opportunity• Increasing scale by 1000x changes the game

• We essentially have green fields opening up all around

• Most of the opportunities don’t require advanced learning

© 2014 MapR Technologies 9

A simple example - security monitoring

• “Small” data– Capture IDS logs– Detect what you already know

• “Big” data– Capture switch, server, firewall logs as well– New patterns emerge immediately

© 2014 MapR Technologies 10

Another example – fraud detection

• “Small” data– Maintain card profiles– Segment models– Evaluate all transactions

• “Big” Data– Maintain card profiles, full 90 day transaction history– Per user hierarchical models– Evaluate all transactions

© 2014 MapR Technologies 11

Easy != Stupid• You still have to do things reasonably well

– Techniques that are not well founded are still problems

• Heuristic frequency ratios still fail – Coincidences still dominate the data– Accidental 100% correlations abound

• Related techniques still broken for coincidence– Pearson’s χ2

– Simple correlations

© 2014 MapR Technologies 12

Blast from the past

© 2014 MapR Technologies 13

Scale does not cure wrong

It just makes easy more common

© 2014 MapR Technologies 14

A core technique• Many of these easy problems reduce to finding interesting

coincidences

• This can be summarized as a 2 x 2 table

• Actually, many of these tables

A OtherB k11 k12

Other

k21 k22

© 2014 MapR Technologies 15

How do you do that?• This is well handled using G-test

– See wikipedia– See http://bit.ly/surprise-and-coincidence

• Original application in linguistics now cited > 2000 times

• Available in ElasticSearch, in Solr, in Mahout• Available in R, C, Java, Python

© 2014 MapR Technologies 16

Which one is the anomalous co-occurrence?

A not AB 13 1000

not B 1000 100,000

A not AB 1 0

not B 0 10,000

A not AB 10 0

not B 0 100,000

A not AB 1 0

not B 0 2

© 2014 MapR Technologies 17

Which one is the anomalous co-occurrence?

A not AB 13 1000

not B 1000 100,000

A not AB 1 0

not B 0 10,000

A not AB 10 0

not B 0 100,000

A not AB 1 0

not B 0 20.90 1.95

4.52 14.3

Dunning Ted, Accurate Methods for the Statistics of Surprise and Coincidence, Computational Linguistics vol 19 no. 1 (1993)

© 2014 MapR Technologies 18

So we can find interesting coincidence

and that gets us exactly what?

© 2014 MapR Technologies 19

Cooccurrence Analysis

© 2014 MapR Technologies 20

Real-life example• Query: “Paco de Lucia”• Conventional meta-data search results:

– “hombres de paco” times 400– not much else

• Recommendation based search:– Flamenco guitar and dancers– Spanish and classical guitar– Van Halen doing a classical/flamenco riff

© 2014 MapR Technologies 21

Real-life example

© 2014 MapR Technologies 22

Any other domains?

© 2014 MapR Technologies 23

Document classification

© 2014 MapR Technologies 24

Language identification

© 2014 MapR Technologies 25

OK … Works for language

Anything else?

© 2014 MapR Technologies 26

Species identification

© 2014 MapR Technologies 27

Anything useful?

Like, to do with money?

© 2014 MapR Technologies 28

Common Point of Compromise• Scenario:

– Merchant 0 is compromised, leaks account data during compromise– Fraud committed elsewhere during exploit– High background level of fraud– Limited detection rate for exploits

• Goal:– Find merchant 0

• Meta-goal:– Screen algorithms for this task without leaking sensitive data

© 2014 MapR Technologies 29

Simulation Setup

© 2014 MapR Technologies 30

Simulation Strategy• For each consumer

– Pick consumer parameters such as transaction rate, preferences– Generate transactions until end of sim-time

• If merchant 0 during compromise time, possibly mark as compromised• For all transactions, possible mark as fraud, probability depends on history• Merchants are selected using hierarchical Pittman-Yor

• Restate data– Flatten transaction streams– Sort by time

• Tunables– Compromise probability, background fraud, detection probability

© 2014 MapR Technologies 31

But that isn’t very realistic!• No details of the fraud• No details of the fraudsters• No details on the transactions• No details on the models

• How can this be any good at all?

© 2014 MapR Technologies 32

Secure Development is Hard

© 2014 MapR Technologies 33

Secure Development is Hard

Outside collaborators are outside the security perimeter

They can’t see the data and they can’t tune new algorithms to fit reality

© 2014 MapR Technologies 34

How To Make Realistic Data

© 2014 MapR Technologies 35

Parametric Simulation

Parametric matching of failure signatures allows emulation of complex data properties

Matching on KPI’s and failure modes guarantees practical fidelity

© 2014 MapR Technologies 36

Performance Indicators to Match• User and merchant population• Transaction count/consumer• Merchant propensity skew• Level of detected fraud• Spectrum of meta-model scores

© 2014 MapR Technologies 37

So how does it work in practice?

© 2014 MapR Technologies 38

© 2014 MapR Technologies 39

Really truly bad guys

© 2014 MapR Technologies 40

Summary• We live in a golden age of newly achieved scale

• That scale has lowered the tree– Hard problems are much easier– Lots of low-hanging fruit all around us

• Cheap learning has huge value

• Code available at http://github.com/tdunning

© 2014 MapR Technologies 41

Me, Us• Ted Dunning, Chief Application Architect, MapR

– Committer PMC member Zookeeper, Drill– VP Incubator– Bought the beer at the first HUG

• MapR– Distributes more open source components for Hadoop– Adds major technology for performance, HA, industry standard API’s

• Info– Hash tag - #mapr #mlconfatl– See also - @ted_dunning and @mapR