Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org, ThreatMetrix. This talk presents our experiences in applying Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc. handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned, including details of a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.
RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
Chandra Mouleeswaran, Machine Learning Scientist, ThreatMetrix Inc.
5/2/13 1
My Background
• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
Outline
• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments
The Device Identification Task
• Computationally, it's a CLASSIFICATION problem:
  { a0, a1, a2, a3, …, an } → { ci }
  ai = (attribute | field | key) value
  ci = (label | signature | class | hash)
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
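The classification contract above can be sketched as follows. This is a minimal illustration, not the actual ThreatMetrix matcher: the similarity function, the tolerance threshold, and the attribute names are all invented assumptions.

```python
# Illustrative sketch: an attribute vector either matches a known device
# class within a tolerance, or a new class is created for it.

def classify(attrs: dict, repository: dict, min_score: float = 0.8) -> str:
    """Return the label of the best-matching known device, or a new label."""
    def score(candidate: dict) -> float:
        # Fraction of attributes that agree (a stand-in for the real matcher).
        keys = set(attrs) | set(candidate)
        return sum(attrs.get(k) == candidate.get(k) for k in keys) / len(keys)

    best = max(repository, key=lambda label: score(repository[label]), default=None)
    if best is not None and score(repository[best]) >= min_score:
        return best
    # No good match: create a new class, as the slide describes.
    new_label = f"device-{len(repository)}"
    repository[new_label] = dict(attrs)
    return new_label

repo = {"device-0": {"os": "ios", "ua": "safari", "tz": "utc-8"}}
print(classify({"os": "ios", "ua": "safari", "tz": "utc-8"}, repo))    # device-0
print(classify({"os": "linux", "ua": "firefox", "tz": "utc+1"}, repo)) # device-1
```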
Task Challenges
• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
Engineering Challenges
• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management
                Repository Size (millions)   Load (TPS)   Latency (ms)
Project start   28                           200          < 100
Present         280                          300          < 100
Change          10x                          1.5x         None
Approaches
• Rules engine
• Learning models
• Vector space models
Need an enterprise-grade solution!
Rules Engine
• No experts
• Number of rules?
• Maintenance?
Not a viable approach!
Learning Models
• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
Learning Models …
• Unsupervised learning such as clustering would make a good model, but not good enough to be of practical use; any simplification process compromises accuracy
• Ability to explain is critical
• Learning models tend to ignore domain knowledge
Challenge in providing an enterprise solution
Thoughts
• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally, and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
Vector-Space Models
• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of the data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search
Sensitive to individual attribute changes!
Sources of Inspiration
• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored
Very short learning curve for novices!
Feature Selection
• Primitive and derived attributes
• Entropy
• Distribution
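As a small illustration of the entropy criterion listed above: a near-constant attribute carries little discriminating information, while an attribute with a spread-out distribution carries more. The attribute names and sample values below are made up for the example.

```python
# Rank candidate attributes by Shannon entropy of their observed values.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of observed attribute values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

samples = {
    "screen_res": ["1080p", "1080p", "1080p", "1080p"],  # constant: 0 bits
    "user_agent": ["ua1", "ua2", "ua3", "ua4"],          # all distinct: 2 bits
}
ranked = sorted(samples, key=lambda f: entropy(samples[f]), reverse=True)
print(ranked)  # ['user_agent', 'screen_res']
```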
Domain
• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single-field) approach is fast but not precise
• Using all fields is precise but response is slow
Now what?
Disjunction Max
• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries, one per row
• The maximum score of the sub-clauses is used by DisjunctionMaxQuery
• No single term in the user input dominates. This is needed!
Src: SearchHub and LucidWorks
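The max-of-sub-clauses rule above can be shown in a few lines. This follows Lucene's DisjunctionMaxQuery scoring (maximum sub-clause score, plus an optional tie-breaker multiplier times the remaining scores); the field scores themselves are invented for the example.

```python
# Sketch of DisjunctionMaxQuery scoring for one term across several fields.

def dismax_score(field_scores, tie_break=0.0):
    """Max sub-clause score, plus tie_break times the sum of the others."""
    best = max(field_scores)
    return best + tie_break * (sum(field_scores) - best)

# A term matching three fields with different per-field scores:
print(dismax_score([0.2, 0.9, 0.4]))       # 0.9 (pure max: best field wins)
print(dismax_score([0.2, 0.9, 0.4], 0.1))  # ~0.96 (other matches nudge it up)
```

Because only the best field's score dominates, a term that happens to match many weak fields cannot outweigh a term with one strong field match, which is exactly the "no single term dominates" property the slide calls out.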
DisMax Experiments (index size = 60 Million)
Scenario 1
mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)
Scenario 2
mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, …, term_n }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
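For concreteness, the two scenarios above might be expressed as edismax request parameters roughly like this. `defType`, `qf`, and `mm` are standard Solr parameters; the field names `a1..a3` come from the slides, and the query values are placeholders.

```python
# Sketch of the Solr edismax parameters for the two experimental scenarios.
from urllib.parse import urlencode

def edismax_params(q, qf, mm):
    """Assemble an edismax query dict for a /select request."""
    return {"defType": "edismax", "q": q, "qf": " ".join(qf), "mm": mm}

# Scenario 1: few phrase values over three fields, strict mm=2 (must-match).
scenario1 = edismax_params('"phrase1" "phrase2" "phrase3"', ["a1", "a2", "a3"], "2")
# Scenario 2: many single terms over one field, mm=50% (should-match).
scenario2 = edismax_params("term1 term2 term3", ["a1"], "50%")

print(urlencode(scenario1))
```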
Possible Workaround
• Look-ahead: customize Lucene/Solr to do a branch-and-bound search, bailing out on some lower-bound score
• Minimize candidates for the DisMax search
  - reduce the total number of Solr instances to search
  - reduce the total number of disjunctive terms
[Empirical estimate: t_n = 2 * t_{n-1}, where t = time and n = number of disjunctive terms]
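The empirical estimate above says latency doubles with each added disjunctive term. A quick sketch of what that implies for the latency budget (the 1 ms single-term baseline is an assumed figure, not from the slides):

```python
# Unrolling t_n = 2 * t_{n-1} gives t_n = t_1 * 2**(n-1).

def latency_ms(n_terms, base_ms=1.0):
    """Estimated DisMax latency for n disjunctive terms under the doubling rule."""
    return base_ms * 2 ** (n_terms - 1)

for n in (1, 4, 7, 10):
    print(n, latency_ms(n))
# Even from a 1 ms baseline, ten disjunctive terms already cost 512 ms,
# which is why trimming disjunctive terms paid off so heavily.
```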
Phrases over Terms
• Used collocation (a co-occurrence matrix) to determine the most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests
Reduced the number of DisMax terms by 30%
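The collocation step above can be sketched with a simple bigram counter: frequent adjacent word pairs are promoted to phrases, and the terms they cover are dropped from the query. This is an illustrative simplification of a co-occurrence matrix, with made-up sample data.

```python
# Promote frequently co-occurring word pairs to phrases.
from collections import Counter

def top_phrases(docs, min_count=2):
    """Return adjacent word pairs that co-occur at least min_count times."""
    pairs = Counter()
    for doc in docs:
        words = doc.split()
        pairs.update(zip(words, words[1:]))
    return [" ".join(p) for p, c in pairs.items() if c >= min_count]

docs = ["mozilla firefox linux", "mozilla firefox windows", "chrome windows"]
print(top_phrases(docs))  # ['mozilla firefox']
```

Replacing the two terms `mozilla` and `firefox` with the single phrase clause `"mozilla firefox"` shrinks the disjunctive term count, which matters given the doubling latency estimate.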
Sources of Inspiration
• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
Hierarchical Matching
[Architecture diagram] A CSV/JSON record enters the Query Formulator, which applies bag-of-words models, phrases, domain-specific patterns, and filters to build a DisMax query; a Solr instance selector then routes the query to the Solr servers.
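The diagram's two stages can be sketched roughly as below. The phrase substitution follows the earlier "phrases over terms" step; the hash-based routing in `select_instance` is an assumption for illustration, since the slides do not say how the instance selector works.

```python
# Sketch of the query formulator and Solr instance selector.
import zlib

def formulate(record, known_phrases):
    """Build DisMax clauses from a flat record: known phrases first (quoted),
    then any leftover single terms the phrases do not cover."""
    text = " ".join(str(v) for v in record.values())
    phrases = [p for p in known_phrases if p in text]
    covered = {w for p in phrases for w in p.split()}
    clauses = [f'"{p}"' for p in phrases]
    clauses += [w for w in text.split() if w not in covered]
    return " ".join(clauses)

def select_instance(query, n_instances):
    """Route the query to one of n Solr instances (hash routing assumed)."""
    return zlib.crc32(query.encode()) % n_instances

record = {"ua": "mozilla firefox", "os": "linux"}
q = formulate(record, ["mozilla firefox"])
print(q)                      # "mozilla firefox" linux
print(select_instance(q, 4))  # an instance id in 0..3
```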
Verification
Conflict Resolution
• Top n candidates are returned from each Solr instance
• They are ranked based on a custom verification module
• Ties are broken using recency
• The top candidate is persisted and returned along with a custom score
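The conflict-resolution step above reduces to a merge-and-rerank. A minimal sketch, in which the verification function and the `last_seen` recency field are illustrative stand-ins for the custom verification module:

```python
# Merge top-n candidates from each Solr instance, rank by verification
# score, and break ties on recency (higher last_seen wins).
from itertools import chain

def resolve(per_instance_candidates, verify):
    """Return the winning candidate across all instances."""
    merged = chain.from_iterable(per_instance_candidates)
    # Python compares the key tuples lexicographically, so recency only
    # matters when verification scores tie.
    return max(merged, key=lambda c: (verify(c), c["last_seen"]))

candidates = [
    [{"id": "d1", "score": 0.9, "last_seen": 100}],
    [{"id": "d2", "score": 0.9, "last_seen": 250}],
]
winner = resolve(candidates, verify=lambda c: c["score"])
print(winner["id"])  # d2 (scores tie; recency breaks it)
```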
Comments
• DisMax performs a multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
  - Selection = approximate solution
  - Evaluation = refinement
Where Time Went
• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
Sources for Tune Up
• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond Just Looking It Up, Lucid Imagination, May 2010
Testing
• Precision testing using self and mixed modes
• Latency tests
  - custom harness for stand-alone tests
  - integrated tests with the JMeter framework
Results
Latency Percentiles
[Chart: latency percentiles for the original edismax query, the initial solution, Optimization 1 (filters, focused search, verification), and Optimization 2 (domain patterns, stop words, de-dupe)]
TPS
Response Times over Time
Project Execution
• Agile methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & SolrCloud
Gleanings
• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
Gleanings …
• Deal with accuracy before latency
• If precision, latency, and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail
Summary
Simplify and match at multiple levels of abstraction
Contributors
Chandra Mouleeswaran - Research & Prototyping
Fang Chen - Research & Prototyping
Luke Mertens - Productization & Scalability
Brent Pearson - Release Management
Tracy Hsu - Precision Testing & QA
Srinivas Nayani - Deployment & QA
COMMENTS & FEEDBACK: Chandra Mouleeswaran [email protected]