Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org, ThreatMetrix. This talk presents our experiences in applying Lucene/Solr to the classification of user and device data. On a daily basis, ThreatMetrix, Inc. handles a huge volume of volatile data. The primary challenge is rapidly and precisely classifying each incoming transaction by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned, including details of a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.
RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
Chandra Mouleeswaran, Machine Learning Scientist, ThreatMetrix Inc.
5/2/13 1
My Background
• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
Career Path
- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
Outline
• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments
The Device Identification Task
• Computationally, it's a CLASSIFICATION problem:
  { a0, a1, a2, a3, …, an } → { ci }
  ai = (attribute | field | key) value
  ci = (label | signature | class | hash)
• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
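The classification contract above can be sketched as follows. This is a minimal illustration, not the actual ThreatMetrix matcher: the similarity function, the tolerance threshold, and the attribute names are all invented assumptions.

```python
# Illustrative sketch: an attribute vector either matches a known device
# class within a tolerance, or a new class is created for it.

def classify(attrs: dict, repository: dict, min_score: float = 0.8) -> str:
    """Return the label of the best-matching known device, or a new label."""
    def score(candidate: dict) -> float:
        # Fraction of attributes that agree (a stand-in for the real matcher).
        keys = set(attrs) | set(candidate)
        return sum(attrs.get(k) == candidate.get(k) for k in keys) / len(keys)

    best = max(repository, key=lambda label: score(repository[label]), default=None)
    if best is not None and score(repository[best]) >= min_score:
        return best
    # No good match: create a new class, as the slide describes.
    new_label = f"device-{len(repository)}"
    repository[new_label] = dict(attrs)
    return new_label

repo = {"device-0": {"os": "ios", "ua": "safari", "tz": "utc-8"}}
print(classify({"os": "ios", "ua": "safari", "tz": "utc-8"}, repo))    # device-0
print(classify({"os": "linux", "ua": "firefox", "tz": "utc+1"}, repo)) # device-1
```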
Task Challenges
• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
Engineering Challenges
• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management
                Repository Size (millions)   Load (TPS)   Latency (ms)
Project start   28                           200          < 100
Present         280                          300          < 100
Change          10x                          1.5x         None
Approaches
• Rules engine
• Learning models
• Vector space models
Need an enterprise-grade solution!
Rules Engine
• No experts
• Number of rules?
• Maintenance?
Not a viable approach!
Learning Models
• Most machine learning methods deal predominantly with binary classification problems (e.g. fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
Learning Models …
• Unsupervised learning such as clustering would make a good model, but not good enough to be of practical use; any simplification process compromises accuracy
• Ability to explain is critical
• Learning models tend to ignore domain knowledge
Challenge in providing an enterprise solution
Thoughts
• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally, and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
Vector-Space Models
• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of the data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search
Sensitive to individual attribute changes!
Sources of Inspiration
• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored
Very short learning curve for novices!
Feature Selection
• Primitive and derived attributes
• Entropy
• Distribution
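As a small illustration of the entropy criterion listed above: a near-constant attribute carries little discriminating information, while an attribute with a spread-out distribution carries more. The attribute names and sample values below are made up for the example.

```python
# Rank candidate attributes by Shannon entropy of their observed values.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits) of a list of observed attribute values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

samples = {
    "screen_res": ["1080p", "1080p", "1080p", "1080p"],  # constant: 0 bits
    "user_agent": ["ua1", "ua2", "ua3", "ua4"],          # all distinct: 2 bits
}
ranked = sorted(samples, key=lambda f: entropy(samples[f]), reverse=True)
print(ranked)  # ['user_agent', 'screen_res']
```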
Domain
• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single-field) approach is fast but not precise
• Using all fields is precise but response is slow
Now what?
Disjunction Max
• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries, one per row
• The maximum score of the sub-clauses is used by DisjunctionMaxQuery
• No single term in the user input dominates. This is needed!
Src: SearchHub and LucidWorks
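The max-of-sub-clauses rule above can be shown in a few lines. This follows Lucene's DisjunctionMaxQuery scoring (maximum sub-clause score, plus an optional tie-breaker multiplier times the remaining scores); the field scores themselves are invented for the example.

```python
# Sketch of DisjunctionMaxQuery scoring for one term across several fields.

def dismax_score(field_scores, tie_break=0.0):
    """Max sub-clause score, plus tie_break times the sum of the others."""
    best = max(field_scores)
    return best + tie_break * (sum(field_scores) - best)

# A term matching three fields with different per-field scores:
print(dismax_score([0.2, 0.9, 0.4]))       # 0.9 (pure max: best field wins)
print(dismax_score([0.2, 0.9, 0.4], 0.1))  # ~0.96 (other matches nudge it up)
```

Because only the best field's score dominates, a term that happens to match many weak fields cannot outweigh a term with one strong field match, which is exactly the "no single term dominates" property the slide calls out.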
DisMax Experiments (index size = 60 Million)
Scenario 1
mm = 2
Solr fields = { a1, a2, a3 }
Values = { phrase1, phrase2, phrase3 }
Must-Match Clauses
Latency: YES (35 ms)
Precision: NO (20% failure)
Scenario 2
mm = 50%
Solr fields = { a1 }
Values = { term1, term2, term3, …, term_n }
Should-Match Clauses
Latency: NO (> 2 seconds)
Precision: YES (> 98%)
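For concreteness, the two scenarios above might be expressed as edismax request parameters roughly like this. `defType`, `qf`, and `mm` are standard Solr parameters; the field names `a1..a3` come from the slides, and the query values are placeholders.

```python
# Sketch of the Solr edismax parameters for the two experimental scenarios.
from urllib.parse import urlencode

def edismax_params(q, qf, mm):
    """Assemble an edismax query dict for a /select request."""
    return {"defType": "edismax", "q": q, "qf": " ".join(qf), "mm": mm}

# Scenario 1: few phrase values over three fields, strict mm=2 (must-match).
scenario1 = edismax_params('"phrase1" "phrase2" "phrase3"', ["a1", "a2", "a3"], "2")
# Scenario 2: many single terms over one field, mm=50% (should-match).
scenario2 = edismax_params("term1 term2 term3", ["a1"], "50%")

print(urlencode(scenario1))
```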
Possible Workaround
• Look-ahead: customize Lucene/Solr to do a branch-and-bound search, bailing out on some lower-bound score
• Minimize candidates for the DisMax search
  - reduce the total number of Solr instances to search
  - reduce the total number of disjunctive terms
[Empirical estimate: t_n = 2 * t_{n-1}, where t = time and n = number of disjunctive terms]
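The empirical estimate above says latency doubles with each added disjunctive term. A quick sketch of what that implies for the latency budget (the 1 ms single-term baseline is an assumed figure, not from the slides):

```python
# Unrolling t_n = 2 * t_{n-1} gives t_n = t_1 * 2**(n-1).

def latency_ms(n_terms, base_ms=1.0):
    """Estimated DisMax latency for n disjunctive terms under the doubling rule."""
    return base_ms * 2 ** (n_terms - 1)

for n in (1, 4, 7, 10):
    print(n, latency_ms(n))
# Even from a 1 ms baseline, ten disjunctive terms already cost 512 ms,
# which is why trimming disjunctive terms paid off so heavily.
```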
Phrases over Terms
• Used collocation (a co-occurrence matrix) to determine the most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests
Reduced the number of DisMax terms by 30%
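The collocation step above can be sketched with a simple bigram counter: frequent adjacent word pairs are promoted to phrases, and the terms they cover are dropped from the query. This is an illustrative simplification of a co-occurrence matrix, with made-up sample data.

```python
# Promote frequently co-occurring word pairs to phrases.
from collections import Counter

def top_phrases(docs, min_count=2):
    """Return adjacent word pairs that co-occur at least min_count times."""
    pairs = Counter()
    for doc in docs:
        words = doc.split()
        pairs.update(zip(words, words[1:]))
    return [" ".join(p) for p, c in pairs.items() if c >= min_count]

docs = ["mozilla firefox linux", "mozilla firefox windows", "chrome windows"]
print(top_phrases(docs))  # ['mozilla firefox']
```

Replacing the two terms `mozilla` and `firefox` with the single phrase clause `"mozilla firefox"` shrinks the disjunctive term count, which matters given the doubling latency estimate.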
Sources of Inspiration
• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
Hierarchical Matching
[Architecture diagram] A CSV/JSON record enters the Query Formulator, which applies bag-of-words models, phrases, domain-specific patterns, and filters to build a DisMax query; a Solr instance selector then routes the query to the Solr servers.
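The diagram's two stages can be sketched roughly as below. The phrase substitution follows the earlier "phrases over terms" step; the hash-based routing in `select_instance` is an assumption for illustration, since the slides do not say how the instance selector works.

```python
# Sketch of the query formulator and Solr instance selector.
import zlib

def formulate(record, known_phrases):
    """Build DisMax clauses from a flat record: known phrases first (quoted),
    then any leftover single terms the phrases do not cover."""
    text = " ".join(str(v) for v in record.values())
    phrases = [p for p in known_phrases if p in text]
    covered = {w for p in phrases for w in p.split()}
    clauses = [f'"{p}"' for p in phrases]
    clauses += [w for w in text.split() if w not in covered]
    return " ".join(clauses)

def select_instance(query, n_instances):
    """Route the query to one of n Solr instances (hash routing assumed)."""
    return zlib.crc32(query.encode()) % n_instances

record = {"ua": "mozilla firefox", "os": "linux"}
q = formulate(record, ["mozilla firefox"])
print(q)                      # "mozilla firefox" linux
print(select_instance(q, 4))  # an instance id in 0..3
```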
Verification
Conflict Resolution
• Top n candidates are returned from each Solr instance
• They are ranked based on a custom verification module
• Ties are broken using recency
• The top candidate is persisted and returned along with a custom score
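The conflict-resolution step above reduces to a merge-and-rerank. A minimal sketch, in which the verification function and the `last_seen` recency field are illustrative stand-ins for the custom verification module:

```python
# Merge top-n candidates from each Solr instance, rank by verification
# score, and break ties on recency (higher last_seen wins).
from itertools import chain

def resolve(per_instance_candidates, verify):
    """Return the winning candidate across all instances."""
    merged = chain.from_iterable(per_instance_candidates)
    # Python compares the key tuples lexicographically, so recency only
    # matters when verification scores tie.
    return max(merged, key=lambda c: (verify(c), c["last_seen"]))

candidates = [
    [{"id": "d1", "score": 0.9, "last_seen": 100}],
    [{"id": "d2", "score": 0.9, "last_seen": 250}],
]
winner = resolve(candidates, verify=lambda c: c["score"])
print(winner["id"])  # d2 (scores tie; recency breaks it)
```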
Comments
• DisMax performs a multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
  - Selection = approximate solution
  - Evaluation = refinement
Where Time Went
• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
Sources for Tune Up
• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond Just Looking It Up, Lucid Imagination, May 2010
Testing
• Precision testing using self and mixed modes
• Latency tests
  - custom harness for stand-alone tests
  - integrated tests with the JMeter framework
Results
Latency Percentiles
[Chart: latency percentiles for the original edismax query, the initial solution, Optimization 1 (filters, focused search, verification), and Optimization 2 (domain patterns, stop words, de-dupe)]
TPS
Response Times over Time
Project Execution
• Agile methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & SolrCloud
Gleanings
• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
Gleanings …
• Deal with accuracy before latency
• If precision, latency, and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail
Summary
Simplify and match at multiple levels of abstraction
Contributors
Chandra Mouleeswaran - Research & Prototyping
Fang Chen - Research & Prototyping
Luke Mertens - Productization & Scalability
Brent Pearson - Release Management
Tracy Hsu - Precision Testing & QA
Srinivas Nayani - Deployment & QA
COMMENTS & FEEDBACK: Chandra Mouleeswaran [email protected]