Tuffy
Scaling up Statistical Inference in Markov Logic using an RDBMS
Feng Niu, Christopher Ré, AnHai Doan, and Jude Shavlik
University of Wisconsin-Madison
One Slide Summary

Machine Reading is a DARPA program to capture knowledge expressed in free-form text.
We use Markov Logic, a language that allows rules that are likely, but not certain, to be correct.
Markov Logic yields high quality, but current implementations are confined to small scales.
Tuffy scales up Markov Logic by orders of magnitude using an RDBMS.
Similar challenges arise in enterprise applications.

Outline
- Markov Logic
  - Data model
  - Query language
  - Inference = grounding then search
- Tuffy the System
  - Scaling up grounding with RDBMS
  - Scaling up search with partitioning

A Familiar Data Model
- Relations with known facts
- Relations to be predicted
A Markov Logic program ≈ Datalog + weights.

Markov Logic* — exponential models
* [Richardson & Domingos 2006]

3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
   // students' papers tend to be co-authored by their advisors

Markov Logic by Example

Rules
3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
   // students' papers tend to be co-authored by their advisors
5  advisedBy(s,p) ∧ advisedBy(s,q) → p = q
   // students tend to have at most one advisor
   advisedBy(s,p) → professor(p)
   // advisors must be professors (hard rule)
Evidence
wrote(Tom, Paper1)
wrote(Tom, Paper2)
wrote(Jerry, Paper1)
professor(John)
Query
advisedBy(?, ?)   // who advises whom
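Putting the example together: in Markov Logic, a possible world assigns truth values to the query atoms, and its cost is the total weight of violated ground rules (lower cost = more likely world). The sketch below is my own illustration of this semantics, not Tuffy's code; the rules, weights, and constants come from the slides, while the `cost` function and the large stand-in weight for the hard rule are my assumptions.

```python
# Illustrative sketch of Markov Logic cost semantics (not Tuffy code).
# A world is a set of true advisedBy(s, p) tuples; its cost is the
# total weight of violated ground rules.

from itertools import product

people = ["Tom", "Jerry", "Chuck", "John"]
wrote = {("Tom", "Paper1"), ("Tom", "Paper2"), ("Jerry", "Paper1")}
professor = {"John"}

def cost(advised_by):
    """Total weight of violated ground rules in this world."""
    c = 0.0
    # 3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
    for (s, t), p in product(wrote, people):
        if (s, p) in advised_by and (p, t) not in wrote:
            c += 3
    # 5  advisedBy(s,p) ∧ advisedBy(s,q) → p = q
    for (s, p), (s2, q) in product(advised_by, advised_by):
        if s == s2 and p != q:
            c += 5
    # advisedBy(s,p) → professor(p): hard rule, approximated
    # here (my assumption) by a very large weight.
    for (s, p) in advised_by:
        if p not in professor:
            c += 1e9
    return c

print(cost(set()))               # empty world violates nothing here
print(cost({("Tom", "John")}))   # violates rule 1 twice (weight 3 each)
```

MAP inference then amounts to searching for the lowest-cost world.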
(EDB = evidence relations; IDB = query relations)

Inference

Rules
Evidence Relations
Query Relations

Inference produces:
- MAP: regular tuples (the most likely world)
- Marginal: tuple probabilities

Inference

Rules
Evidence Relations
Query Relations

- Grounding: find tuples that are relevant (to the query)
- Search: find tuples that are true (in the most likely world)

How to Perform Inference

Step 1: Grounding
- Instantiate the rules
3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

Grounding:
3  wrote(Tom,P1) ∧ advisedBy(Tom,Jerry) → wrote(Jerry,P1)
3  wrote(Tom,P1) ∧ advisedBy(Tom,Chuck) → wrote(Chuck,P1)
3  wrote(Chuck,P1) ∧ advisedBy(Chuck,Jerry) → wrote(Jerry,P1)
3  wrote(Chuck,P2) ∧ advisedBy(Chuck,Jerry) → wrote(Jerry,P2)

How to Perform Inference

Step 1: Grounding
- Instantiated rules → Markov Random Field (MRF), a graphical structure of correlations
- Nodes: truth values of tuples
- Edges: instantiated rules (the four ground rules above)

How to Perform Inference

Step 2: Search
- Problem: find the most likely state of the MRF (NP-hard)
- Algorithm: WalkSAT*, a random walk with heuristics
- Remember the lowest-cost world ever seen
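The Step 1 output above can be turned into the MRF graph mechanically: each atom becomes a node, and every pair of atoms that co-occurs in a ground rule gets an edge. This is my own illustration of that construction, using the four ground rules from the slide:

```python
# Build the MRF graph from ground rules (illustrative sketch):
# nodes = atoms, edges = pairs of atoms sharing a ground rule.

from itertools import combinations

ground_rules = [
    ["wrote(Tom,P1)", "advisedBy(Tom,Jerry)", "wrote(Jerry,P1)"],
    ["wrote(Tom,P1)", "advisedBy(Tom,Chuck)", "wrote(Chuck,P1)"],
    ["wrote(Chuck,P1)", "advisedBy(Chuck,Jerry)", "wrote(Jerry,P1)"],
    ["wrote(Chuck,P2)", "advisedBy(Chuck,Jerry)", "wrote(Jerry,P2)"],
]

nodes = {atom for rule in ground_rules for atom in rule}
edges = {frozenset(pair) for rule in ground_rules
         for pair in combinations(rule, 2)}

print(len(nodes), "nodes,", len(edges), "edges")
```

The size of this graph, not the size of the evidence, is what the search step has to handle — which is why scaling search becomes its own challenge later.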
* [Kautz et al. 2006]

[Search animation: advisee/advisor candidates (Tom, Jerry) and (Tom, Chuck), each flipped between True and False during the walk.]

Outline
- Markov Logic
  - Data model
  - Query language
  - Inference = grounding then search
- Tuffy the System
  - Scaling up grounding with RDBMS
  - Scaling up search with partitioning
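The Step 2 search described above can be sketched generically. This is an illustrative WalkSAT for weighted clauses, not Tuffy's actual search loop (which adds further heuristics); the clause encoding and function names are my own:

```python
# Generic WalkSAT sketch for MAP inference (illustrative, not Tuffy code).
# A clause is (weight, literals); a literal (atom, v) is satisfied
# when the atom's truth value equals v.

import random

def cost(state, clauses):
    """Total weight of clauses violated by this truth assignment."""
    return sum(w for w, lits in clauses
               if not any(state[a] == v for a, v in lits))

def walksat(clauses, atoms, steps=1000, p=0.5, seed=0):
    rng = random.Random(seed)
    state = {a: rng.random() < 0.5 for a in atoms}
    best, best_cost = dict(state), cost(state, clauses)
    for _ in range(steps):
        unsat = [lits for w, lits in clauses
                 if not any(state[a] == v for a, v in lits)]
        if not unsat:
            break
        lits = rng.choice(unsat)          # pick a violated clause
        if rng.random() < p:              # random-walk move
            atom = rng.choice(lits)[0]
        else:                             # greedy move: cheapest flip
            atom = min((a for a, _ in lits),
                       key=lambda a: cost({**state, a: not state[a]},
                                          clauses))
        state[atom] = not state[atom]
        c = cost(state, clauses)
        if c < best_cost:                 # remember lowest-cost world seen
            best, best_cost = dict(state), c
    return best, best_cost
```

For example, `walksat([(3, [("x", True), ("y", True)]), (2, [("x", False)])], ["x", "y"])` finds the zero-cost world x = False, y = True.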
Challenge 1: Scaling Grounding

Previous approaches [Singla and Domingos 2006] [Shavlik and Natarajan 2009]:
- Store all data in RAM
- Top-down evaluation
Problems:
- RAM size quickly becomes the bottleneck
- Even when runnable, grounding takes a long time

Grounding in Alchemy*

Prolog-style top-down grounding with C++ loops, plus hand-coded pruning and reordering strategies. For the rule

3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

the loop looks like:

    For each person s:
      For each paper t:
        If !wrote(s, t) then continue
        For each person p:
          If wrote(p, t) then continue
          Emit grounding

Grounding sometimes accounts for over 90% of Alchemy's run time.
* Alchemy: the reference system from UWash

Grounding in Tuffy

Encode grounding as SQL queries
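Tuffy's key idea here — grounding is a relational join — can be sketched with SQLite standing in for PostgreSQL (my substitution; table and column names are my own). The single query below produces exactly the four ground rules from the earlier grounding slide:

```python
# Grounding as a SQL join (sketch; sqlite3 stands in for PostgreSQL).
# Instantiates:  3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE wrote(person TEXT, paper TEXT);
    CREATE TABLE advisedBy(student TEXT, advisor TEXT); -- candidate tuples
    INSERT INTO wrote VALUES ('Tom','P1'), ('Chuck','P1'), ('Chuck','P2');
    INSERT INTO advisedBy VALUES ('Tom','Jerry'), ('Tom','Chuck'),
                                 ('Chuck','Jerry');
""")

# One ground rule per (s, t, p) combination where the body can hold.
groundings = db.execute("""
    SELECT w.person AS s, w.paper AS t, a.advisor AS p
    FROM wrote w JOIN advisedBy a ON w.person = a.student
""").fetchall()

for s, t, p in groundings:
    print(f"3  wrote({s},{t}) ∧ advisedBy({s},{p}) → wrote({p},{t})")
```

Expressing grounding this way is what lets the RDBMS's join algorithms and optimizer do the heavy lifting instead of hand-coded nested loops.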
Executed and optimized by the RDBMS.

Grounding Performance

Tuffy achieves orders-of-magnitude speed-up:

                           Relational Classification | Entity Resolution
Alchemy [C++]                     68 min             |     420 min
Tuffy [Java + PostgreSQL]          1 min             |       3 min
Evidence tuples                    430K              |        676
Query tuples                        10K              |        16K
Rules                                15              |       3.8K

Yes, join algorithms and the optimizer are the key!
Challenge 2: Scaling Search

- First attempt: pure RDBMS, search also in SQL
  - No-go: millions of random accesses
- Obvious fix: hybrid architecture
  - Problem: stuck if |MRF| > |RAM|!

           Grounding | Search
Alchemy    RAM       | RAM
Tuffy-DB   RDBMS     | RDBMS
Tuffy      RDBMS     | RAM

Partition to Scale up Search

Observation
- The MRF sometimes has multiple components
Solution
- Partition the graph into components
- Process them in turn
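The partitioning step above can be sketched as a plain connected-components pass over the MRF graph (my own illustration; Tuffy's actual partitioning machinery is more involved):

```python
# Split an MRF graph into connected components with BFS (sketch).
# Search can then process each component in turn, keeping the best
# state of each component separately.

from collections import defaultdict, deque

def components(nodes, edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        part, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            part.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        parts.append(part)
    return parts

# Two independent components: {a, b, c} and {x, y}.
print(components(["a", "b", "c", "x", "y"],
                 [("a", "b"), ("b", "c"), ("x", "y")]))
```

Each component fits in RAM on its own even when the whole MRF does not, which is what makes the hybrid architecture scale.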
Effect of Partitioning

Pro
- Scalability
- Parallelism
Con (?)
- What's the effect on quality?
(Motivated by scalability, we were willing to sacrifice quality.)

Partitioning Hurts Quality?

(Relational Classification; goal: lower the cost quickly.)
Partitioning can actually improve quality!
Alchemy took over 1 hr; its quality was similar to Tuffy-no-part.

Partitioning (Actually) Improves Quality

Reason: Tuffy remembers the lowest-cost state of each component separately, while Tuffy-no-part can only remember the lowest total cost across all components.

WalkSAT iteration | cost1 | cost2 | cost1 + cost2
        1         |   5   |  20   |      25
        2         |  20   |  10   |      30
        3         |  20   |   5   |      25
       min        |   5   |   5   |      25

cost[Tuffy] = 5 + 5 = 10
cost[Tuffy-no-part] = 25

Theorem (roughly): under certain technical conditions, component-wise partitioning reduces the expected time to hit an optimal state by 2^(#components) steps.
100 components → 100 years of gap!

Further Partitioning

Partition one component further into pieces.
The scalability/quality trade-off depends on whether the component's graph is sparse or dense; in the paper: a cost-based trade-off model.

Conclusion

- Markov Logic is a powerful framework for statistical inference
  - But existing implementations do not scale
- Tuffy scales up Markov Logic inference
  - RDBMS query processing is a perfect fit for grounding
  - Partitioning improves search scalability and quality
- Try it out! http://www.cs.wisc.edu/hazy/tuffy