Tuffy
Scaling up Statistical Inference in Markov Logic using an RDBMS
Feng Niu, Christopher Ré, AnHai Doan, and Jude Shavlik
University of Wisconsin-Madison
One Slide Summary

Machine Reading is a DARPA program to capture knowledge expressed in free-form text.
We use Markov Logic, a language that allows rules that are likely, but not certain, to be correct.
Markov Logic yields high quality, but current implementations are confined to small scales.
Tuffy scales up Markov Logic by orders of magnitude using an RDBMS.
Similar challenges arise in enterprise applications.

Outline
- Markov Logic
  - Data model
  - Query language
  - Inference = grounding then search
- Tuffy the System
  - Scaling up grounding with RDBMS
  - Scaling up search with partitioning

A Familiar Data Model
- Relations with known facts
- Relations to be predicted
A Markov Logic program ≈ Datalog + weights.

Markov Logic* — exponential models
* [Richardson & Domingos 2006]

3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
   // students' papers tend to be co-authored by their advisors

Markov Logic by Example

Rules
3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
   // students' papers tend to be co-authored by their advisors
5  advisedBy(s,p) ∧ advisedBy(s,q) → p = q
   // students tend to have at most one advisor
   advisedBy(s,p) → professor(p)
   // advisors must be professors (hard rule)
Evidence
wrote(Tom, Paper1)
wrote(Tom, Paper2)
wrote(Jerry, Paper1)
professor(John)
Query
advisedBy(?, ?)   // who advises whom
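Putting the example together: in Markov Logic, a possible world assigns truth values to the query atoms, and its cost is the total weight of violated ground rules (lower cost = more likely world). The sketch below is my own illustration of this semantics, not Tuffy's code; the rules, weights, and constants come from the slides, while the `cost` function and the large stand-in weight for the hard rule are my assumptions.

```python
# Illustrative sketch of Markov Logic cost semantics (not Tuffy code).
# A world is a set of true advisedBy(s, p) tuples; its cost is the
# total weight of violated ground rules.

from itertools import product

people = ["Tom", "Jerry", "Chuck", "John"]
wrote = {("Tom", "Paper1"), ("Tom", "Paper2"), ("Jerry", "Paper1")}
professor = {"John"}

def cost(advised_by):
    """Total weight of violated ground rules in this world."""
    c = 0.0
    # 3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)
    for (s, t), p in product(wrote, people):
        if (s, p) in advised_by and (p, t) not in wrote:
            c += 3
    # 5  advisedBy(s,p) ∧ advisedBy(s,q) → p = q
    for (s, p), (s2, q) in product(advised_by, advised_by):
        if s == s2 and p != q:
            c += 5
    # advisedBy(s,p) → professor(p): hard rule, approximated
    # here (my assumption) by a very large weight.
    for (s, p) in advised_by:
        if p not in professor:
            c += 1e9
    return c

print(cost(set()))               # empty world violates nothing here
print(cost({("Tom", "John")}))   # violates rule 1 twice (weight 3 each)
```

MAP inference then amounts to searching for the lowest-cost world.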
(EDB = evidence relations; IDB = query relations)

Inference

Rules
Evidence Relations
Query Relations

Inference produces:
- MAP: regular tuples (the most likely world)
- Marginal: tuple probabilities

Inference

Rules
Evidence Relations
Query Relations

- Grounding: find tuples that are relevant (to the query)
- Search: find tuples that are true (in the most likely world)

How to Perform Inference

Step 1: Grounding
- Instantiate the rules
3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

Grounding:
3  wrote(Tom,P1) ∧ advisedBy(Tom,Jerry) → wrote(Jerry,P1)
3  wrote(Tom,P1) ∧ advisedBy(Tom,Chuck) → wrote(Chuck,P1)
3  wrote(Chuck,P1) ∧ advisedBy(Chuck,Jerry) → wrote(Jerry,P1)
3  wrote(Chuck,P2) ∧ advisedBy(Chuck,Jerry) → wrote(Jerry,P2)

How to Perform Inference

Step 1: Grounding
- Instantiated rules → Markov Random Field (MRF), a graphical structure of correlations
- Nodes: truth values of tuples
- Edges: instantiated rules (the four ground rules above)

How to Perform Inference

Step 2: Search
- Problem: find the most likely state of the MRF (NP-hard)
- Algorithm: WalkSAT*, a random walk with heuristics
- Remember the lowest-cost world ever seen
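The Step 1 output above can be turned into the MRF graph mechanically: each atom becomes a node, and every pair of atoms that co-occurs in a ground rule gets an edge. This is my own illustration of that construction, using the four ground rules from the slide:

```python
# Build the MRF graph from ground rules (illustrative sketch):
# nodes = atoms, edges = pairs of atoms sharing a ground rule.

from itertools import combinations

ground_rules = [
    ["wrote(Tom,P1)", "advisedBy(Tom,Jerry)", "wrote(Jerry,P1)"],
    ["wrote(Tom,P1)", "advisedBy(Tom,Chuck)", "wrote(Chuck,P1)"],
    ["wrote(Chuck,P1)", "advisedBy(Chuck,Jerry)", "wrote(Jerry,P1)"],
    ["wrote(Chuck,P2)", "advisedBy(Chuck,Jerry)", "wrote(Jerry,P2)"],
]

nodes = {atom for rule in ground_rules for atom in rule}
edges = {frozenset(pair) for rule in ground_rules
         for pair in combinations(rule, 2)}

print(len(nodes), "nodes,", len(edges), "edges")
```

The size of this graph, not the size of the evidence, is what the search step has to handle — which is why scaling search becomes its own challenge later.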
* [Kautz et al. 2006]

[Search animation: advisee/advisor candidates (Tom, Jerry) and (Tom, Chuck), each flipped between True and False during the walk.]

Outline
- Markov Logic
  - Data model
  - Query language
  - Inference = grounding then search
- Tuffy the System
  - Scaling up grounding with RDBMS
  - Scaling up search with partitioning
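The Step 2 search described above can be sketched generically. This is an illustrative WalkSAT for weighted clauses, not Tuffy's actual search loop (which adds further heuristics); the clause encoding and function names are my own:

```python
# Generic WalkSAT sketch for MAP inference (illustrative, not Tuffy code).
# A clause is (weight, literals); a literal (atom, v) is satisfied
# when the atom's truth value equals v.

import random

def cost(state, clauses):
    """Total weight of clauses violated by this truth assignment."""
    return sum(w for w, lits in clauses
               if not any(state[a] == v for a, v in lits))

def walksat(clauses, atoms, steps=1000, p=0.5, seed=0):
    rng = random.Random(seed)
    state = {a: rng.random() < 0.5 for a in atoms}
    best, best_cost = dict(state), cost(state, clauses)
    for _ in range(steps):
        unsat = [lits for w, lits in clauses
                 if not any(state[a] == v for a, v in lits)]
        if not unsat:
            break
        lits = rng.choice(unsat)          # pick a violated clause
        if rng.random() < p:              # random-walk move
            atom = rng.choice(lits)[0]
        else:                             # greedy move: cheapest flip
            atom = min((a for a, _ in lits),
                       key=lambda a: cost({**state, a: not state[a]},
                                          clauses))
        state[atom] = not state[atom]
        c = cost(state, clauses)
        if c < best_cost:                 # remember lowest-cost world seen
            best, best_cost = dict(state), c
    return best, best_cost
```

For example, `walksat([(3, [("x", True), ("y", True)]), (2, [("x", False)])], ["x", "y"])` finds the zero-cost world x = False, y = True.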
Challenge 1: Scaling Grounding

Previous approaches [Singla and Domingos 2006] [Shavlik and Natarajan 2009]:
- Store all data in RAM
- Top-down evaluation
Problems:
- RAM size quickly becomes the bottleneck
- Even when runnable, grounding takes a long time

Grounding in Alchemy*

Prolog-style top-down grounding with C++ loops, plus hand-coded pruning and reordering strategies. For the rule

3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

the loop looks like:

    For each person s:
      For each paper t:
        If !wrote(s, t) then continue
        For each person p:
          If wrote(p, t) then continue
          Emit grounding

Grounding sometimes accounts for over 90% of Alchemy's run time.
* Alchemy: the reference system from UWash

Grounding in Tuffy

Encode grounding as SQL queries
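Tuffy's key idea here — grounding is a relational join — can be sketched with SQLite standing in for PostgreSQL (my substitution; table and column names are my own). The single query below produces exactly the four ground rules from the earlier grounding slide:

```python
# Grounding as a SQL join (sketch; sqlite3 stands in for PostgreSQL).
# Instantiates:  3  wrote(s,t) ∧ advisedBy(s,p) → wrote(p,t)

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE wrote(person TEXT, paper TEXT);
    CREATE TABLE advisedBy(student TEXT, advisor TEXT); -- candidate tuples
    INSERT INTO wrote VALUES ('Tom','P1'), ('Chuck','P1'), ('Chuck','P2');
    INSERT INTO advisedBy VALUES ('Tom','Jerry'), ('Tom','Chuck'),
                                 ('Chuck','Jerry');
""")

# One ground rule per (s, t, p) combination where the body can hold.
groundings = db.execute("""
    SELECT w.person AS s, w.paper AS t, a.advisor AS p
    FROM wrote w JOIN advisedBy a ON w.person = a.student
""").fetchall()

for s, t, p in groundings:
    print(f"3  wrote({s},{t}) ∧ advisedBy({s},{p}) → wrote({p},{t})")
```

Expressing grounding this way is what lets the RDBMS's join algorithms and optimizer do the heavy lifting instead of hand-coded nested loops.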
Executed and optimized by the RDBMS.

Grounding Performance

Tuffy achieves orders-of-magnitude speed-up:

                           Relational Classification | Entity Resolution
Alchemy [C++]                     68 min             |     420 min
Tuffy [Java + PostgreSQL]          1 min             |       3 min
Evidence tuples                    430K              |        676
Query tuples                        10K              |        16K
Rules                                15              |       3.8K

Yes, join algorithms and the optimizer are the key!
Challenge 2: Scaling Search

- First attempt: pure RDBMS, search also in SQL
  - No-go: millions of random accesses
- Obvious fix: hybrid architecture
  - Problem: stuck if |MRF| > |RAM|!

           Grounding | Search
Alchemy    RAM       | RAM
Tuffy-DB   RDBMS     | RDBMS
Tuffy      RDBMS     | RAM

Partition to Scale up Search

Observation
- The MRF sometimes has multiple components
Solution
- Partition the graph into components
- Process them in turn
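The partitioning step above can be sketched as a plain connected-components pass over the MRF graph (my own illustration; Tuffy's actual partitioning machinery is more involved):

```python
# Split an MRF graph into connected components with BFS (sketch).
# Search can then process each component in turn, keeping the best
# state of each component separately.

from collections import defaultdict, deque

def components(nodes, edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        part, queue = [], deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            part.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        parts.append(part)
    return parts

# Two independent components: {a, b, c} and {x, y}.
print(components(["a", "b", "c", "x", "y"],
                 [("a", "b"), ("b", "c"), ("x", "y")]))
```

Each component fits in RAM on its own even when the whole MRF does not, which is what makes the hybrid architecture scale.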
Effect of Partitioning

Pro
- Scalability
- Parallelism
Con (?)
- What's the effect on quality?
(Motivated by scalability, we were willing to sacrifice quality.)

Partitioning Hurts Quality?

(Relational Classification; goal: lower the cost quickly.)
Partitioning can actually improve quality!
Alchemy took over 1 hr; its quality was similar to Tuffy-no-part.

Partitioning (Actually) Improves Quality

Reason: Tuffy remembers the lowest-cost state of each component separately, while Tuffy-no-part can only remember the lowest total cost across all components.

WalkSAT iteration | cost1 | cost2 | cost1 + cost2
        1         |   5   |  20   |      25
        2         |  20   |  10   |      30
        3         |  20   |   5   |      25
       min        |   5   |   5   |      25

cost[Tuffy] = 5 + 5 = 10
cost[Tuffy-no-part] = 25

Theorem (roughly): under certain technical conditions, component-wise partitioning reduces the expected time to hit an optimal state by 2^(#components) steps.
100 components → 100 years of gap!

Further Partitioning

Partition one component further into pieces.
The scalability/quality trade-off depends on whether the component's graph is sparse or dense; in the paper: a cost-based trade-off model.

Conclusion

- Markov Logic is a powerful framework for statistical inference
  - But existing implementations do not scale
- Tuffy scales up Markov Logic inference
  - RDBMS query processing is a perfect fit for grounding
  - Partitioning improves search scalability and quality
- Try it out! http://www.cs.wisc.edu/hazy/tuffy