Markov Logic in Natural Language Processing
Hoifung Poon, Dept. of Computer Science & Eng., University of Washington

Page 1

Markov Logic in Natural Language Processing

Hoifung Poon
Dept. of Computer Science & Eng.

University of Washington

Page 2

Overview

Motivation
Foundational areas
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 3

Languages Are Structural

governments

lm$pxtm (according to their families)

Page 4

Languages Are Structural

govern-ment-s

l-m$px-t-m (according to their families)

[Parse tree for "IL-4 induces CD11B": S → NP VP, VP → V NP]

IL-4 induces CD11B

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41......

[Event graph for the sentence above: "involvement" has Theme "up-regulation" and Cause "activation"; "up-regulation" has Theme IL-10, Cause gp41, and Site "human monocyte"; "activation" has Theme p70(S6)-kinase]

George Walker Bush was the 43rd President of the United States. …… Bush was the eldest son of President G. H. W. Bush and Barbara Bush. …… In November 1977, he met Laura Welch at a barbecue.


Page 6

Languages Are Structural

Objects are not just feature vectors
  They have parts and subparts
  Which have relations with each other
  They can be trees, graphs, etc.

Objects are seldom i.i.d. (independent and identically distributed)
  They exhibit local and global dependencies
  They form class hierarchies (with multiple inheritance)
  Objects' properties depend on those of related objects

Deeply interwoven with knowledge

Page 7

First-Order Logic

Main theoretical foundation of computer science
General language for describing complex structures and knowledge
  Trees, graphs, dependencies, hierarchies, etc. easily expressed
Inference algorithms (satisfiability testing, theorem proving, etc.)

Page 8

G. W. Bush … Laura Bush … Mrs. Bush …

Languages Are Statistical

I saw the man with the telescope

[Two readings: the PP "with the telescope" attaches to "the man" or to "saw"]

Here in London, Frances Deek is a retired teacher …
In the Israeli town …, Karen London says …
Now London says …

London: PERSON or LOCATION?

Microsoft buys Powerset

Microsoft acquires Powerset

Powerset is acquired by Microsoft Corporation

The Redmond software giant buys Powerset

Microsoft’s purchase of Powerset, …

……

Which one?

Page 9

Languages Are Statistical

Languages are ambiguous
Our information is always incomplete
We need to model correlations
Our predictions are uncertain
Statistics provides the tools to handle this

Page 10

Probabilistic Graphical Models

Mixture models
Hidden Markov models
Bayesian networks
Markov random fields
Maximum entropy models
Conditional random fields
Etc.

Page 11

The Problem

Logic is deterministic, requires manual coding
Statistical models assume i.i.d. data, objects = feature vectors
Historically, statistical and logical NLP have been pursued separately
We need to unify the two!
Burgeoning field in machine learning: statistical relational learning

Page 12

Costs and Benefits ofStatistical Relational Learning

Benefits
  Better predictive accuracy
  Better understanding of domains
  Enable learning with less or no labeled data

Costs
  Learning is much harder
  Inference becomes a crucial issue
  Greater complexity for user

Page 13

Progress to Date

Probabilistic logic [Nilsson, 1986]
Statistics and beliefs [Halpern, 1990]
Knowledge-based model construction [Wellman et al., 1992]
Stochastic logic programs [Muggleton, 1996]
Probabilistic relational models [Friedman et al., 1999]
Relational Markov networks [Taskar et al., 2002]
Etc.
This talk: Markov logic [Domingos & Lowd, 2009]

Page 14

Markov Logic: A Unifying Framework

Probabilistic graphical models and first-order logic are special cases
Unified inference and learning algorithms
Easy-to-use software: Alchemy
Broad applicability
Goal of this tutorial: quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications

Page 15

Overview

Motivation
Foundational areas
  Probabilistic inference
  Statistical learning
  Logical inference
  Inductive logic programming
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 16

Markov Networks

Undirected graphical models

[Graph over variables: Smoking, Cancer, Asthma, Cough]

Potential functions defined over cliques

Smoking  Cancer  Φ(S,C)
False    False   4.5
False    True    4.5
True     False   2.7
True     True    4.5

P(x) = \frac{1}{Z} \prod_c \Phi_c(x_c), \qquad Z = \sum_x \prod_c \Phi_c(x_c)

Page 17

Markov Networks

Undirected graphical models

Log-linear model:

P(x) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(x) \right)

w_i: weight of feature i; f_i: feature i. E.g.:

f_1(\text{Smoking}, \text{Cancer}) = \begin{cases} 1 & \text{if } \lnot\text{Smoking} \lor \text{Cancer} \\ 0 & \text{otherwise} \end{cases}, \qquad w_1 = 1.5
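To make the log-linear form concrete, here is a minimal Python sketch (mine, not from the slides) that enumerates all worlds of the Smoking/Cancer example, computes Z by brute force, and prints each world's probability; the single feature and weight follow the example above.

```python
import itertools, math

w1 = 1.5

def f1(smoking, cancer):
    # f1 = 1 iff the soft clause Smoking => Cancer is satisfied
    return 1.0 if (not smoking) or cancer else 0.0

worlds = list(itertools.product([False, True], repeat=2))
Z = sum(math.exp(w1 * f1(s, c)) for s, c in worlds)   # partition function

for s, c in worlds:
    p = math.exp(w1 * f1(s, c)) / Z
    print(f"Smoking={s} Cancer={c} P={p:.3f}")
```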

Page 18

Markov Nets vs. Bayes Nets

Property         Markov Nets        Bayes Nets
Form             Prod. potentials   Prod. potentials
Potentials       Arbitrary          Cond. probabilities
Cycles           Allowed            Forbidden
Partition func.  Z = ?              Z = 1
Indep. check     Graph separation   D-separation
Indep. props.    Some               Some
Inference        MCMC, BP, etc.     Convert to Markov

Page 19

Inference in Markov Networks

Goal: compute marginals & conditionals of

P(X) = \frac{1}{Z} \exp\left( \sum_i w_i f_i(X) \right), \qquad Z = \sum_X \exp\left( \sum_i w_i f_i(X) \right)

Exact inference is #P-complete
Conditioning on the Markov blanket is easy:

P(x \mid MB(x)) = \frac{\exp\left( \sum_i w_i f_i(x) \right)}{\exp\left( \sum_i w_i f_i(x{=}0) \right) + \exp\left( \sum_i w_i f_i(x{=}1) \right)}

Gibbs sampling exploits this

Page 20

MCMC: Gibbs Sampling

state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
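A runnable Python sketch of the sampler above for the two-variable network from the earlier slides; the single weighted feature and the variable names are illustrative assumptions.

```python
import math, random

W1 = 1.5   # weight of the feature Smoking => Cancer (illustrative)

def score(state):
    # sum of weights of satisfied features
    return W1 if (not state["Smoking"]) or state["Cancer"] else 0.0

def gibbs(num_samples=20000):
    state = {v: random.random() < 0.5 for v in ("Smoking", "Cancer")}
    count = 0
    for _ in range(num_samples):
        for var in state:
            # P(var | rest) from the exponentiated scores of both values
            state[var] = True;  e1 = math.exp(score(state))
            state[var] = False; e0 = math.exp(score(state))
            state[var] = random.random() < e1 / (e0 + e1)
        count += state["Cancer"]
    return count / num_samples

print("P(Cancer) ~", gibbs())
```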

Page 21

Other Inference Methods

Belief propagation (sum-product)
Mean field / variational approximations

Page 22

MAP/MPE Inference

Goal: Find most likely state of world given evidence

\arg\max_y P(y \mid x)

(y: query, x: evidence)

Page 23

MAP Inference Algorithms

Iterated conditional modes
Simulated annealing
Graph cuts
Belief propagation (max-product)
LP relaxation

Page 24

Overview

Motivation
Foundational areas
  Probabilistic inference
  Statistical learning
  Logical inference
  Inductive logic programming
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 25

Generative Weight Learning

Maximize likelihood
Use gradient ascent or L-BFGS
No local maxima

Requires inference at each step (slow!)

\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]

n_i(x): no. of times feature i is true in data
E_w[n_i(x)]: expected no. of times feature i is true according to model
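A tiny illustration of this gradient, assuming the same one-feature network: the model expectation is computed by exact enumeration (feasible only for toy networks; real learners substitute sampling or pseudo-likelihood), and the weight converges to where the model count matches the data count.

```python
import itertools, math

def f(s, c):                        # feature: Smoking => Cancer
    return 1.0 if (not s) or c else 0.0

data = [(True, True), (True, True), (True, False)]   # toy observations
worlds = list(itertools.product([False, True], repeat=2))
w, lr = 0.0, 0.5

for _ in range(200):
    Z = sum(math.exp(w * f(*x)) for x in worlds)
    e_model = sum(f(*x) * math.exp(w * f(*x)) / Z for x in worlds)
    n_data = sum(f(*x) for x in data) / len(data)    # per-example count
    w += lr * (n_data - e_model)                     # gradient ascent step

print("learned weight:", round(w, 3))  # ~log(2/3): data satisfies f 2/3 of the time
```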

Page 26

Pseudo-Likelihood

Likelihood of each variable given its neighbors in the data

PL(x) = \prod_i P(x_i \mid \text{neighbors}(x_i))

Does not require inference at each step
Widely used in vision, spatial statistics, etc.
But PL parameters may not work well for long inference chains

Page 27

Discriminative Weight Learning

Maximize conditional likelihood of query (y) given evidence (x)

Approximate expected counts by counts in MAP state of y given x

\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]

n_i(x, y): no. of true groundings of clause i in data
E_w[n_i(x, y)]: expected no. of true groundings according to model

Page 28

Voted Perceptron

w_i ← 0
for t ← 1 to T do
    y_MAP ← Viterbi(x)
    w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return w_i / T

Originally proposed for training HMMs discriminatively
Assumes network is linear chain
Can be generalized to arbitrary networks
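A hedged Python sketch of the averaged update loop above. The MAP oracle is passed in as a function (Viterbi here; MaxWalkSAT for general MLNs, per a later slide), and `counts` returning a feature-count vector is an assumption of the sketch.

```python
def voted_perceptron(x, y_data, counts, map_inference, T=100, eta=0.1):
    """counts(x, y) -> list of feature counts; map_inference(x, w) -> labeling."""
    n = len(counts(x, y_data))
    w, w_sum = [0.0] * n, [0.0] * n
    for _ in range(T):
        y_map = map_inference(x, w)            # most likely labeling under w
        c_data, c_map = counts(x, y_data), counts(x, y_map)
        for i in range(n):
            w[i] += eta * (c_data[i] - c_map[i])
            w_sum[i] += w[i]                   # accumulate for averaging
    return [s / T for s in w_sum]              # averaged ("voted") weights
```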

Page 29

Overview

Motivation
Foundational areas
  Probabilistic inference
  Statistical learning
  Logical inference
  Inductive logic programming
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 30

First-Order Logic

Constants, variables, functions, predicates
  E.g.: Anna, x, MotherOf(x), Friends(x, y)
Literal: predicate or its negation
Clause: disjunction of literals
Grounding: replace all variables by constants
  E.g.: Friends(Anna, Bob)
World (model, interpretation): assignment of truth values to all ground predicates

Page 31

Inference in First-Order Logic

Traditionally done by theorem proving (e.g.: Prolog)
Propositionalization followed by model checking turns out to be faster (often by a lot)
Propositionalization: create all ground atoms and clauses
Model checking: satisfiability testing
Two main approaches:
  Backtracking (e.g.: DPLL)
  Stochastic local search (e.g.: WalkSAT)

Page 32

Satisfiability

Input: set of clauses (convert KB to conjunctive normal form (CNF))
Output: truth assignment that satisfies all clauses, or failure
The paradigmatic NP-complete problem
Solution: search
Key point: most SAT problems are actually easy
Hard region: narrow range of #Clauses / #Variables

Page 33

Stochastic Local Search

Uses complete assignments instead of partial
Start with random state
Flip variables in unsatisfied clauses
Hill-climbing: minimize # unsatisfied clauses
Avoid local minima: random flips
Multiple restarts

Page 34

The WalkSAT Algorithm

for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if all clauses satisfied then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes # satisfied clauses
return failure
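The same algorithm as runnable Python. Clauses use a DIMACS-style encoding (my assumption): a clause is a list of signed integers, where 3 means "variable 3 true" and -3 means "variable 3 false".

```python
import random

def walksat(clauses, n_vars, max_tries=10, max_flips=1000, p=0.5):
    def is_sat(clause, assign):
        return any(assign[abs(l)] == (l > 0) for l in clause)
    for _ in range(max_tries):
        assign = [random.random() < 0.5 for _ in range(n_vars + 1)]  # index 0 unused
        for _ in range(max_flips):
            unsat = [c for c in clauses if not is_sat(c, assign)]
            if not unsat:
                return assign
            c = random.choice(unsat)
            if random.random() < p:
                v = abs(random.choice(c))            # random-walk move
            else:                                    # greedy move
                def n_sat_after_flip(var):
                    assign[var] = not assign[var]
                    n = sum(is_sat(cl, assign) for cl in clauses)
                    assign[var] = not assign[var]
                    return n
                v = max((abs(l) for l in c), key=n_sat_after_flip)
            assign[v] = not assign[v]
    return None   # failure

# (x1 v x2) ^ (!x1 v x3) ^ (!x2 v !x3)
print(walksat([[1, 2], [-1, 3], [-2, -3]], 3))
```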

Page 35

Overview

Motivation
Foundational areas
  Probabilistic inference
  Statistical learning
  Logical inference
  Inductive logic programming
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 36

Rule Induction

Given: set of positive and negative examples of some concept
  Example: (x1, x2, …, xn, y)
  y: concept (Boolean)
  x1, x2, …, xn: attributes (assume Boolean)
Goal: induce a set of rules that cover all positive examples and no negative ones
  Rule: xa ^ xb ^ … => y (xa: literal, i.e., xi or its negation)
  Same as Horn clause: Body => Head
  Rule r covers example x iff x satisfies body of r
Eval(r): accuracy, info gain, coverage, support, etc.

Page 37

Learning a Single Rule

head ← y
body ← Ø
repeat
    for each literal x
        r_x ← r with x added to body
        Eval(r_x)
    body ← body ^ best x
until no x improves Eval(r)
return r

Page 38

Learning a Set of Rules

R ← Ø
S ← examples
repeat
    learn a single rule r
    R ← R ∪ { r }
    S ← S − positive examples covered by r
until S = Ø
return R
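A compact Python sketch of both loops (sequential covering with greedy rule growth), under simplifying assumptions: examples are dicts of Boolean attributes plus a Boolean label 'y', and Eval is accuracy on covered examples.

```python
def covers(body, ex):
    return all(ex[a] == v for a, v in body)

def eval_rule(body, examples):
    covered = [ex for ex in examples if covers(body, ex)]
    return sum(ex["y"] for ex in covered) / len(covered) if covered else 0.0

def learn_rule(examples, attrs):
    body = []
    while True:
        cands = [(a, v) for a in attrs for v in (True, False)
                 if (a, v) not in body]
        if not cands:
            return body
        best = max(cands, key=lambda lit: eval_rule(body + [lit], examples))
        if eval_rule(body + [best], examples) <= eval_rule(body, examples):
            return body                        # no literal improves Eval
        body.append(best)

def learn_rules(examples, attrs):
    rules, remaining = [], list(examples)
    while any(ex["y"] for ex in remaining):
        r = learn_rule(remaining, attrs)
        rules.append(r)
        # remove positive examples covered by r
        remaining = [ex for ex in remaining if not (ex["y"] and covers(r, ex))]
    return rules
```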

Page 39

First-Order Rule Induction

y and x_i are now predicates with arguments
  E.g.: y is Ancestor(x,y), x_i is Parent(x,y)
Literals to add are predicates or their negations
Literal to add must include at least one variable already appearing in rule
Adding a literal changes # groundings of rule
  E.g.: Ancestor(x,z) ^ Parent(z,y) => Ancestor(x,y)
Eval(r) must take this into account
  E.g.: multiply by # positive groundings of rule still covered after adding literal

Page 40

Overview

Motivation
Foundational areas
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 41

Markov Logic

Syntax: weighted first-order formulas
Semantics: feature templates for Markov networks
Intuition: soften logical constraints
  Give each formula a weight
  (Higher weight => stronger constraint)

P(world) ∝ exp(Σ weights of formulas it satisfies)

Page 42

Example: Coreference Resolution

Mentions of Obama are often headed by "Obama"
Mentions of Obama are often headed by "President"
Appositions usually refer to the same entity

Barack Obama, the 44th President of the United States, is the first African American to hold the office. ……

Page 43

Example: Coreference Resolution

MentionOf(x, Obama) => Head(x, "Obama")
MentionOf(x, Obama) => Head(x, "President")
Apposition(x, y) ^ MentionOf(x, c) => MentionOf(y, c)

Page 44

Example: Coreference Resolution

1.5  MentionOf(x, Obama) => Head(x, "Obama")
0.8  MentionOf(x, Obama) => Head(x, "President")
100  Apposition(x, y) ^ MentionOf(x, c) => MentionOf(y, c)

Page 45

Example: Coreference Resolution

1.5  MentionOf(x, Obama) => Head(x, "Obama")
0.8  MentionOf(x, Obama) => Head(x, "President")
100  Apposition(x, y) ^ MentionOf(x, c) => MentionOf(y, c)

Two mention constants: A and B

[Ground network over the atoms Head(A,"Obama"), Head(A,"President"), Head(B,"Obama"), Head(B,"President"), MentionOf(A,Obama), MentionOf(B,Obama), Apposition(A,B), Apposition(B,A)]

Page 46

Markov Logic Networks

MLN is template for ground Markov nets
Probability of a world x:

P(x) = \frac{1}{Z} \exp\left( \sum_i w_i n_i(x) \right)

w_i: weight of formula i
n_i(x): no. of true groundings of formula i in x

Typed variables and constants greatly reduce size of ground Markov net
Functions, existential quantifiers, etc.
Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
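What n_i(x) means operationally, as a small Python sketch of mine: ground the clause Smokes(x) => Cancer(x) over a toy domain and count the true groundings in a given world (the representation is an assumption, not Alchemy's internals).

```python
domain = ["Anna", "Bob"]
world = {("Smokes", "Anna"): True,  ("Cancer", "Anna"): True,
         ("Smokes", "Bob"):  True,  ("Cancer", "Bob"):  False}

def n_true_groundings(world, domain):
    # clause: !Smokes(x) v Cancer(x); one grounding per constant x
    return sum((not world[("Smokes", x)]) or world[("Cancer", x)]
               for x in domain)

print(n_true_groundings(world, domain))   # 1: true for Anna, false for Bob
```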

Page 47

Relation to Statistical Models

Special cases:
  Markov networks
  Markov random fields
  Bayesian networks
  Log-linear models
  Exponential models
  Max. entropy models
  Gibbs distributions
  Boltzmann machines
  Logistic regression
  Hidden Markov models
  Conditional random fields

Obtained by making all predicates zero-arity
Markov logic allows objects to be interdependent (non-i.i.d.)

Page 48

Relation to First-Order Logic

Infinite weights => first-order logic
Satisfiable KB, positive weights => satisfying assignments = modes of distribution
Markov logic allows contradictions between formulas

Page 49

MLN Algorithms: The First Three Generations

Problem             First generation         Second generation  Third generation
MAP inference       Weighted satisfiability  Lazy inference     Cutting planes
Marginal inference  Gibbs sampling           MC-SAT             Lifted inference
Weight learning     Pseudo-likelihood        Voted perceptron   Scaled conj. gradient
Structure learning  Inductive logic progr.   ILP + PL (etc.)    Clustering + pathfinding

Page 50

MAP/MPE Inference

Problem: Find most likely state of world given evidence

\arg\max_y P(y \mid x)

(y: query, x: evidence)

Page 51

MAP/MPE Inference

Problem: Find most likely state of world given evidence

\max_y \frac{1}{Z_x} \exp\left( \sum_i w_i n_i(x, y) \right)

Page 52

MAP/MPE Inference

Problem: Find most likely state of world given evidence

\max_y \sum_i w_i n_i(x, y)

Page 53

MAP/MPE Inference

Problem: Find most likely state of world given evidence

\max_y \sum_i w_i n_i(x, y)

This is just the weighted MaxSAT problem
Use a weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])

Page 54

The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution = random truth assignment
    for j ← 1 to max-flips do
        if weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes weights(sat. clauses)
return failure, best solution found

Page 55

Computing Probabilities

P(Formula | MLN, C) = ?
  MCMC: sample worlds, check formula holds
P(Formula1 | Formula2, MLN, C) = ?
  If Formula2 = conjunction of ground atoms:
    First construct min subset of network necessary to answer query (generalization of KBMC)
    Then apply MCMC

Page 56

But … Insufficient for Logic

Problem:
  Deterministic dependencies break MCMC
  Near-deterministic ones make it very slow

Solution:
  Combine MCMC and WalkSAT
  → MC-SAT algorithm [Poon & Domingos, 2006]

Page 57

Auxiliary-Variable Methods

Main ideas:
  Use auxiliary variables to capture dependencies
  Turn difficult sampling into uniform sampling

Given distribution P(x), define

f(x, u) = \begin{cases} 1 & \text{if } 0 \le u \le P(x) \\ 0 & \text{otherwise} \end{cases}
\qquad \Rightarrow \qquad \int f(x, u)\, du = P(x)

Sample from f(x, u), then discard u

Page 58

Slice Sampling [Damien et al. 1999]

[Figure: slice sampling — given x(k), sample u(k) uniformly from [0, P(x(k))]; then sample x(k+1) uniformly from the slice {x : P(x) ≥ u(k)}]

Page 59

Slice Sampling

P(x) = \frac{1}{Z} \prod_i \Phi_i(x)

Identifying the slice may be difficult
Introduce an auxiliary variable u_i for each Φ_i:

f(x, u_1, \ldots, u_n) = \begin{cases} 1 & \text{if } 0 \le u_i \le \Phi_i(x) \text{ for all } i \\ 0 & \text{otherwise} \end{cases}
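A minimal Python sketch of slice sampling for a one-dimensional unnormalized density. It uses a fixed bracket with shrinkage rather than the full step-out procedure, so it is only an approximation of the standard algorithm; the Gaussian-shaped target is an arbitrary choice of mine.

```python
import math, random

def p_tilde(x):                       # unnormalized target density
    return math.exp(-0.5 * x * x)

def slice_sample(x0, n, width=3.0):
    xs, x = [], x0
    for _ in range(n):
        u = random.uniform(0.0, p_tilde(x))   # auxiliary variable: slice level
        lo, hi = x - width, x + width         # bracket around current point
        while True:                           # uniform proposal, shrink on reject
            x_new = random.uniform(lo, hi)
            if p_tilde(x_new) > u:
                x = x_new
                break
            if x_new < x: lo = x_new
            else:         hi = x_new
        xs.append(x)
    return xs

samples = slice_sample(0.0, 5000)
print(sum(samples) / len(samples))    # ~0 for this symmetric target
```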

Page 60

The MC-SAT Algorithm

Select random subset M of satisfied clauses, each with probability 1 − exp(−w_i)
  Larger w_i => C_i more likely to be selected
  Hard clause (w_i = ∞): always selected
Slice = states that satisfy clauses in M
Uses SAT solver to sample x | u
Orders of magnitude faster than Gibbs sampling, etc.
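The clause-selection step in Python, as a hedged sketch: hard clauses get w = inf, so 1 − exp(−w) = 1 and they are always kept; the near-uniform SAT sampler (SampleSAT in practice) is left as a stub.

```python
import math, random

def satisfied(clause, state):
    return any(state[abs(l)] == (l > 0) for l in clause)

def mc_sat_step(state, clauses, weights, sat_sampler):
    # slice: keep each currently satisfied clause with probability 1 - exp(-w_i)
    m = [c for c, w in zip(clauses, weights)
         if satisfied(c, state) and random.random() < 1.0 - math.exp(-w)]
    # next state: (near-)uniform sample from assignments satisfying all of m
    return sat_sampler(m)
```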

Page 61

But … It Is Not Scalable

1000 researchers
Coauthor(x,y): 1 million ground atoms
Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
Exponential in arity

Page 62

Sparsity to the Rescue

1000 researchers
Coauthor(x,y): 1 million ground atoms
  But … most atoms are false
Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
  Most trivially satisfied if most atoms are false
  No need to explicitly compute most of them

Page 63

Lazy Inference

LazySAT [Singla & Domingos, 2006a]
  Lazy version of WalkSAT [Selman et al., 1996]
  Grounds atoms/clauses as needed
  Greatly reduces memory usage
The idea is much more general [Poon & Domingos, 2008a]

Page 64

General Method for Lazy Inference

If most variables assume the default value, it is wasteful to instantiate all variables / functions
Main idea:
  Allocate memory for a small subset of "active" variables / functions
  Activate more if necessary as inference proceeds
Applicable to a diverse set of algorithms: satisfiability solvers (systematic, local-search), Markov chain Monte Carlo, MPE/MAP algorithms, maximum expected utility algorithms, belief propagation, MC-SAT, etc.
Reduces memory and time by orders of magnitude

Page 65

Lifted Inference

Consider belief propagation (BP)
Often in large problems, many nodes are interchangeable: they send and receive the same messages throughout BP
Basic idea: group them into supernodes, forming a lifted network
Smaller network → faster inference
Akin to resolution in first-order logic

Page 66

Belief Propagation

Nodes (x), Features (f)

\mu_{x \to f}(x) = \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x)

\mu_{f \to x}(x) = \sum_{\sim\{x\}} \left( e^{w f(\mathbf{x})} \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y) \right)

Page 67

Lifted Belief Propagation

\mu_{x \to f}(x) = \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x)

\mu_{f \to x}(x) = \sum_{\sim\{x\}} \left( e^{w f(\mathbf{x})} \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y) \right)

Nodes (x), Features (f)

Page 68

Lifted Belief Propagation

\mu_{x \to f}(x) = \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x)^{\alpha}

\mu_{f \to x}(x) = \sum_{\sim\{x\}} \left( e^{w f(\mathbf{x})} \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y)^{\beta} \right)

α, β: functions of edge counts

Nodes (x), Features (f)

Page 69

Learning

Data is a relational database
Closed world assumption (if not: EM)
Learning parameters (weights)
Learning structure (formulas)

Page 70

Parameter Learning

Parameter tying: groundings of same clause

\frac{\partial}{\partial w_i} \log P(x) = n_i(x) - E_w[n_i(x)]

n_i(x): no. of times clause i is true in data
E_w[n_i(x)]: expected no. of times clause i is true according to MLN

Generative learning: pseudo-likelihood
Discriminative learning: conditional likelihood; use MC-SAT or MaxWalkSAT for inference

Page 71

Parameter Learning

Pseudo-likelihood + L-BFGS is fast and robust but can give poor inference results
Voted perceptron: gradient descent + MAP inference
Scaled conjugate gradient

Page 72

Voted Perceptron for MLNs

HMMs are special case of MLNs
Replace Viterbi by MaxWalkSAT
Network can now be arbitrary graph

w_i ← 0
for t ← 1 to T do
    y_MAP ← MaxWalkSAT(x)
    w_i ← w_i + η [count_i(y_Data) − count_i(y_MAP)]
return w_i / T

Page 73

Problem: Multiple Modes

Not alleviated by contrastive divergence
Alleviated by MC-SAT
Warm start: start each MC-SAT run at previous end state

Page 74

Problem: Extreme Ill-Conditioning

Solvable by quasi-Newton, conjugate gradient, etc.
But line searches require exact inference
Solution: scaled conjugate gradient [Lowd & Domingos, 2008]
  Use Hessian to choose step size
  Compute quadratic form inside MC-SAT
  Use inverse diagonal Hessian as preconditioner

Page 75

Structure Learning

Standard inductive logic programming optimizes the wrong thing
  But can be used to overgenerate for L1 pruning
Our approach: ILP + pseudo-likelihood + structure priors
For each candidate structure change:
  Start from current weights & relax convergence
  Use subsampling to compute sufficient statistics

Page 76

Structure Learning

Initial state: unit clauses or prototype KB
Operators: add/remove literal, flip sign
Evaluation function: pseudo-likelihood + structure prior
Search: beam search, shortest-first search

Page 77

Alchemy

Open-source software including:
  Full first-order logic syntax
  Generative & discriminative weight learning
  Structure learning
  Weighted satisfiability, MCMC, lifted BP
  Programming language features

alchemy.cs.washington.edu

Page 78

                 Alchemy                    Prolog            BUGS
Representation   F.O. logic + Markov nets   Horn clauses      Bayes nets
Inference        Model checking, MCMC,      Theorem proving   MCMC
                 lifted BP
Learning         Parameters & structure     No                Params.
Uncertainty      Yes                        No                Yes
Relational       Yes                        Yes               No

Page 79

Constrained Conditional Model

Representation: integer linear programs (local classifiers + global constraints)
Inference: LP solver
Parameter learning: none for constraints
  Weights of soft constraints set heuristically
  Local weights typically learned independently
Structure learning: none to date
  But see latest development in NAACL-10

Page 80

Running Alchemy

Programs: infer, learnwts, learnstruct
Options
MLN file:
  Types (optional)
  Predicates
  Formulas
Database files

Page 81

Overview

Motivation
Foundational areas
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 82

Uniform Distribn.: Empty MLN

Example: Unbiased coin flips

Type: flip = { 1, … , 20 }

Predicate: Heads(flip)

P(\text{Heads}(f)) = \frac{e^0}{e^0 + e^0} = \frac{1}{2}

Page 83

Binomial Distribn.: Unit Clause

Example: Biased coin flips

Type: flip = { 1, … , 20 }

Predicate: Heads(flip)

Formula: Heads(f)

Weight: Log odds of heads:

By default, MLN includes unit clauses for all predicates

(captures marginal distributions, etc.)

P(\text{Heads}(f)) = \frac{e^w}{e^w + e^0} = p

w = \log \frac{p}{1 - p}
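A two-line numerical check (my sketch) that the log-odds weight behaves as claimed:

```python
import math

p = 0.3
w = math.log(p / (1 - p))                 # weight of the unit clause Heads(f)
print(math.exp(w) / (math.exp(w) + 1))    # 0.3 = p, as expected
```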

Page 84

Multinomial Distribution

Example: Throwing die

Types: throw = { 1, … , 20 }

face = { 1, … , 6 }

Predicate: Outcome(throw,face)

Formulas: Outcome(t,f) ^ f != f’ => !Outcome(t,f’).

Exist f Outcome(t,f).

Too cumbersome!

Page 85

Multinomial Distrib.: ! Notation

Example: Throwing die

Types: throw = { 1, … , 20 }

face = { 1, … , 6 }

Predicate: Outcome(throw,face!)

Formulas: none needed

Semantics: Arguments without “!” determine arguments with “!”.

Also makes inference more efficient (triggers blocking).

Page 86

Multinomial Distrib.: + Notation

Example: Throwing biased die

Types: throw = { 1, … , 20 }

face = { 1, … , 6 }

Predicate: Outcome(throw,face!)

Formulas: Outcome(t,+f)

Semantics: Learn weight for each grounding of args with “+”.

Page 87

Logistic Regression (MaxEnt)

Logistic regression:

\log \frac{P(C=1 \mid F=f)}{P(C=0 \mid F=f)} = a + \sum_i b_i f_i

Type: obj = { 1, ..., n }
Query predicate: C(obj)
Evidence predicates: F_i(obj)
Formulas:
  a      C(x)
  b_i    F_i(x) ^ C(x)

Resulting distribution:

P(C=c, F=f) = \frac{1}{Z} \exp\left( a c + \sum_i b_i f_i c \right)

Therefore:

\log \frac{P(C=1 \mid F=f)}{P(C=0 \mid F=f)} = \log \frac{\exp\left( a + \sum_i b_i f_i \right)}{\exp(0)} = a + \sum_i b_i f_i

Alternative form: F_i(x) => C(x)
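A quick Python sketch of mine confirming the equivalence: computing the log odds from the MLN's unnormalized distribution gives exactly a + Σ_i b_i f_i (the values of a, b, and f are arbitrary).

```python
import math

a, b = -0.5, [1.2, -0.7]
f = [1, 1]                       # observed evidence values

def unnorm(c):                   # exp(a*c + sum_i b_i*f_i*c); Z cancels in the ratio
    return math.exp(a * c + sum(bi * fi * c for bi, fi in zip(b, f)))

log_odds = math.log(unnorm(1) / unnorm(0))
print(log_odds, a + sum(bi * fi for bi, fi in zip(b, f)))   # identical
```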

Page 88

Hidden Markov Models

obs = { Red, Green, Yellow }
state = { Stop, Drive, Slow }
time = { 0, ..., 100 }

State(state!, time)
Obs(obs!, time)

State(+s,0)
State(+s,t) ^ State(+s',t+1)
Obs(+o,t) ^ State(+s,t)

Sparse HMM: State(s,t) => State(s1,t+1) v State(s2, t+1) v ... .

Page 89

Bayesian Networks

Use all binary predicates with same first argument (the object x)
One predicate for each variable A: A(x,v!)
One clause for each line in the CPT and value of the variable
Context-specific independence: one clause for each path in the decision tree
Logistic regression: as before
Noisy OR: deterministic OR + pairwise clauses

Page 90

Relational Models

Knowledge-based model construction
  Allow only Horn clauses
  Same as Bayes nets, except arbitrary relations
  Combining function: logistic regression, noisy-OR, or external

Stochastic logic programs
  Allow only Horn clauses
  Weight of clause = log(p)
  Add formulas: Head holds => exactly one body holds

Probabilistic relational models
  Allow only binary relations
  Same as Bayes nets, except first argument can vary

Page 91

Relational Models

Relational Markov networks
  SQL → Datalog → first-order logic
  One clause for each state of a clique
  "+" syntax in Alchemy facilitates this

Bayesian logic
  Object = cluster of similar/related observations
  Observation constants + object constants
  Predicate InstanceOf(Obs,Obj) and clauses using it

Unknown relations: second-order Markov logic
  S. Kok & P. Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.

Page 92

Overview

Motivation
Foundational areas
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 93

Text Classification

The 56th quadrennial United States presidential election was held on November 4, 2008. Outgoing Republican President George W. Bush's policies and actions and the American public's desire for change were key issues throughout the campaign. ……

The Chicago Bulls are an American professional basketball team based in Chicago, Illinois, playing in the Central Division of the Eastern Conference in the National Basketball Association (NBA). ……

……

Topic = politics

Topic = sports

Page 94

Text Classification

page = {1, ..., max}
word = { ... }
topic = { ... }

Topic(page,topic)
HasWord(page,word)

Topic(p,t)
HasWord(p,+w) => Topic(p,+t)

If topics mutually exclusive: Topic(page,topic!)

Page 95

Text Classification

page = {1, ..., max}
word = { ... }
topic = { ... }

Topic(page,topic)
HasWord(page,word)
Links(page,page)

Topic(p,t)
HasWord(p,+w) => Topic(p,+t)
Topic(p,t) ^ Links(p,p') => Topic(p',t)

Cf. S. Chakrabarti, B. Dom & P. Indyk, "Hypertext Classification Using Hyperlinks," in Proc. SIGMOD-1998.

Page 96

Entity Resolution

AUTHOR: H. POON & P. DOMINGOS
TITLE: UNSUPERVISED SEMANTIC PARSING
VENUE: EMNLP-09

AUTHOR: Hoifung Poon and Pedro Domings
TITLE: Unsupervised semantic parsing
VENUE: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

AUTHOR: Poon, Hoifung and Domings, Pedro
TITLE: Unsupervised ontology induction from text
VENUE: Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational Linguistics

AUTHOR: H. Poon, P. Domings
TITLE: Unsupervised ontology induction
VENUE: ACL-10

SAME?

Page 97

Entity Resolution

Problem: given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')

Page 98

Entity Resolution

Problem: given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r") => SameRecord(r,r")

Cf. A. McCallum & B. Wellner, "Conditional Models of Identity Uncertainty with Application to Noun Coreference," in Adv. NIPS 17, 2005.

Page 99

Entity Resolution

Can also resolve fields:

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(+t,+f,r) ^ HasToken(+t,+f,r') => SameField(f,r,r')
SameField(f,r,r') <=> SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r") => SameRecord(r,r")
SameField(f,r,r') ^ SameField(f,r',r") => SameField(f,r,r")

More: P. Singla & P. Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.

Page 100

UNSUPERVISED SEMANTIC PARSING. H. POON & P. DOMINGOS. EMNLP-2009.

Unsupervised Semantic Parsing, Hoifung Poon and Pedro Domingos. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL.

Information Extraction

Page 101

UNSUPERVISED SEMANTIC PARSING. H. POON & P. DOMINGOS. EMNLP-2009.

Unsupervised Semantic Parsing, Hoifung Poon and Pedro Domingos. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL.

Information Extraction

Author Title Venue

SAME?

Page 102

Information Extraction

Problem: extract database from text or semi-structured sources
Example: extract database of publications from citation list(s) (the "CiteSeer problem")
Two steps:
  Segmentation: use HMM to assign tokens to fields
  Entity resolution: use logistic regression and transitivity

Page 103

Information Extraction

Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ InField(i+1,+f,c)

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
SameField(+f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c") => SameField(f,c,c")
SameCit(c,c') ^ SameCit(c',c") => SameCit(c,c")

Page 104

Information Extraction

Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(+t,i,c) => InField(i,+f,c)
InField(i,+f,c) ^ !Token(".",i,c) ^ InField(i+1,+f,c)

Token(+t,i,c) ^ InField(i,+f,c) ^ Token(+t,i',c') ^ InField(i',+f,c') => SameField(+f,c,c')
SameField(+f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c") => SameField(f,c,c")
SameCit(c,c') ^ SameCit(c',c") => SameCit(c,c")

More: H. Poon & P. Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.

Page 105

Biomedical Text Mining

Traditionally, named entity recognition or information extraction
  E.g., protein recognition, protein-protein interaction identification
BioNLP-09 shared task: nested bio-events
  Much harder than traditional IE
  Top F1 around 50%
  Naturally calls for joint inference

Page 106

Bio-Event Extraction

Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ...

[Event graph: "involvement" has Theme "up-regulation" and Cause "activation"; "up-regulation" has Theme IL-10, Cause gp41, and Site "human monocyte"; "activation" has Theme p70(S6)-kinase]

Page 107

Bio-Event Extraction

Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,+w) => EvtType(i,+t)
Token(j,w) ^ DepEdge(i,j,+d) => EvtType(i,+t)
DepEdge(i,j,+d) => InArgPath(i,j,+a)
Token(i,+w) ^ DepEdge(i,j,+d) => InArgPath(i,j,+a)
…

(Essentially logistic regression)

Page 108

Bio-Event Extraction

Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,+w) => EvtType(i,+t)
Token(j,w) ^ DepEdge(i,j,+d) => EvtType(i,+t)
DepEdge(i,j,+d) => InArgPath(i,j,+a)
Token(i,+w) ^ DepEdge(i,j,+d) => InArgPath(i,j,+a)
…

InArgPath(i,j,Theme) => IsProtein(j) v (Exist k k!=i ^ InArgPath(j,k,Theme)).
…

Adding a few joint inference rules doubles the F1

More: H. Poon and L. Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", 10:40 am, June 4, Gold Room.

Page 109

Temporal Information Extraction

Identify event times and temporal relations (BEFORE, AFTER, OVERLAP)
E.g., who is the President of the U.S.A.?
  Obama: 1/20/2009 – present
  G. W. Bush: 1/20/2001 – 1/19/2009
  Etc.

Page 110

DepEdge(position, position, dependency)
Event(position, event)
After(event, event)

DepEdge(i,j,+d) ^ Event(i,p) ^ Event(j,q) => After(p,q)

After(p,q) ^ After(q,r) => After(p,r)

Temporal Information Extraction

Page 111

DepEdge(position, position, dependency)
Event(position, event)
After(event, event)
Role(position, position, role)

DepEdge(i,j,+d) ^ Event(i,p) ^ Event(j,q) => After(p,q)
Role(i,j,ROLE-AFTER) ^ Event(i,p) ^ Event(j,q) => After(p,q)

After(p,q) ^ After(q,r) => After(p,r)

More:

K. Yoshikawa, S. Riedel, M. Asahara and Y. Matsumoto, “Jointly Identifying Temporal Relations with Markov Logic”, in Proc. ACL-2009.

X. Ling & D. Weld, “Temporal Information Extraction”, in Proc. AAAI-2010.

Temporal Information Extraction

Page 112

Semantic Role Labeling

Problem: identify arguments for a predicate
Two steps:
  Argument identification: determine whether a phrase is an argument
  Role classification: determine the type of an argument (agent, theme, temporal, adjunct, etc.)

Page 113

Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)

Token(i,+t) => IsPredicate(i)
DepPath(i,j,+p) => Role(i,j,+r)

HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)

Cf. K. Toutanova, A. Haghighi, C. Manning, “A global joint model for semantic role labeling”, in Computational Linguistics 2008.

Semantic Role Labeling

Page 114

Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)
Sense(position, sense!)

Token(i,+t) => IsPredicate(i)
DepPath(i,j,+p) => Role(i,j,+r)
Sense(i,s) => IsPredicate(i)

HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)
Token(i,+t) ^ Role(i,j,+r) => Sense(i,+s)

More: I. Meza-Ruiz & S. Riedel, “Jointly Identifying Predicates, Arguments and Senses using Markov Logic”, in Proc. NAACL-2009.

Joint Semantic Role Labeling and Word Sense Disambiguation

Page 115

Practical Tips: Modeling

Add all unit clauses (the default)
How to handle uncertain data: R(x,y) ^ R'(x,y) (the "HMM trick")
Implications vs. conjunctions
  For soft correlation, conjunctions are often better
  Implication A => B is equivalent to !(A ^ !B); it shares cases with others like A => C and makes learning unnecessarily harder

Page 116

Practical Tips: Efficiency

Open/closed world assumptions
Low clause arities
Low numbers of constants
Short inference chains

Page 117

Practical Tips: Development

Start with easy components
Gradually expand to full task
Use the simplest MLN that works
Cycle: add/delete formulas, learn and test

Page 118

Overview

Motivation
Foundational areas
Markov logic
NLP applications
  Basics
  Supervised learning
  Unsupervised learning

Page 119

Unsupervised Learning: Why?

Virtually unlimited supply of unlabeled text
Labeling is expensive (cf. Penn Treebank)
Often difficult to label with consistency and high quality (e.g., semantic parses)
Emerging field: machine reading
  Extract knowledge from unstructured text with high precision/recall and minimal human effort
Check out the LBR Workshop (WS9) on Sunday

Page 120

Unsupervised Learning: How?

I.i.d. learning: sophisticated model requires more labeled data
Statistical relational learning: sophisticated model may require less labeled data
  Relational dependencies constrain problem space
  One formula is worth a thousand labels
Small amount of domain knowledge => large-scale joint inference

Page 121

Unsupervised Learning: How?

Ambiguities vary among objects
=> Joint inference: propagate information from unambiguous objects to ambiguous ones
E.g.:

G. W. Bush …
He …
Mrs. Bush …

Are they coreferent?

Page 122

Unsupervised Learning: How?

Ambiguities vary among objects
=> Joint inference: propagate information from unambiguous objects to ambiguous ones
E.g.:

G. W. Bush …
He …
Mrs. Bush …

"G. W. Bush" and "He": should be coreferent

Page 123

Unsupervised Learning: How?

Ambiguities vary among objects
=> Joint inference: propagate information from unambiguous objects to ambiguous ones
E.g.:

G. W. Bush …
He …
Mrs. Bush …

So "G. W. Bush" must be singular male!

Page 124

Unsupervised Learning: How?

Ambiguities vary among objects
=> Joint inference: propagate information from unambiguous objects to ambiguous ones
E.g.:

G. W. Bush …
He …
Mrs. Bush …

"Mrs. Bush" must be singular female!

Page 125

Unsupervised Learning: How?

Ambiguities vary among objects
=> Joint inference: propagate information from unambiguous objects to ambiguous ones
E.g.:

G. W. Bush …
He …
Mrs. Bush …

Verdict: "G. W. Bush" and "Mrs. Bush" are not coreferent!

Page 126

Parameter Learning

Marginalize out hidden variables z:

\frac{\partial}{\partial w_i} \log P(x) = E_{z \mid x}\left[ n_i(x, z) \right] - E_{x, z}\left[ n_i(x, z) \right]

E_{z|x}: sum over z, conditioned on observed x
E_{x,z}: summed over both x and z

Use MC-SAT to approximate both expectations
May also combine with contrastive estimation [Poon & Cherry & Toutanova, NAACL-2009]
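A schematic Python sketch of this gradient, with both expectations approximated by averaging over samples; the two samplers (in the slides, MC-SAT runs over z given x, and over x and z jointly) are passed in as stubs, and n(x, z) returning the vector of true-grounding counts is an assumption.

```python
def hidden_var_gradient(w, x, n, sample_z_given_x, sample_xz, k=100):
    """n(x, z) -> list of true-grounding counts, one per formula."""
    dim = len(w)
    e_cond = [0.0] * dim        # E_{z|x}[ n_i(x, z) ]
    e_full = [0.0] * dim        # E_{x,z}[ n_i(x, z) ]
    for _ in range(k):
        z = sample_z_given_x(x, w)
        x2, z2 = sample_xz(w)
        nc, nf = n(x, z), n(x2, z2)
        for i in range(dim):
            e_cond[i] += nc[i] / k
            e_full[i] += nf[i] / k
    return [c - f for c, f in zip(e_cond, e_full)]
```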

Page 127

Unsupervised Coreference Resolution

Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)

Mixture model:
MentionOf(+m,+e)
Type(+m,+t)
Head(+m,+h) ^ MentionOf(+m,+e)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
… (similarly for Number, Gender, etc.)

Page 128

Unsupervised Coreference Resolution

Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)
Apposition(mention, mention)

Mixture model:
MentionOf(+m,+e)
Type(+m,+t)
Head(+m,+h) ^ MentionOf(+m,+e)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
… (similarly for Number, Gender, etc.)

Joint inference formulas (leverage apposition):
Apposition(a,b) => (MentionOf(a,e) <=> MentionOf(b,e))

More: H. Poon and P. Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.

Page 129

Relational Clustering: Discover Unknown Predicates

Cluster relations along with objects
Use second-order Markov logic [Kok & Domingos, 2007, 2008]
Key idea: cluster combination determines likelihood of relations
  InClust(r,+c) ^ InClust(x,+a) ^ InClust(y,+b) => r(x,y)
Input: relational tuples extracted by TextRunner [Banko et al., 2007]
Output: semantic network

Page 130

Recursive Relational Clustering

Unsupervised semantic parsing [Poon & Domingos, EMNLP-2009]
Text => Knowledge
  Start directly from text
  Identify meaning units + resolve variations
  Use high-order Markov logic (variables over arbitrary lambda forms and their clusters)
End-to-end machine reading: read text, then answer questions

Page 131

Semantic Parsing

IL-4 protein induces CD11b

[Figure: the dependency tree (induces –nsubj→ protein –nn→ IL-4; induces –dobj→ CD11b) is mapped to the logical form INDUCE(e1), INDUCER(e1,e2), INDUCED(e1,e3), IL-4(e2), CD11B(e3)]

Structured prediction: partition + assignment

Page 132

Challenge: Same Meaning, Many Variations

IL-4 up-regulates CD11b

Protein IL-4 enhances the expression of CD11b

CD11b expression is induced by IL-4 protein

The cytokine interleukin-4 induces CD11b expression

IL-4’s up-regulation of CD11b, …

……

Page 133

Unsupervised Semantic Parsing

USP: recursively cluster arbitrary expressions composed with/by similar expressions

IL-4 induces CD11b

Protein IL-4 enhances the expression of CD11b

CD11b expression is enhanced by IL-4 protein

The cytokine interleukin-4 induces CD11b expression

IL-4’s up-regulation of CD11b, …

Page 134

Unsupervised Semantic Parsing

USP: recursively cluster arbitrary expressions composed with/by similar expressions

IL-4 induces CD11b

Protein IL-4 enhances the expression of CD11b

CD11b expression is enhanced by IL-4 protein

The cytokine interleukin-4 induces CD11b expression

IL-4’s up-regulation of CD11b, …

Cluster same forms at the atom level

Page 135

Unsupervised Semantic Parsing

USP: recursively cluster arbitrary expressions composed with/by similar expressions
Cluster forms in composition with same forms

IL-4 induces CD11b

Protein IL-4 enhances the expression of CD11b

CD11b expression is enhanced by IL-4 protein

The cytokine interleukin-4 induces CD11b expression

IL-4’s up-regulation of CD11b, …

Page 136

Unsupervised Semantic Parsing

USP: recursively cluster arbitrary expressions composed with/by similar expressions

IL-4 induces CD11b

Protein IL-4 enhances the expression of CD11b

CD11b expression is enhanced by IL-4 protein

The cytokine interleukin-4 induces CD11b expression

IL-4’s up-regulation of CD11b, …

Cluster forms in composition with same forms


Page 139

Unsupervised Semantic Parsing

Exponential prior on number of parameters
Event/object/property cluster mixtures:
  InClust(e,+c) ^ HasValue(e,+v)

Object/Event Cluster: INDUCE
  induces   0.1
  enhances  0.4
  …

Property Cluster: INDUCER
  Form:    nsubj 0.5, agent 0.4, …
  Filler:  IL-4 0.2, IL-8 0.1, …
  Number:  None 0.1, One 0.8, …
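As a rough illustration of how such a mixture scores an expression (a sketch assuming the multinomials above; log_score and the dictionaries are invented names, not Alchemy API):

    import math

    # Per-cluster multinomials over surface forms, from the slide.
    clusters = {
        "INDUCE":  {"induces": 0.1, "enhances": 0.4},
        "INDUCER": {"nsubj": 0.5, "agent": 0.4},
    }

    def log_score(form, cluster):
        # Log-weight of drawing this form from its cluster,
        # per InClust(e,+c) ^ HasValue(e,+v).
        return math.log(clusters[cluster][form])

    # A parse's score sums over its (form, cluster) assignments.
    parse = [("enhances", "INDUCE"), ("nsubj", "INDUCER")]
    print(sum(log_score(f, c) for f, c in parse))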

140

But … State Space Too Large

Coreference: #-clusters ≤ #-mentions
USP: #-clusters can be exp(#-tokens)
Also, meaning units are often small, and many clusters are singletons

Use combinatorial search
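To see why the space is exponential, note that any partition of an n-token dependency tree into connected subexpressions corresponds to cutting a subset of its n-1 edges; a quick sanity check (illustrative arithmetic only):

    # Each of the n-1 tree edges is independently cut or kept,
    # so there are 2^(n-1) partitions before any cluster assignment.
    def num_tree_partitions(n_tokens):
        return 2 ** (n_tokens - 1)

    for n in (10, 20, 40):
        print(n, num_tree_partitions(n))   # 512, 524288, ~5.5e11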

141

Inference: Hill-Climb Probability

Initialize
Search operator: lambda reduction

[Figure: dependency parse of "IL-4 protein induces CD11b" (induces –nsubj→ protein –nn→ IL-4; induces –dobj→ CD11B); lambda reduction composes "IL-4" and "protein" into one subexpression, with candidate cluster assignments shown as "?"]
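The outer loop is ordinary greedy hill-climbing; a minimal sketch (assuming hypothetical neighbors() and score() helpers; not the USP implementation):

    def hill_climb(parse, neighbors, score):
        # Repeatedly apply the best-scoring search operator
        # (e.g., a lambda reduction) until no move improves
        # the parse's probability.
        best, best_score = parse, score(parse)
        improved = True
        while improved:
            improved = False
            for cand in neighbors(best):
                s = score(cand)
                if s > best_score:
                    best, best_score, improved = cand, s, True
        return best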

142

Learning: Hill-Climb Likelihood

Initialize: one cluster per form
  { enhances: 1 }  { induces: 1 }  { protein: 1 }  { IL-4: 1 }

Search operators:
  MERGE   → { induces: 0.2, enhances: 0.8 }
  COMPOSE → { IL-4 protein: 1 }
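A sketch of the two operators on cluster count tables (toy Python; the counts mirror the slide, the function names are invented):

    from collections import Counter

    def merge(c1, c2):
        # MERGE: pool two clusters' counts; normalizing the result
        # yields the mixture weights (here induces 0.2, enhances 0.8).
        return c1 + c2

    def compose(form1, form2):
        # COMPOSE: treat a pair of forms as one new expression.
        return Counter({form1 + " " + form2: 1})

    print(merge(Counter({"induces": 1}), Counter({"enhances": 4})))
    print(compose("IL-4", "protein"))   # Counter({'IL-4 protein': 1})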

143

Unsupervised Ontology Induction

Limitations of USP:
  No ISA hierarchy among clusters
  Little smoothing
  Limited capability to generalize

OntoUSP [Poon & Domingos, ACL-2010]
  Extends USP to also induce an ISA hierarchy
  Joint approach to ontology induction, population, and knowledge extraction
  To appear in ACL (see you in Uppsala :-)

144

OntoUSP: modify the cluster mixture formula
  InClust(e,c) ^ ISA(c,+d) ^ HasValue(e,+v)
Hierarchical smoothing + clustering
New operator in learning: ABSTRACTION

[Figure: the ABSTRACTION operator takes sibling clusters INDUCE {induces 0.6, up-regulates 0.2} and INHIBIT {inhibits 0.4, suppresses 0.2} and creates an ISA parent {induces 0.3, enhances 0.1, inhibits 0.2, suppresses 0.1}; question: MERGE with REGULATE?]
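A rough sketch of ABSTRACTION plus hierarchical smoothing (illustrative Python; the uniform pooling and the interpolation weight are assumptions, not the OntoUSP model exactly):

    def abstraction(child_a, child_b):
        # Create an ISA parent whose distribution pools the
        # children's mass (uniform mixture, an assumption here).
        forms = set(child_a) | set(child_b)
        return {f: 0.5 * (child_a.get(f, 0) + child_b.get(f, 0)) for f in forms}

    def smooth(child, parent, lam=0.8):
        # Hierarchical smoothing: interpolate a child's distribution
        # with its ISA parent's, so rare forms get nonzero mass.
        return {f: lam * child.get(f, 0) + (1 - lam) * p
                for f, p in parent.items()}

    induce  = {"induces": 0.6, "up-regulates": 0.2}
    inhibit = {"inhibits": 0.4, "suppresses": 0.2}
    parent = abstraction(induce, inhibit)
    print(smooth(induce, parent))   # INDUCE now shares a little mass with INHIBIT's forms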

145

End of The Beginning …

Not merely a user guide to MLNs and Alchemy
Statistical relational learning:
  a growth area for machine learning and NLP

146

Future Work: Inference

Scale up inference
  Cutting-plane methods (e.g., [Riedel, 2008])
  Unify lifted inference with sampling
  Coarse-to-fine inference

Alternative technologies
  E.g., linear programming, Lagrangian relaxation

147

Future Work: Supervised Learning

Alternative optimization objectives
  E.g., max-margin learning [Huynh & Mooney, 2009]

Learning for efficient inference
  E.g., learning arithmetic circuits [Lowd & Domingos, 2008]

Structure learning: improve accuracy and scalability
  E.g., [Kok & Domingos, 2009]

148

Future Work: Unsupervised Learning

Model: learning objective, formalism, etc.
Learning: local optima, intractability, etc.
Hyperparameter tuning
Leverage available resources
  Semi-supervised learning
  Multi-task learning
  Transfer learning (e.g., domain adaptation)

Human in the loop
  E.g., interactive ML, active learning, crowdsourcing

149

Future Work: NLP Applications

Existing application areas:
  More joint inference opportunities
  Additional domain knowledge
  Combine multiple pipeline stages

A “killer app”: machine reading
Many, many more awaiting YOU to discover

150

Summary

We need to unify logical and statistical NLP
Markov logic provides a language for this
  Syntax: weighted first-order formulas
  Semantics: feature templates of Markov nets
  Inference: satisfiability, MCMC, lifted BP, etc.
  Learning: pseudo-likelihood, VP, PSCG, ILP, etc.

Growing set of NLP applications
Open-source software: Alchemy

Book: Domingos & Lowd, Markov Logic, Morgan & Claypool, 2009.

alchemy.cs.washington.edu

151

References

[Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni, "Open Information Extraction from the Web", In Proc. IJCAI-2007.

[Chakrabarti et al., 1998] Soumen Chakrabarti, Byron Dom, Piotr Indyk, "Hypertext Classification Using Hyperlinks", In Proc. SIGMOD-1998.

[Damien et al., 1999] Paul Damien, Jon Wakefield, Stephen Walker, "Gibbs sampling for Bayesian non-conjugate and hierarchical models by auxiliary variables", Journal of the Royal Statistical Society B, 61:2.

[Domingos & Lowd, 2009] Pedro Domingos and Daniel Lowd, Markov Logic, Morgan & Claypool.

[Friedman et al., 1999] Nir Friedman, Lise Getoor, Daphne Koller, Avi Pfeffer, "Learning probabilistic relational models", In Proc. IJCAI-1999.

152

References

[Halpern, 1990] Joe Halpern, "An analysis of first-order logics of probability", Artificial Intelligence 46.

[Huynh & Mooney, 2009] Tuyen Huynh and Raymond Mooney, "Max-Margin Weight Learning for Markov Logic Networks", In Proc. ECML-2009.

[Kautz et al., 1997] Henry Kautz, Bart Selman, Yuejun Jiang, "A general stochastic approach to solving problems with hard and soft constraints", in The Satisfiability Problem: Theory and Applications, AMS.

[Kok & Domingos, 2007] Stanley Kok and Pedro Domingos, "Statistical Predicate Invention", In Proc. ICML-2007.

[Kok & Domingos, 2008] Stanley Kok and Pedro Domingos, "Extracting Semantic Networks from Text via Relational Clustering", In Proc. ECML-2008.

153

References

[Kok & Domingos, 2009] Stanley Kok and Pedro Domingos, "Learning Markov Logic Network Structure via Hypergraph Lifting", In Proc. ICML-2009.

[Ling & Weld, 2010] Xiao Ling and Daniel S. Weld, "Temporal Information Extraction", In Proc. AAAI-2010.

[Lowd & Domingos, 2007] Daniel Lowd and Pedro Domingos, "Efficient Weight Learning for Markov Logic Networks", In Proc. PKDD-2007.

[Lowd & Domingos, 2008] Daniel Lowd and Pedro Domingos, "Learning Arithmetic Circuits", In Proc. UAI-2008.

[Meza-Ruiz & Riedel, 2009] Ivan Meza-Ruiz and Sebastian Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", In Proc. NAACL-2009.

154

References

[Muggleton, 1996] Stephen Muggleton, "Stochastic logic programs", In Proc. ILP-1996.

[Nilsson, 1986] Nils Nilsson, "Probabilistic logic", Artificial Intelligence 28.

[Page et al., 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Tech. Rept., Stanford University, 1998.

[Poon & Domingos, 2006] Hoifung Poon and Pedro Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies", In Proc. AAAI-2006.

[Poon & Domingos, 2007] Hoifung Poon and Pedro Domingos, "Joint Inference in Information Extraction", In Proc. AAAI-2007.

155

References

[Poon & Domingos, 2008a] Hoifung Poon, Pedro Domingos, Marc Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC", In Proc. AAAI-2008.

[Poon & Domingos, 2008b] Hoifung Poon and Pedro Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", In Proc. EMNLP-2008.

[Poon & Domingos, 2009] Hoifung Poon and Pedro Domingos, "Unsupervised Semantic Parsing", In Proc. EMNLP-2009.

[Poon & Cherry & Toutanova, 2009] Hoifung Poon, Colin Cherry, Kristina Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models", In Proc. NAACL-2009.

156

References

[Poon & Vanderwende, 2010] Hoifung Poon and Lucy Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", In Proc. NAACL-2010.

[Poon & Domingos, 2010] Hoifung Poon and Pedro Domingos, "Unsupervised Ontology Induction from Text", In Proc. ACL-2010.

[Riedel, 2008] Sebastian Riedel, "Improving the Accuracy and Efficiency of MAP Inference for Markov Logic", In Proc. UAI-2008.

[Riedel et al., 2009] Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi and Jun'ichi Tsujii, "A Markov Logic Approach to Bio-Molecular Event Extraction", In Proc. BioNLP 2009 Shared Task.

[Selman et al., 1996] Bart Selman, Henry Kautz, Bram Cohen, "Local search strategies for satisfiability testing", in Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, AMS.

157

References

[Singla & Domingos, 2006a] Parag Singla and Pedro Domingos, "Memory-Efficient Inference in Relational Domains", In Proc. AAAI-2006.

[Singla & Domingos, 2006b] Parag Singla and Pedro Domingos, "Entity Resolution with Markov Logic", In Proc. ICDM-2006.

[Singla & Domingos, 2007] Parag Singla and Pedro Domingos, "Markov Logic in Infinite Domains", In Proc. UAI-2007.

[Singla & Domingos, 2008] Parag Singla and Pedro Domingos, "Lifted First-Order Belief Propagation", In Proc. AAAI-2008.

[Taskar et al., 2002] Ben Taskar, Pieter Abbeel, Daphne Koller, "Discriminative probabilistic models for relational data", In Proc. UAI-2002.

158

References

[Toutanova & Haghighi & Manning, 2008] Kristina Toutanova, Aria Haghighi, Chris Manning, "A global joint model for semantic role labeling", Computational Linguistics.

[Wang & Domingos, 2008] Jue Wang and Pedro Domingos, "Hybrid Markov Logic Networks", In Proc. AAAI-2008.

[Wellman et al., 1992] Michael Wellman, John S. Breese, Robert P. Goldman, "From knowledge bases to decision models", Knowledge Engineering Review 7.

[Yoshikawa et al., 2009] Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara and Yuji Matsumoto, "Jointly Identifying Temporal Relations with Markov Logic", In Proc. ACL-2009.