
Machine Learning in Natural Language


Page 1: Machine Learning  in  Natural Language

1

Machine Learning in Natural Language

1. No lecture on Thursday.
2. Instead: Monday, 4pm, 1404SC. Mark Johnson lectures on: Bayesian Models of Language Acquisition.

Page 2: Machine Learning  in  Natural Language

2

Machine Learning in Natural Language

Features and Kernels

1. The idea of kernels
• Kernel Perceptron
2. Structured Kernels
• Tree and Graph Kernels
3. Lessons
• Multi-class classification

Page 3: Machine Learning  in  Natural Language

3

Weather / Whether

x1x2x3 ∨ x1x4x3 ∨ x3x2x5   →   y1 ∨ y4 ∨ y5

The new discriminator is functionally simpler.

Embedding: can be done explicitly (generate expressive features) or implicitly (use kernels).
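To see what "functionally simpler" means, here is a minimal sketch in Python (the specific conjunctions follow the reconstructed example above and are purely illustrative): a 3-term DNF over the original variables x becomes a simple threshold of a linear sum over the monomial features y.

from itertools import product

def f_original(x):
    # 3-term DNF over the original variables
    x1, x2, x3, x4, x5 = x
    return int((x1 and x2 and x3) or (x1 and x4 and x3) or (x3 and x2 and x5))

def f_embedded(x):
    # y-features: one coordinate per monomial; the discriminator is now
    # just a linear threshold (sum >= 1) over them.
    x1, x2, x3, x4, x5 = x
    y = [x1 * x2 * x3, x1 * x4 * x3, x3 * x2 * x5]
    return int(sum(y) >= 1)

# The two discriminators agree on every Boolean input.
assert all(f_original(x) == f_embedded(x) for x in product([0, 1], repeat=5))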

Page 4: Machine Learning  in  Natural Language

4

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) )

A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.

Computing the weight vector is done in the original space.

Notice: this pertains only to efficiency. Generalization is still relative to the real dimensionality.

This is the main trick in SVMs (the algorithm is different), although many applications actually use linear kernels.

Kernel Based Methods

Page 5: Machine Learning  in  Natural Language

5

Examples: x ∈ {0,1}^n ;  Hypothesis: w ∈ R^n

f(x) = Th_θ( Σ_{i=1..n} w_i x_i )

If Class = 1 but w·x < θ:  w_i ← w_i + 1 for each i with x_i = 1   (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i - 1 for each i with x_i = 1   (demotion)

Let I be the set {t1, t2, t3, ...} of monomials (conjunctions) over the feature space x1, x2, ..., xn.

Then we can write a linear function over this new feature space:

f(x) = Th_θ( Σ_{i ∈ I} w_i t_i(x) )

Example:  x1x2x4(11010) = 1 ;  x3x4(11010) = 0

Kernel Based Methods
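The explicit route can be sketched as follows (a hypothetical helper, not from the lecture, that only enumerates conjunctions up to a fixed degree; the full space of all conjunctions is exponential in n, which is exactly the problem the kernel trick below addresses):

from itertools import combinations

def monomial_features(x, max_degree=3):
    """Explicitly embed a Boolean vector x into the space of monomials
    (conjunctions) over its coordinates, up to a fixed degree.
    Each feature t is indexed by the tuple of coordinates it conjoins
    and has value 1 iff all of those coordinates are 1 in x."""
    n = len(x)
    feats = {}
    for d in range(1, max_degree + 1):
        for idx in combinations(range(n), d):
            feats[idx] = int(all(x[i] for i in idx))
    return feats

# x = (1,1,0,1,0), i.e. the slide's 11010:
phi = monomial_features((1, 1, 0, 1, 0))
print(phi[(0, 1, 3)])   # x1*x2*x4 -> 1
print(phi[(2, 3)])      # x3*x4    -> 0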

Page 6: Machine Learning  in  Natural Language

6

Examples: x ∈ {0,1}^n ;  Hypothesis: w ∈ R^n

f(x) = Th_θ( Σ_{i ∈ I} w_i t_i(x) )

If Class = 1 but w·x < θ:  w_i ← w_i + 1 for each i with x_i = 1   (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i - 1 for each i with x_i = 1   (demotion)

Great increase in expressivity. We can run Perceptron, Winnow, or logistic regression, but the convergence bound may suffer exponential growth.

An exponential number of monomials are true in each example. Also, we would have to keep many weights.

Kernel Based Methods

Page 7: Machine Learning  in  Natural Language

7

Examples: x ∈ {0,1}^n ;  Hypothesis: w ∈ R^n

f(x) = Th_θ( Σ_{i ∈ I} w_i t_i(x) )

If Class = 1 but w·x < θ:  w_i ← w_i + 1 for each i with x_i = 1   (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i - 1 for each i with x_i = 1   (demotion)

• Consider the value of w_i used in the prediction.
• Each previous mistake, on an example z, makes an additive contribution of +/-1 to w_i, iff t_i(z) = 1.
• The value of w_i is determined by the number of mistakes on which t_i was satisfied.

The Kernel Trick(1)

Page 8: Machine Learning  in  Natural Language

8

Examples: x ∈ {0,1}^n ;  Hypothesis: w ∈ R^n

• P – set of examples on which we Promoted
• D – set of examples on which we Demoted
• M = P ∪ D

f(x) = Th_θ( Σ_{i ∈ I} w_i t_i(x) )
     = Th_θ( Σ_{i ∈ I} [ Σ_{z ∈ P, t_i(z)=1} 1  -  Σ_{z ∈ D, t_i(z)=1} 1 ] t_i(x) )
     = Th_θ( Σ_{i ∈ I} [ Σ_{z ∈ M} S(z) t_i(z) ] t_i(x) )

If Class = 1 but w·x < θ:  w_i ← w_i + 1 for each i with x_i = 1   (promotion)
If Class = 0 but w·x ≥ θ:  w_i ← w_i - 1 for each i with x_i = 1   (demotion)

The Kernel Trick(2)

Page 9: Machine Learning  in  Natural Language

9

• P – set of examples on which we Promoted
• D – set of examples on which we Demoted
• M = P ∪ D
• Where S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D. Reordering:

f(x) = Th_θ( Σ_{i ∈ I} w_i t_i(x) )
     = Th_θ( Σ_{i ∈ I} [ Σ_{z ∈ M} S(z) t_i(z) ] t_i(x) )
     = Th_θ( Σ_{z ∈ M} S(z) Σ_{i ∈ I} t_i(z) t_i(x) )

The Kernel Trick(3)

Page 10: Machine Learning  in  Natural Language

10

• S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D.

f(x) = Th_θ( Σ_{z ∈ M} S(z) Σ_{i ∈ I} t_i(z) t_i(x) )

• A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum equals the number of monomials satisfied by both x and z.

• Define a dot product in the t-space:

K(x, z) = Σ_{i ∈ I} t_i(x) t_i(z)

• We get the standard notation:

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) )

The Kernel Trick(4)

Page 11: Machine Learning  in  Natural Language

11

What does this representation give us?

We can view this kernel as the distance between x and z, measured in the t-space.

But K(x, z) can be computed in the original space, without explicitly writing down the t-representation of x and z.

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) ) ,   K(x, z) = Σ_{i ∈ I} t_i(x) t_i(z)

Kernel Based Methods

Page 12: Machine Learning  in  Natural Language

12

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) ) ,   K(x, z) = Σ_{i ∈ I} t_i(x) t_i(z)

• Consider the space of all 3^n monomials (allowing both positive and negative literals).
• Then, if same(x,z) is the number of features that have the same value in both x and z, we get:

K(x, z) = 2^{same(x,z)}      f(x) = Th_θ( Σ_{z ∈ M} S(z) 2^{same(x,z)} )

• Example: take n = 2; x = (00), z = (01), ...
• Proof: let k = same(x,z); for each of those k features we can either (1) include the literal with the right polarity in the monomial, or (2) not include it at all.
• Other kernels can be used.

Kernel Based Methods
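A small sanity check of K(x,z) = 2^same(x,z) (a sketch; it assumes, as in the counting argument above, that the empty monomial is included, and the brute-force enumeration is only feasible for tiny n):

from itertools import product

def k_monomial(x, z):
    """K(x, z) = 2^same(x, z): the number of monomials over positive and
    negative literals satisfied by both x and z (the empty monomial
    counts as one of them)."""
    same = sum(xi == zi for xi, zi in zip(x, z))
    return 2 ** same

def k_explicit(x, z):
    """Brute force over all 3^n monomials: for each variable a monomial
    either omits it (0), requires it true (+1), or requires it false (-1)."""
    def satisfies(v, mono):
        return all(lit == 0 or (lit == +1) == bool(v[i]) for i, lit in enumerate(mono))
    return sum(satisfies(x, m) and satisfies(z, m)
               for m in product((0, +1, -1), repeat=len(x)))

x, z = (1, 1, 0, 1, 0), (1, 0, 0, 1, 1)     # same(x, z) = 3
assert k_monomial(x, z) == k_explicit(x, z) == 8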

Page 13: Machine Learning  in  Natural Language

13

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) ) ,   K(x, z) = Σ_{i ∈ I} t_i(x) t_i(z)

• Simply run Perceptron in an on-line mode, but keep track of the set M.

• Keeping the set M allows us to keep track of S(z).

• Rather than remembering the weight vector w, remember the set M (P and D) – all those examples on which we made mistakes.

Dual Representation

Implementation
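A minimal sketch of this dual implementation (the names are illustrative, not from the lecture; any kernel K(x,z), for instance the 2^same(x,z) kernel above, can be plugged in):

def kernel_perceptron_train(examples, kernel, theta=0.0, epochs=5):
    """Online kernel Perceptron in dual form: instead of a weight vector,
    keep the mistake set M as a list of (z, S(z)) pairs, where S(z) = +1
    for a promotion and -1 for a demotion."""
    M = []

    def predict(x):
        return sum(s * kernel(x, z) for z, s in M) >= theta

    for _ in range(epochs):
        for x, label in examples:              # label is 0 or 1
            if label == 1 and not predict(x):
                M.append((x, +1))              # promotion
            elif label == 0 and predict(x):
                M.append((x, -1))              # demotion
    return M, predict

# Toy run: learn "x1 AND x2" over {0,1}^2 with the 2^same(x,z) kernel.
k = lambda x, z: 2 ** sum(a == b for a, b in zip(x, z))
data = [((1, 1), 1), ((1, 0), 0), ((0, 1), 0), ((0, 0), 0)]
M, predict = kernel_perceptron_train(data, k, theta=0.0, epochs=5)
print([int(predict(x)) for x, _ in data])      # [1, 0, 0, 0]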

Page 14: Machine Learning  in  Natural Language

14

f(x) = Th_θ( Σ_{z ∈ M} S(z) K(x, z) )

• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.

• Computing the weight vector can still be done in the original feature space.

• Notice: this pertains only to efficiency: The classifier is identical to the one you get by blowing up the feature space.

• Generalization is still relative to the real dimensionality.

• This is the main trick in SVMs (the algorithm is different), although most applications actually use linear kernels.

Summary – Kernel Based Methods I

Page 15: Machine Learning  in  Natural Language

15

• Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature based representation of examples.

• We want to define a dot product in a high dimensional space.

• Given two examples x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn), we want to map them to a high-dimensional space [example: quadratic]:

Φ(x1, ..., xn) = (x1, ..., xn, x1^2, ..., xn^2, x1·x2, ..., x_{n-1}·xn)
Φ(y1, ..., yn) = (y1, ..., yn, y1^2, ..., yn^2, y1·y2, ..., y_{n-1}·yn)

• ...and compute the dot product A = Φ(x)·Φ(y)  [takes time ~ n^2].

• Instead, in the original space, compute B = f(x·y) = [1 + (x1, x2, ..., xn)·(y1, y2, ..., yn)]^2.

• Theorem: A = B. Coefficients do not really matter; this can be done for other functions.

Summary – Kernel Trick

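The identity can be checked numerically. The sketch below uses the standard sqrt(2)-scaled quadratic map, which is where "coefficients do not really matter" comes in: with the unscaled map written on the slide, A and B agree only up to those constant factors.

import itertools, math, random

def phi(v):
    """Quadratic feature map whose dot product reproduces (1 + x.y)^2 exactly;
    the sqrt(2) scalings are the coefficients that 'do not really matter'."""
    n = len(v)
    feats = [1.0]
    feats += [math.sqrt(2) * vi for vi in v]                  # linear terms
    feats += [vi * vi for vi in v]                            # squares
    feats += [math.sqrt(2) * v[i] * v[j]                      # cross terms
              for i, j in itertools.combinations(range(n), 2)]
    return feats

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(5)]
y = [random.uniform(-1, 1) for _ in range(5)]

A = dot(phi(x), phi(y))       # explicit map: ~n^2 coordinates
B = (1 + dot(x, y)) ** 2      # kernel form: O(n) work
assert abs(A - B) < 1e-9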

Page 16: Machine Learning  in  Natural Language

16

There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier.

For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions.

In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown up space is more efficient than using kernels.

Next: More Complicated Kernels

Efficiency-Generalization Tradeoff

Page 17: Machine Learning  in  Natural Language

17

Structured Input

[Figure: structured input examples]
G1:  S = "John will join the board as a director" (the sentence as a linear word sequence)
G2:  the same sentence represented as a labeled graph, with Word, POS, IS-A, ... labels on the nodes
Other examples:  "... afternoon, Dr. Ab C ... in Ms. De. F class ..." ;
[NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?)

Knowledge Representation

Page 18: Machine Learning  in  Natural Language

18

We want to extract features from structured domain elements; their internal (hierarchical) structure should be encoded.

A feature is a mapping from the instance space to {0,1} or [0,1].

With an appropriate representation language it is possible to represent expressive features that constitute an infinite-dimensional space [FEX]. Learning can be done in the infinite attribute domain.

What does it mean to extract features?
• Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements).
• Computationally: some kind of graph matching process.

Challenge:
• Provide the expressivity necessary to deal with large-scale and highly structured domains.
• Meet the strong tractability requirements for these tasks.

Learning From Structured Input

Page 19: Machine Learning  in  Natural Language

19

Only those descriptions that are ACTIVE in the input are listed

Michael Collins developed kernels over parse trees. Cumby/Roth developed parameterized kernels over structures.

When is it better to use a kernel vs. the primal representation?

D = (AND word (before tag))

Explicit features

Example

Page 20: Machine Learning  in  Natural Language

20

Overview – Goals (Cumby&Roth 2003)

Applying kernel learning methods to structured domains.

Develop a unified formalism for structured kernels. (Collins & Duffy, Gaertner & Lloyd, Haussler)

Flexible language that measures distance between structures with respect to a given 'substructure'.

Examine complexity & generalization across different feature sets and learners.

When does each type of feature set perform better with what learners?

Exemplify with experiments from bioinformatics & NLP. Mutagenesis, Named-Entity prediction.

Page 21: Machine Learning  in  Natural Language

21

A flexible knowledge representation for feature extraction from structured data

Domain elements are represented as labeled graphs; concept graphs correspond to FDL expressions.

FDL is formed from an alphabet of attribute, value, and role symbols.

Well-defined syntax and equivalent semantics; e.g., descriptions are defined inductively, with sensors as primitives.

Sensor: a basic description – a term of the form a(v) or a, where a = attribute symbol and v = value symbol (ground sensor). An existential sensor a describes an object that has some value for attribute a. AND clauses and (role D) clauses express relations between objects. Expressive and efficient feature extraction.

Feature Description Logic

Knowledge Representation

Page 22: Machine Learning  in  Natural Language

22

Example (Cont.) Features; Feature Generation Functions; extensions

Subsumption… (see paper) Basically:

Only those descriptions that are ACTIVE in the input are listed

The language is expressive enough to generate linguistically interesting features such as agreements, etc.

D = (AND word (before tag))

{Dθ} = {(AND word(the) (before tag(N))), (AND word(dog) (before tag(V))), (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))}

Explicit features
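A minimal sketch of grounding this description against a tagged sentence (the sentence and its tags are inferred from the groundings above and are illustrative only):

# Tags are inferred from the groundings above and are only illustrative.
sentence = [("the", "DET"), ("dog", "N"), ("ran", "V"), ("very", "ADV"), ("fast", "ADJ")]

def active_features(tagged):
    """Ground D = (AND word (before tag)) against a tagged sentence:
    pair every word with the tag of the word that follows it."""
    return ["(AND word(%s) (before tag(%s)))" % (w, next_tag)
            for (w, _), (_, next_tag) in zip(tagged, tagged[1:])]

for f in active_features(sentence):
    print(f)
# (AND word(the) (before tag(N)))
# (AND word(dog) (before tag(V)))
# (AND word(ran) (before tag(ADV)))
# (AND word(very) (before tag(ADJ)))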

Page 23: Machine Learning  in  Natural Language

23

Kernels

It is possible to define FDL-based kernels for structured data.

When using linear classifiers it is important to enhance the set of features to gain expressivity.

A common way is to blow up the feature space by generating functions of primitive features.

For some algorithms (SVM, Perceptron), kernel functions can be used to expand the feature space while still working in the original space.

• Is it worth doing in structured domains?
• Answers are not clear so far
  – Computationally: yes, when we simulate a huge space
  – Generalization: not always [Khardon, Roth, Servedio, NIPS'01; Ben David et al.]

Page 24: Machine Learning  in  Natural Language

24

Kernels in Structured Domains

We define a Kernel family K parameterized by FDL descriptions.

The definition is recursive on the structure of D [sensor, existential sensor, role description, AND].

Key: many previous structured kernels considered all substructures (e.g., Collins & Duffy 02, tree kernels); this is analogous to an exponential feature space, and leads to overfitting.

k_D(G1, G2) = Σ_{n1 ∈ N(G1)} Σ_{n2 ∈ N(G2)} k_D(n1, n2)

If the feature space is explicitly expanded, we can use algorithms such as Winnow (SNoW) [complexity and experimental results].

Generalization issues & Computation issues [if # of examples large]

Kernels

Page 25: Machine Learning  in  Natural Language

25

FDL Kernel Definition

Kernel family K parameterized by feature type descriptions. For a description D, the kernel over concept graphs is the sum over node pairs:

k_D(G1, G2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} k_D(n1, n2)

If D is a sensor s(v) that is a label of both n1 and n2, then k_D(n1, n2) = 1.

If D is an (existential) sensor s, and sensor descriptions s(v1), s(v2), ..., s(vj) are labels of both n1 and n2, then k_D(n1, n2) = j.

If D is a role description (r D'), then k_D(n1, n2) = Σ_{n1'} Σ_{n2'} k_{D'}(n1', n2'), with n1', n2' ranging over those nodes that have an r-labeled edge from n1, n2 respectively.

If D is a description (AND D1 D2 ... Dn) with l_i repetitions of any D_i, then k_D(n1, n2) is the product of the k_{D_i}(n1, n2), corrected for the repetition counts l_i.

Kernels

Page 26: Machine Learning  in  Natural Language

26

Kernel Example

D = (AND word (before word))
G1: "The dog ran very fast"
G2: "The dog ran quickly"

k_D(G1, G2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} k_D(n1, n2)

k_D(n1^the, n2^the) = k_word(n1^the, n2^the) · k_(before word)(n1^the, n2^the) = 1 · 1 = 1

k_word(n1^dog, n2^dog) = 1, etc.

The final output is 2, since there are 2 matching collocations ("the dog" and "dog ran"). This kernel can simulate Boolean kernels, as seen in Khardon, Roth et al.

Kernels
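For this particular description the kernel reduces to counting matching word bigrams, which the following sketch (hypothetical helper name, not from the paper) makes explicit:

def k_and_word_before_word(g1_words, g2_words):
    """For D = (AND word (before word)) over two word-sequence graphs,
    each node pair contributes 1 exactly when both the word and the
    following word match, so the kernel counts matching bigrams."""
    bigrams1 = list(zip(g1_words, g1_words[1:]))
    bigrams2 = list(zip(g2_words, g2_words[1:]))
    return sum(b1 == b2 for b1 in bigrams1 for b2 in bigrams2)

G1 = "The dog ran very fast".split()
G2 = "The dog ran quickly".split()
print(k_and_word_before_word(G1, G2))   # 2: "The dog" and "dog ran" match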

Page 27: Machine Learning  in  Natural Language

27

Complexity & Generalization

How does this compare in complexity and generalization to other kernels for structured data?

• For m examples, with average example size g and time t1 to evaluate the kernel, kernel Perceptron takes O(m^2 g^2 t1).
• If extracting a feature explicitly takes time t2, Perceptron takes O(m g t2). Most kernels that simulate a well-defined feature space have t1 << t2.
• By restricting the size of the expanded feature space we avoid overfitting – even SVM suffers under many irrelevant features (Weston).

Margin argument: the margin goes down when you have more features.
• Given a linearly separable set of points S = {x1, ..., xm} ⊂ R^n with separator w ∈ R^n,
• embed S into an n' > n dimensional space by adding zero-mean random noise e to the additional n' - n dimensions, s.t. w' = (w, 0) ∈ R^{n'} still separates S.
• Now w'·x_i' = (w, 0)·(x_i, e) = w·x_i, but ||x_i'|| = ||(x_i, e)|| ≥ ||x_i||, so the (normalized) margin decreases.

Analysis
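In symbols, a reconstruction of the margin argument sketched above (not verbatim from the slide):

% Adding noisy dimensions leaves the functional margin unchanged
% but inflates the example norms, so the relative margin shrinks.
\[
  w'^{\top} x_i' = (w,0)^{\top}(x_i, e) = w^{\top} x_i ,
  \qquad
  \|x_i'\|^2 = \|x_i\|^2 + \|e\|^2 \;\ge\; \|x_i\|^2 ,
\]
\[
  \frac{w'^{\top} x_i'}{\|w'\|\,\|x_i'\|}
  \;\le\;
  \frac{w^{\top} x_i}{\|w\|\,\|x_i\|} .
\]
% Mistake bounds of the form (R / gamma)^2 can therefore only get worse.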

Page 28: Machine Learning  in  Natural Language

28

Experiments

Serves as a comparison: our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features.

Bioinformatics experiment in mutagenesis prediction: 188 compounds with atom-bond data, binary prediction; 10-fold cross-validation with 12 training runs.

NLP experiment in classifying detected NEs: 4700 training and 1500 test phrases from MUC-7 (person, location, & organization).

Trained and tested kernel Perceptron and Winnow (SNoW) classifiers with the FDL kernel and the respective explicit features; also an all-subtrees kernel based on Collins & Duffy's work.

[Figures: mutagenesis concept graph; features simulated with the all-subtrees kernel]

Page 29: Machine Learning  in  Natural Language

29

Discussion

[Results table: micro-averaged accuracy]

We have a kernel that simulates the features obtained with FDL.

But the quadratic training time means it is cheaper to extract the features and learn explicitly than to run kernel Perceptron.

SVM could take (slightly) even longer, but might perform better.

Still, restricted features might work better than the larger spaces simulated by other kernels.

Can we improve on the benefits of useful features? Compile examples together? Use more sophisticated kernels than the matching kernel?

The kernel still provides a metric for similarity-based approaches.

Page 30: Machine Learning  in  Natural Language

30

Conclusion

Kernels for learning from structured data are an interesting idea.

Different kernels may expand/restrict the hypothesis space in useful ways.

We need to know the benefits and hazards: to justify these methods we must embed in a space much larger than the training set size, which can decrease the margin.

Expressive knowledge representations can be used to create features explicitly or in implicit kernel-spaces.

The data representation could allow us to plug in different base kernels to replace the matching kernel.

The parameterized kernel allows us to direct the way the feature space is blown up, so as to encode background knowledge.