Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
Department of Systems and Computer Science
Università di Firenze, Italy
http://www.dsi.unifi.it/neural/

Massimiliano Pontil
Department of Computer Science
University College London, UK
ANNPR 2003, Florence 12-13 September 2003
Structured Data
In many applications it is useful to represent the objects of the domain by structured data (trees, graphs, …), which better capture the important relationships between the sub-parts that compose an object.
Natural Language: Parse Trees
[Figure: parse tree for the sentence "He was previous vice president.", with nodes S, VP, NP, ADVP, PRP, VBD, RB, NN]
Structural Genomics: Protein Contact Maps

[Figure: a protein contact map]
Document Processing: XY-Trees
[Figure: an XY-tree, in which a document page is recursively cut into regions; each node carries the (x, y) extent of its region, e.g. "1 0.00 0.23 0.00 1.00"]
Predictive Toxicology, QSAR: Chemical Compounds as Graphs

[Figure: the compound CH3–CH(CH3)–CH2–CH2–CH3 represented as the labeled tree CH3(CH(CH3,CH2(CH2(CH3)))), with each atom label encoded as a vector: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1]))))]
Ranking vs. Preference
Ranking: impose a total order on all the alternatives.
Preference: single out the best alternative in a set.

[Figure: five numbered items arranged by rank]
Preference on Structured Data
Classification, Regression and Ranking
Supervised learning task f : X → Y

Classification: Y is a finite unordered set
Regression: Y is a metric space (the reals)
Ranking and Preference: Y is a finite ordered set in a non-metric space

The target space is therefore unordered for classification, metric for regression, and finite, ordered and non-metric for ranking and preference.
Learning on Structured Data
Learning algorithms on discrete structures often derive from vector-based methods.
Both Kernel Machines and RNNs are suitable for learning on structured domains.

[Figure: structured data is mapped to a vector representation and then fed to conventional learning algorithms]
Kernels vs. RNNs
Kernel Machines:
- Very high-dimensional feature space
- How to choose the kernel? Prior knowledge, fixed representation
- Minimize a convex functional (SVM)

Recursive Neural Networks:
- Low-dimensional space
- Task-driven: the representation depends on the specific learning task
- Learn an implicit encoding of the relevant information
- Problem of local minima
A Kernel for Labeled Trees
Feature space: the set of all tree fragments (subtrees), with the only constraint that a node cannot be separated from its children.

Φn(t) = # occurrences of tree fragment n in t

A tree is represented as a bag of fragments:
Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …]

K(t,s) = Φ(t)·Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001)
[Figure: a small labeled tree over the symbols A, B, C and the tree fragments that contribute to its feature map Φ]
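The dynamic-programming evaluation of K(t,s) can be sketched in a few lines. This is an illustrative Python rendering of the Collins & Duffy recursion, with trees as nested `(label, children…)` tuples; all function names are made up for the sketch:

```python
from functools import cache

def is_leaf(n):
    return len(n) == 1

def same_production(n1, n2):
    # Same node label and same ordered sequence of child labels
    return (n1[0] == n2[0] and len(n1) == len(n2)
            and all(a[0] == b[0] for a, b in zip(n1[1:], n2[1:])))

@cache
def C(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    if is_leaf(n1) or is_leaf(n2) or not same_production(n1, n2):
        return 0
    result = 1
    for a, b in zip(n1[1:], n2[1:]):
        result *= 1 + C(a, b)   # each child fragment may be included or not
    return result

def internal_nodes(t):
    if not is_leaf(t):
        yield t
        for c in t[1:]:
            yield from internal_nodes(c)

def tree_kernel(t, s):
    """K(t, s) = sum of C(n1, n2) over all node pairs."""
    return sum(C(a, b) for a in internal_nodes(t) for b in internal_nodes(s))
```

For example, `tree_kernel(t, t)` for the tree `('NP', ('DT', ('the',)), ('NN', ('dog',)))` counts its 6 fragments; memoizing `C` is what makes the evaluation dynamic programming rather than exponential recursion.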
Recursive Neural Networks
Composition of two adaptive functions:
- φ, the transition function
- o, the output function

Both φ and o are implemented by feedforward neural networks.
Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data.
[Figure: a labeled tree is processed bottom-up by the transition function φw : X → Rⁿ, and the root representation is mapped into the output space by ow′ : Rⁿ → O]
Recursive Neural Networks
[Figure: a labeled tree (nodes A–E) is unfolded into a network; the prediction phase computes the output through the output network, and the error-correction phase backpropagates through the unfolded structure]
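The bottom-up unfolding can be sketched as follows. This is a minimal numpy illustration assuming binary trees, one-hot node labels, and tanh transition units; all weights, dimensions, and names are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N_LABELS, STATE = 5, 8                     # label alphabet size, state dimension n
W_label = rng.normal(0, 0.1, (STATE, N_LABELS))
W_left = rng.normal(0, 0.1, (STATE, STATE))
W_right = rng.normal(0, 0.1, (STATE, STATE))
w_out = rng.normal(0, 0.1, STATE)

def phi(tree):
    """Transition function: map a (label, left, right) tree to a state in R^n,
    recursively combining the states of the children (zero state for absent ones)."""
    label, left, right = tree
    x = np.zeros(N_LABELS)
    x[label] = 1.0                         # one-hot encoding of the node label
    s_left = phi(left) if left is not None else np.zeros(STATE)
    s_right = phi(right) if right is not None else np.zeros(STATE)
    return np.tanh(W_label @ x + W_left @ s_left + W_right @ s_right)

def utility(tree):
    """Output function o: map the root state to a scalar utility."""
    return float(w_out @ phi(tree))
```

The unfolded network shares the same transition weights at every node, so the "unfolding" is just this recursion; training backpropagates the output error through the recursive calls.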
Preference Models
Kernel Preference Model: binary classification of pairwise differences between instances.
RNN Preference Model: a probabilistic model to find the best alternative.
Both models use a utility function to evaluate the importance of an element.
Utility Function Approach
Model the importance of an object with a utility function U : X → R, so that x > z ↔ U(x) > U(z).
If U is linear: U(x) > U(z) ↔ wᵀx > wᵀz.
U can also be modelled by a neural network.
Ranking and preference problems: learn U, then sort by U(x).

[Figure: two objects with utilities U(z) = 3 and U(x) = 11, so x is preferred]
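The "learn U, then sort" recipe amounts to a one-liner once U is available. A minimal sketch with an illustrative linear utility (the weight vector and feature vectors below are made up):

```python
import numpy as np

w = np.array([1.0, 2.0])                   # learned utility weights (illustrative)
items = [np.array([3.0, 4.0]),             # U = 11
         np.array([1.0, 1.0]),             # U = 3
         np.array([0.0, 2.0])]             # U = 4

# Sort the alternatives by decreasing utility U(x) = w.T @ x
ranked = sorted(items, key=lambda x: float(w @ x), reverse=True)
# The item with utility 11 comes first, then 4, then 3.
```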
Kernel Preference Model
Let x1 be the best of (x1, …, xr).
Create a set of pairs between x1 and each of x2, …, xr.
If U is linear, this yields the constraints
U(x1) > U(xj) ↔ wᵀx1 > wᵀxj ↔ wᵀ(x1 − xj) > 0 for j = 2, …, r,
so each difference x1 − xj can be seen as a positive example.
Binary classification of differences between instances: with x → Φ(x), the process is easily kernelized.
Note: this model does not take all the alternatives into consideration together, but only two at a time.
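The construction of the pairwise training set can be sketched as follows. This is a minimal illustration assuming the feature maps Φ(xi) are available as plain vectors; `preference_to_pairs` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def preference_to_pairs(phis):
    """phis[0] is the feature vector Φ(x1) of the correct alternative.
    Returns the positive examples Φ(x1) - Φ(xj) for j = 2..r."""
    best = phis[0]
    return [best - other for other in phis[1:]]

# A linear utility w then just has to satisfy w @ d > 0 for every difference d.
```

In the kernelized version these differences are never formed explicitly; the learner only evaluates K on pairs of trees.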
RNNs Preference Model
Set of alternatives (x1, x2, …, xr).
U is modelled by a recursive neural network architecture.
Compute U(xi) = o(φ(xi)) for i = 1, …, r.

Softmax function:
oi = e^U(xi) / Σ_{j=1}^{r} e^U(xj)

The error (yi − oi) is backpropagated through the whole network.
Note: the softmax function compares all the alternatives together at once.
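The output layer described above can be sketched directly, assuming the utilities U(xi) have already been computed by the recursive network (the values below are illustrative):

```python
import numpy as np

def softmax_preference(utilities):
    """Softmax over the utilities of all r alternatives, compared jointly."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())        # subtract the max for numerical stability
    return e / e.sum()

o = softmax_preference([2.0, 1.0, 0.5])
# With the target y = [1, 0, 0] (alternative 1 is the correct one), the error
# signal backpropagated through the whole network is simply y - o.
y = np.array([1.0, 0.0, 0.0])
delta = y - o
```

Because every oi depends on all r utilities through the normalization, the gradient couples the alternatives, unlike the pairwise kernel model.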
Learning Problems
First Pass Attachment: modelling of a psycholinguistic phenomenon.
Reranking Task: reranking the parse trees output by a statistical parser.
First Pass Attachment (FPA)
The grammar introduces ambiguities: there is a set of alternatives for each word, but only one is correct.
First pass attachment can therefore be modelled as a preference problem.
[Figure: incremental parse of "It has no bearing …" (PRP VBZ DT NN …); when the next word "on" arrives, the grammar licenses several attachment alternatives (NP, PP, ADVP, ADJP, QP, SBAR, PRN, …), only one of which is correct]
Heuristics for Prediction Enhancement

Specializing the FPA prediction for each class of words: group the words into 10 classes (verbs, articles, …) and learn a different classifier for each class.
Tree reduction: removing the nodes of the parse tree that are not important for choosing between the different alternatives.

Evaluation measure = (# correct trees ranked in first position) / (total number of sets)
Experimental Setup
Wall Street Journal (WSJ) section of the Penn Treebank, a realistic corpus of natural language:
- 40,000 sentences, 1 million words; average sentence length: 25 words
- Standard benchmark in computational linguistics
- Training on sections 2–21, test on section 23, validation on section 24
Voted Perceptron (VP)
FPA on the WSJ yields 100 million trees for training, so the Voted Perceptron is used instead of an SVM (Freund & Schapire, Machine Learning 1999).
- An online algorithm for binary classification based on the perceptron algorithm (simple and efficient)
- Prediction value: a weighted sum over all training weight vectors
- Performance comparable to maximal-margin classifiers (SVM)
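The algorithm can be sketched as follows. This is a minimal vectorial illustration of the Voted Perceptron; in the paper the dot products are replaced by the tree kernel, and all names here are made up for the sketch:

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """X: (m, d) examples, y: labels in {-1, +1}.
    Returns the list of (weight vector, survival count) pairs."""
    w = np.zeros(X.shape[1])
    survivors = []
    c = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:         # mistake: retire w, start a new vector
                survivors.append((w.copy(), c))
                w = w + label * x
                c = 1
            else:
                c += 1                       # w survives one more example
    survivors.append((w.copy(), c))
    return survivors

def predict(survivors, x):
    # Weighted vote of all training weight vectors, weighted by survival count
    vote = sum(c * np.sign(w @ x) for w, c in survivors)
    return 1 if vote >= 0 else -1
```

Each weight vector votes with a weight equal to the number of examples it survived, which is what gives the algorithm its margin-like behaviour at negligible extra cost over the plain perceptron.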
Kernel VP vs. RNNs

[Figure: learning curves (accuracy vs. training-set size, 100 to 40,000 sentences) comparing VP and RNN on the four most frequent word classes: Noun (33%), Verb (13.4%), Preposition (12.6%), Article (12.5%)]
Kernel VP vs. RNNs

[Figure: the same learning curves for the remaining word classes: Punctuation (11.7%), Adjective (7.5%), Adverb (4.3%), Conjunction (2.3%)]
Kernel VP vs. RNNs: Modularization

[Figure: overall learning curve (accuracy vs. training-set size, 100 to 40,000 sentences) for VP and RNN with modularization]
Small Datasets, No Modularization

[Figure: accuracy of VP and RNN on 5 independent splits of 100 sentences; VP average 75.4, RNN average 77]
Complexity Comparison
VP does not scale linearly with the number of training examples, as the RNNs do.

Computational cost:
- Small datasets (5 splits of 100 sentences, about a week @ 2 GHz CPU): CPU(VP) ≈ CPU(RNN)
- Large datasets (all 40,000 sentences): VP took over 2 months to complete one epoch @ 2 GHz CPU, while the RNN learns in 1–2 epochs, about 3 days @ 2 GHz CPU
- VP is smooth with respect to the training iterations
Reranking Task
Reranking problem: rerank the parse trees generated by a statistical parser.
Same problem setting as FPA (preference over forests), but with 1 forest per sentence instead of 1 forest per word, so less computational cost is involved.
Evaluation: Parseval Measures
Standard evaluation measures:
- Labeled Precision (LP)
- Labeled Recall (LR)
- Crossing Brackets (CBs)

They compare the parse produced by the parser with a hand-annotated parse of the sentence.
Reranking Task
≤ 40 Words (2245 sentences):
Model  LR    LP    CBs   0 CBs  ≤2 CBs
VP     89.1  89.4  0.85  69.3   88.2
RNN    89.2  89.5  0.84  67.9   88.4

≤ 100 Words (2416 sentences):
Model  LR    LP    CBs   0 CBs  ≤2 CBs
VP     88.6  88.9  0.99  66.5   86.3
RNN    88.6  88.9  0.98  64.8   86.3
Why Do RNNs Outperform Kernel VP?

Hypothesis 1: the kernel function induces a feature space that is not focused on the specific learning task.
Hypothesis 2: the kernel preference model is worse than the RNN preference model.
Linear VP on RNN Representation
Checking Hypothesis 1: train VP on the RNN representation.
- The tree kernel is replaced by a linear kernel.
- The state-vector representation of the parse trees generated by the RNN is used as input to VP.
- Linear VP is trained on the RNN state vectors.
Linear VP on RNN Representation
[Figure: accuracy on 5 independent splits of 100 sentences; VP average 75.4, RNN average 77, VP on the RNN state vectors average 74.7]
Conclusions
RNNs show better generalization properties, also on small datasets, and at a smaller computational cost.
The problem is neither the kernel function nor the VP algorithm: the linear-VP-on-RNN-representation experiment rules both out.
The problem is the preference model: the kernel preference model does not take all the alternatives into consideration together, but only two at a time, as opposed to the RNN model.
Acknowledgements
Thanks to:
Alessio Ceroni Alessandro Vullo
Andrea Passerini Giovanni Soda