Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
Department of Systems and Computer Science
Università di Firenze, Italy
http://www.dsi.unifi.it/neural/

Massimiliano Pontil
Department of Computer Science
University College London, UK
ANNPR 2003, Florence 12-13 September 2003
Structured Data
In many applications it is useful to represent the objects of the domain by structured data (trees, graphs, …), which better capture the important relationships between the sub-parts that compose an object.
Natural Language: Parse Trees
[Figure: parse tree for the sentence "He was previous vice president.", with nodes S, VP, NP, ADVP, PRP, VBD, RB, NN]
Structural Genomics: Protein Contact Maps

[Figure: a protein contact map]
Document Processing: XY-Trees
[Figure: an XY-tree, in which a document page is recursively cut into regions; each node carries the (x, y) extent of its region, e.g. "1 0.00 0.23 0.00 1.00"]
Predictive Toxicology, QSAR: Chemical Compounds as Graphs

[Figure: the compound CH3–CH(CH3)–CH2–CH2–CH3 represented as the labeled tree CH3(CH(CH3,CH2(CH2(CH3)))), with each atom label encoded as a vector: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1]))))]
Ranking vs. Preference
Ranking: impose a total order on all the alternatives.
Preference: single out the best alternative in a set.

[Figure: five numbered items arranged by rank]
Preference on Structured Data
Classification, Regression and Ranking
Supervised learning task f : X → Y

Classification: Y is a finite unordered set
Regression: Y is a metric space (the reals)
Ranking and Preference: Y is a finite ordered set in a non-metric space

The target space is therefore unordered for classification, metric for regression, and finite, ordered and non-metric for ranking and preference.
Learning on Structured Data
Learning algorithms on discrete structures often derive from vector-based methods.
Both Kernel Machines and RNNs are suitable for learning on structured domains.

[Figure: structured data is mapped to a vector representation and then fed to conventional learning algorithms]
Kernels vs. RNNs
Kernel Machines:
- Very high-dimensional feature space
- How to choose the kernel? Prior knowledge, fixed representation
- Minimize a convex functional (SVM)

Recursive Neural Networks:
- Low-dimensional space
- Task-driven: the representation depends on the specific learning task
- Learn an implicit encoding of the relevant information
- Problem of local minima
A Kernel for Labeled Trees
Feature space: the set of all tree fragments (subtrees), with the only constraint that a node cannot be separated from its children.

Φn(t) = # occurrences of tree fragment n in t

A tree is represented as a bag of fragments:
Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …]

K(t,s) = Φ(t)·Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001)
[Figure: a small labeled tree over the symbols A, B, C and the tree fragments that contribute to its feature map Φ]
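The dynamic-programming evaluation of K(t,s) can be sketched in a few lines. This is an illustrative Python rendering of the Collins & Duffy recursion, with trees as nested `(label, children…)` tuples; all function names are made up for the sketch:

```python
from functools import cache

def is_leaf(n):
    return len(n) == 1

def same_production(n1, n2):
    # Same node label and same ordered sequence of child labels
    return (n1[0] == n2[0] and len(n1) == len(n2)
            and all(a[0] == b[0] for a, b in zip(n1[1:], n2[1:])))

@cache
def C(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    if is_leaf(n1) or is_leaf(n2) or not same_production(n1, n2):
        return 0
    result = 1
    for a, b in zip(n1[1:], n2[1:]):
        result *= 1 + C(a, b)   # each child fragment may be included or not
    return result

def internal_nodes(t):
    if not is_leaf(t):
        yield t
        for c in t[1:]:
            yield from internal_nodes(c)

def tree_kernel(t, s):
    """K(t, s) = sum of C(n1, n2) over all node pairs."""
    return sum(C(a, b) for a in internal_nodes(t) for b in internal_nodes(s))
```

For example, `tree_kernel(t, t)` for the tree `('NP', ('DT', ('the',)), ('NN', ('dog',)))` counts its 6 fragments; memoizing `C` is what makes the evaluation dynamic programming rather than exponential recursion.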
Recursive Neural Networks
Composition of two adaptive functions:
- φ, the transition function
- o, the output function

Both φ and o are implemented by feedforward neural networks.
Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data.
[Figure: a labeled tree is processed bottom-up by the transition function φw : X → Rⁿ, and the root representation is mapped into the output space by ow′ : Rⁿ → O]
Recursive Neural Networks
[Figure: a labeled tree (nodes A–E) is unfolded into a network; the prediction phase computes the output through the output network, and the error-correction phase backpropagates through the unfolded structure]
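The bottom-up unfolding can be sketched as follows. This is a minimal numpy illustration assuming binary trees, one-hot node labels, and tanh transition units; all weights, dimensions, and names are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
N_LABELS, STATE = 5, 8                     # label alphabet size, state dimension n
W_label = rng.normal(0, 0.1, (STATE, N_LABELS))
W_left = rng.normal(0, 0.1, (STATE, STATE))
W_right = rng.normal(0, 0.1, (STATE, STATE))
w_out = rng.normal(0, 0.1, STATE)

def phi(tree):
    """Transition function: map a (label, left, right) tree to a state in R^n,
    recursively combining the states of the children (zero state for absent ones)."""
    label, left, right = tree
    x = np.zeros(N_LABELS)
    x[label] = 1.0                         # one-hot encoding of the node label
    s_left = phi(left) if left is not None else np.zeros(STATE)
    s_right = phi(right) if right is not None else np.zeros(STATE)
    return np.tanh(W_label @ x + W_left @ s_left + W_right @ s_right)

def utility(tree):
    """Output function o: map the root state to a scalar utility."""
    return float(w_out @ phi(tree))
```

The unfolded network shares the same transition weights at every node, so the "unfolding" is just this recursion; training backpropagates the output error through the recursive calls.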
Preference Models
Kernel Preference Model: binary classification of pairwise differences between instances.
RNN Preference Model: a probabilistic model to find the best alternative.
Both models use a utility function to evaluate the importance of an element.
Utility Function Approach
Model the importance of an object with a utility function U : X → R, so that x > z ↔ U(x) > U(z).
If U is linear: U(x) > U(z) ↔ wᵀx > wᵀz.
U can also be modelled by a neural network.
Ranking and preference problems: learn U, then sort by U(x).

[Figure: two objects with utilities U(z) = 3 and U(x) = 11, so x is preferred]
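The "learn U, then sort" recipe amounts to a one-liner once U is available. A minimal sketch with an illustrative linear utility (the weight vector and feature vectors below are made up):

```python
import numpy as np

w = np.array([1.0, 2.0])                   # learned utility weights (illustrative)
items = [np.array([3.0, 4.0]),             # U = 11
         np.array([1.0, 1.0]),             # U = 3
         np.array([0.0, 2.0])]             # U = 4

# Sort the alternatives by decreasing utility U(x) = w.T @ x
ranked = sorted(items, key=lambda x: float(w @ x), reverse=True)
# The item with utility 11 comes first, then 4, then 3.
```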
Kernel Preference Model
Let x1 be the best of (x1, …, xr).
Create a set of pairs between x1 and each of x2, …, xr.
If U is linear, this yields the constraints
U(x1) > U(xj) ↔ wᵀx1 > wᵀxj ↔ wᵀ(x1 − xj) > 0 for j = 2, …, r,
so each difference x1 − xj can be seen as a positive example.
Binary classification of differences between instances: with x → Φ(x), the process is easily kernelized.
Note: this model does not take all the alternatives into consideration together, but only two at a time.
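The construction of the pairwise training set can be sketched as follows. This is a minimal illustration assuming the feature maps Φ(xi) are available as plain vectors; `preference_to_pairs` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def preference_to_pairs(phis):
    """phis[0] is the feature vector Φ(x1) of the correct alternative.
    Returns the positive examples Φ(x1) - Φ(xj) for j = 2..r."""
    best = phis[0]
    return [best - other for other in phis[1:]]

# A linear utility w then just has to satisfy w @ d > 0 for every difference d.
```

In the kernelized version these differences are never formed explicitly; the learner only evaluates K on pairs of trees.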
RNNs Preference Model
Set of alternatives (x1, x2, …, xr).
U is modelled by a recursive neural network architecture.
Compute U(xi) = o(φ(xi)) for i = 1, …, r.

Softmax function:
oi = e^U(xi) / Σ_{j=1}^{r} e^U(xj)

The error (yi − oi) is backpropagated through the whole network.
Note: the softmax function compares all the alternatives together at once.
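The output layer described above can be sketched directly, assuming the utilities U(xi) have already been computed by the recursive network (the values below are illustrative):

```python
import numpy as np

def softmax_preference(utilities):
    """Softmax over the utilities of all r alternatives, compared jointly."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())        # subtract the max for numerical stability
    return e / e.sum()

o = softmax_preference([2.0, 1.0, 0.5])
# With the target y = [1, 0, 0] (alternative 1 is the correct one), the error
# signal backpropagated through the whole network is simply y - o.
y = np.array([1.0, 0.0, 0.0])
delta = y - o
```

Because every oi depends on all r utilities through the normalization, the gradient couples the alternatives, unlike the pairwise kernel model.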
Learning Problems
First Pass Attachment: modelling of a psycholinguistic phenomenon.
Reranking Task: reranking the parse trees output by a statistical parser.
First Pass Attachment (FPA)
The grammar introduces ambiguities: there is a set of alternatives for each word, but only one is correct.
First pass attachment can therefore be modelled as a preference problem.
[Figure: incremental parse of "It has no bearing …" (PRP VBZ DT NN …); when the next word "on" arrives, the grammar licenses several attachment alternatives (NP, PP, ADVP, ADJP, QP, SBAR, PRN, …), only one of which is correct]
Heuristics for Prediction Enhancement

Specializing the FPA prediction for each class of words: group the words into 10 classes (verbs, articles, …) and learn a different classifier for each class.
Tree reduction: removing the nodes of the parse tree that are not important for choosing between the different alternatives.

Evaluation measure = (# correct trees ranked in first position) / (total number of sets)
Experimental Setup
Wall Street Journal (WSJ) section of the Penn Treebank, a realistic corpus of natural language:
- 40,000 sentences, 1 million words; average sentence length: 25 words
- Standard benchmark in computational linguistics
- Training on sections 2–21, test on section 23, validation on section 24
Voted Perceptron (VP)
FPA on the WSJ yields 100 million trees for training, so the Voted Perceptron is used instead of an SVM (Freund & Schapire, Machine Learning 1999).
- An online algorithm for binary classification based on the perceptron algorithm (simple and efficient)
- Prediction value: a weighted sum over all training weight vectors
- Performance comparable to maximal-margin classifiers (SVM)
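The algorithm can be sketched as follows. This is a minimal vectorial illustration of the Voted Perceptron; in the paper the dot products are replaced by the tree kernel, and all names here are made up for the sketch:

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=10):
    """X: (m, d) examples, y: labels in {-1, +1}.
    Returns the list of (weight vector, survival count) pairs."""
    w = np.zeros(X.shape[1])
    survivors = []
    c = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:         # mistake: retire w, start a new vector
                survivors.append((w.copy(), c))
                w = w + label * x
                c = 1
            else:
                c += 1                       # w survives one more example
    survivors.append((w.copy(), c))
    return survivors

def predict(survivors, x):
    # Weighted vote of all training weight vectors, weighted by survival count
    vote = sum(c * np.sign(w @ x) for w, c in survivors)
    return 1 if vote >= 0 else -1
```

Each weight vector votes with a weight equal to the number of examples it survived, which is what gives the algorithm its margin-like behaviour at negligible extra cost over the plain perceptron.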
Kernel VP vs. RNNs

[Figure: learning curves (accuracy vs. training-set size, 100 to 40,000 sentences) comparing VP and RNN on the four most frequent word classes: Noun (33%), Verb (13.4%), Preposition (12.6%), Article (12.5%)]
Kernel VP vs. RNNs

[Figure: the same learning curves for the remaining word classes: Punctuation (11.7%), Adjective (7.5%), Adverb (4.3%), Conjunction (2.3%)]
Kernel VP vs. RNNs: Modularization

[Figure: overall learning curve (accuracy vs. training-set size, 100 to 40,000 sentences) for VP and RNN with modularization]
Small Datasets, No Modularization

[Figure: accuracy of VP and RNN on 5 independent splits of 100 sentences; VP average 75.4, RNN average 77]
Complexity Comparison
VP does not scale linearly with the number of training examples, as the RNNs do.

Computational cost:
- Small datasets (5 splits of 100 sentences, about a week @ 2 GHz CPU): CPU(VP) ≈ CPU(RNN)
- Large datasets (all 40,000 sentences): VP took over 2 months to complete one epoch @ 2 GHz CPU, while the RNN learns in 1–2 epochs, about 3 days @ 2 GHz CPU
- VP is smooth with respect to the training iterations
Reranking Task
Reranking problem: rerank the parse trees generated by a statistical parser.
Same problem setting as FPA (preference over forests), but with 1 forest per sentence instead of 1 forest per word, so less computational cost is involved.
Evaluation: Parseval Measures
Standard evaluation measures:
- Labeled Precision (LP)
- Labeled Recall (LR)
- Crossing Brackets (CBs)

They compare the parse produced by the parser with a hand-annotated parse of the sentence.
Reranking Task
≤ 40 Words (2245 sentences):
Model  LR    LP    CBs   0 CBs  ≤2 CBs
VP     89.1  89.4  0.85  69.3   88.2
RNN    89.2  89.5  0.84  67.9   88.4

≤ 100 Words (2416 sentences):
Model  LR    LP    CBs   0 CBs  ≤2 CBs
VP     88.6  88.9  0.99  66.5   86.3
RNN    88.6  88.9  0.98  64.8   86.3
Why Do RNNs Outperform Kernel VP?

Hypothesis 1: the kernel function induces a feature space that is not focused on the specific learning task.
Hypothesis 2: the kernel preference model is worse than the RNN preference model.
Linear VP on RNN Representation
Checking Hypothesis 1: train VP on the RNN representation.
- The tree kernel is replaced by a linear kernel.
- The state-vector representation of the parse trees generated by the RNN is used as input to VP.
- Linear VP is trained on the RNN state vectors.
Linear VP on RNN Representation
[Figure: accuracy on 5 independent splits of 100 sentences; VP average 75.4, RNN average 77, VP on the RNN state vectors average 74.7]
Conclusions
RNNs show better generalization properties, also on small datasets, and at a smaller computational cost.
The problem is neither the kernel function nor the VP algorithm: the linear-VP-on-RNN-representation experiment rules both out.
The problem is the preference model: the kernel preference model does not take all the alternatives into consideration together, but only two at a time, as opposed to the RNN model.
Acknowledgements
Thanks to:
Alessio Ceroni Alessandro Vullo
Andrea Passerini Giovanni Soda