Learning Non-Linear Functions for Text Classification
Cohan Sujay Carlos¹, Geetanjali Rakshit²
¹Aiaioo Labs, Bangalore, India
²IIT-B Monash Research Academy, Mumbai, India
December 20, 2016
Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 1/40
Outline
1 Background
2 3-Layer Bayesian Models: Product of Sums; Sum of Products
3 Linearity of “Product of Sums”
4 Non-Linearity of “Sum of Products”
5 Experimental Results
6 Conclusions
XOR
[Figure: the XOR problem. Points (0,0) and (1,1) belong to one class, (0,1) and (1,0) to the other; no single straight line separates the two classes.]
Motivation
Multilayer neural networks have been shown to be capable of learning non-linear boundaries.
Questions:
Is there something special about neural networks?
Can probabilistic classifiers also learn non-linear boundaries?
In particular, can multilayer Bayesian models do so, and under what conditions?
Bayesian Models = Directed PGMs
[Figure: a directed graphical model over nodes x1, x2, x3, x4.]

The joint probabilities factorise according to Equation 1.

P(x_1, x_2, ..., x_n) = ∏_{1 ≤ i ≤ n} P(x_i | x_{a(i)})   (1)

where a(i) means the ancestors of node i.
Naive Bayes
[Figure: the naive Bayes model, with class node C and feature nodes F.]

The naive Bayes classifier is a special case two layers deep.

P(F, C) = P(F|C) P(C) = (∏_{f ∈ F} P(f|C)) P(C)   (2)

The root node represents a class label from the set C = {c1, c2, ..., ck} and the leaf nodes the set of features F = {f1, f2, ..., fi}.
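Equation 2 translates directly into code. Below is a minimal sketch of a multinomial naive Bayes classifier; the class names and documents are invented for illustration, and log-probabilities are used to avoid underflow:

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Estimate P(C) and the multinomial P(f|C) from tokenised documents."""
    priors, likelihoods = {}, {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        likelihoods[c] = {f: n / total for f, n in counts.items()}
    return priors, likelihoods

def log_joint(doc, c, priors, likelihoods):
    """log P(F, C) = log P(C) + sum over f of log P(f|C)  (Equation 2)."""
    return math.log(priors[c]) + sum(
        math.log(likelihoods[c].get(f, 1e-12)) for f in doc)

docs = {
    "politics": [["the", "united", "nations"], ["the", "united", "states"]],
    "sports": [["the", "world", "cup"]],
}
priors, lik = train_nb(docs)
# "united" only ever occurs under "politics", so that class should win.
print(log_joint(["the", "united"], "politics", priors, lik) >
      log_joint(["the", "united"], "sports", priors, lik))  # True
```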
3-Layer Bayesian Models
[Figure: a 3-layer model with class node C, a hidden layer H, and feature nodes F.]

In a 3-layer Bayesian model, there is also a set of hidden nodes H = {h1, h2, ..., hj}.

P(F, H, C) = P(F|H, C) P(H|C) P(C)   (3)
3-Layer Bayesian Models
[Figure: the same 3-layer model with a single class value c and hidden value h.]

From P(F, h, c) = P(F|h, c) P(h|c) P(c) we obtain Equations 4 and 5.

P(F, c) = Σ_h P(F|h, c) P(h|c) P(c)   (4)

P(F|c) = Σ_h P(F|h, c) P(h|c)   (5)
3-Layer Bayesian Models

[Figure: a single feature node f1 with hidden nodes h1 and h2 and class node c1.]

Starting from f instead of F, we get Equation 8.

P(f, h, c) = P(f|h, c) P(h|c) P(c)   (6)

P(f, c) = Σ_h P(f|h, c) P(h|c) P(c)   (7)

P(f|c) = Σ_h P(f|h, c) P(h|c)   (8)
Product of Sums
You can derive the likelihood as follows.

P(F|c) = ∏_f P(f|c)   (9)

Substituting Equation 8 into Equation 9, we get Equation 10 (restated later as Equation 13).

P(F|c) = ∏_f Σ_h P(f|h, c) P(h|c)   (10)
Sum of Products
You can also derive the likelihood as follows.

P(F|h, c) = ∏_f P(f|h, c)   (11)

Substituting Equation 11 into Equation 5, we get Equation 12 (restated later as Equation 14).

P(F|c) = Σ_h (∏_f P(f|h, c)) P(h|c)   (12)
Two Likelihood Equations
Product of Sums

P(F|c) = ∏_f Σ_h P(f|h, c) P(h|c)   (13)

Sum of Products

P(F|c) = Σ_h (∏_f P(f|h, c)) P(h|c)   (14)
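The two factorisations are not interchangeable: plugging the same conditional probability tables into Equations 13 and 14 generally yields different likelihoods. A small numerical sketch, with table values invented purely for illustration:

```python
import math

p_h = {"h1": 0.5, "h2": 0.5}   # P(h|c) for a single class c
p_f_given_h = {                 # P(f|h, c)
    ("f1", "h1"): 0.9, ("f1", "h2"): 0.1,
    ("f2", "h1"): 0.1, ("f2", "h2"): 0.9,
}
features = ["f1", "f2"]

# Equation 13: a product over f of a sum over h.
product_of_sums = math.prod(
    sum(p_f_given_h[(f, h)] * p_h[h] for h in p_h) for f in features)

# Equation 14: a sum over h of a product over f, weighted by P(h|c).
sum_of_products = sum(
    p_h[h] * math.prod(p_f_given_h[(f, h)] for f in features) for h in p_h)

print(product_of_sums, sum_of_products)  # 0.25 vs. 0.09
```

The sum-of-products form scores feature combinations that co-occur under the same hidden value differently from those that do not, which is the source of the non-linearity discussed next.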
Proof of Linearity
If a classifier can be proved equivalent to a multinomial naive Bayes classifier, then it can only learn linear decision boundaries.
Proof of Linearity
Consider a toy dataset consisting of two sentences, both belonging to the class Politics (P):
The United States and (TUSa)
The United Nations (TUN)
Proof of Linearity

One of the parameters a naive Bayes classifier would learn is P(The|Politics) = z.

[Figure: naive Bayes model with class P emitting token T; here z = 2/7.]

If a directed PGM learns the same likelihoods, which are products of these parameters, it can only learn linear decision boundaries.
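The value z = 2/7 is just a token count over the two sentences; a quick check in exact arithmetic:

```python
from fractions import Fraction
from collections import Counter

tokens = "The United States and".split() + "The United Nations".split()
counts = Counter(tokens)
z = Fraction(counts["The"], len(tokens))  # P(The|Politics) under naive Bayes
print(z)  # 2/7
```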
Proof of Linearity - Hard EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: a 3-layer model over class P, hidden nodes h1 and h2, and token T; this hard-EM solution assigns TUSa to h1 and TUN to h2.]

a = P(T|h1) = 1/4, b = P(T|h2) = 1/3, m = P(h1|P) = 4/7, n = P(h2|P) = 3/7

From Equation 13, z = am + bn = (1/4)(4/7) + (1/3)(3/7) = 2/7
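The arithmetic on this slide checks out in exact fractions:

```python
from fractions import Fraction as F

a, b = F(1, 4), F(1, 3)  # P(T|h1), P(T|h2) under this hard-EM assignment
m, n = F(4, 7), F(3, 7)  # P(h1|P), P(h2|P)
z = a * m + b * n        # Equation 13 for the single token T
print(z)  # 2/7
```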
Proof of Linearity - Hard EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: the same model; this hard-EM solution assigns both TUSa and TUN to h2.]

a = P(T|h1) = 0/0 (undefined: h1 receives no counts), b = P(T|h2) = 2/7, m = P(h1|P) = 0/7, n = P(h2|P) = 7/7

From Equation 13, z = am + bn = (2/7)(7/7) = 2/7 (the h1 term vanishes because m = 0/7 = 0)
Proof of Linearity - Soft EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: the same model; soft EM assigns TUSa with weight 4/5 to h1 and 1/5 to h2, and TUN with weight 1/5 to h1 and 4/5 to h2.]

a = P(T|h1) = (4/5 + 1/5)/3.8 = 1/3.8
b = P(T|h2) = (1/5 + 4/5)/3.2 = 1/3.2
m = P(h1|P) = (4/5)(4/7) + (1/5)(3/7) = 3.8/7
n = P(h2|P) = (1/5)(4/7) + (4/5)(3/7) = 3.2/7

From Equation 13, z = am + bn = (1/3.8)(3.8/7) + (1/3.2)(3.2/7) = 2/7
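The soft-EM parameters collapse to the same naive Bayes value; a floating-point check:

```python
import math

a, b = 1 / 3.8, 1 / 3.2  # P(T|h1), P(T|h2) under the soft assignment
m, n = 3.8 / 7, 3.2 / 7  # P(h1|P), P(h2|P)
z = a * m + b * n        # Equation 13
print(math.isclose(z, 2 / 7))  # True
```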
Shifted and Scaled XOR
[Figure: the XOR pattern shifted and scaled. The two classes occupy opposite diagonal corners of the square spanning (1,1) to (5,5): (1,5) and (5,1) form one class, and the corners on the main diagonal form the other.]
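To see how Equation 14 can carve out an XOR-shaped boundary, here is a sketch of a Gaussian sum-of-products classifier. The mixture components are placed by hand at the four corners rather than learned by EM, purely to illustrate the decision rule; all parameter values below are assumptions:

```python
import math

def gauss(x, mu, sigma=1.0):
    """Univariate Gaussian density, used as P(f|h, c)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

def likelihood(point, components):
    """Equation 14: P(F|c) = sum over h of P(h|c) * prod over f of P(f|h, c)."""
    return sum(
        weight * math.prod(gauss(x, mu) for x, mu in zip(point, means))
        for weight, means in components)

# One two-component mixture per class, at opposite diagonal corners.
diag = [(0.5, (1, 1)), (0.5, (5, 5))]   # class on the main diagonal
anti = [(0.5, (1, 5)), (0.5, (5, 1))]   # class on the anti-diagonal

for p in [(1, 1), (5, 5), (1, 5), (5, 1)]:
    label = "diag" if likelihood(p, diag) > likelihood(p, anti) else "anti"
    print(p, label)
```

Each hidden value captures one corner, so the product over features inside the sum can score the two diagonals differently; no product-of-sums model over the same features can do this.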
Sum of Products - Multinomial
Sum of Products - Gaussian
Accuracy on Reuters & 20 Ng
Classifier                  R8     R52    20Ng
Naive Bayes                 0.955  0.916  0.797
MaxEnt                      0.953  0.866  0.793
SVM                         0.946  0.864  0.711
Non-linear 3-Layer Bayes    0.964  0.921  0.808
Table: Accuracy on Reuters R8, R52 and 20 Newsgroups datasets.
Accuracy on Movie Reviews
Classifier                  Movie Reviews
Naive Bayes                 0.7964 ± 0.0136
MaxEnt                      0.8355 ± 0.0149
SVM                         0.7830 ± 0.0128
Non-linear 3-Layer Bayes    0.8438 ± 0.0110
Table: Accuracy on Large Movie Review dataset.
Accuracy on Movie Reviews
[Figure: accuracy vs. number of training reviews (in thousands, 5 to 35) for Naive Bayes, Non-Linear Hierarchical Bayes, MaxEnt, and SVM; the accuracy axis spans roughly 0.75 to 0.90.]
Accuracy on Movie Reviews
[Figure: accuracy vs. number of training reviews (in thousands, 5 to 35) for models with 1 to 5 hidden nodes per class; the accuracy axis spans roughly 0.80 to 0.90.]
On the Query Classification Dataset
Classifier                  Accuracy
Naive Bayes                 0.721 ± 0.018
MaxEnt                      0.667 ± 0.028
SVM                         0.735 ± 0.032
Non-Linear 3-Layer Bayes    0.711 ± 0.032
Table: Accuracy on query classification.
Accuracy on Artificial Dataset
Accuracy on Artificial Dataset
Shape  GNB    GHB
ring   0.664  0.949
dots   0.527  0.926
XOR    0.560  0.985
S      0.770  0.973
Table: Accuracy on the artificial dataset.
Conclusions
We have shown that generative classifiers with a hidden layer are capable of learning non-linear decision boundaries under the right conditions (independence assumptions), and therefore might be said to be capable of deep learning.
We have also shown experimentally that multinomial non-linear 3-layer Bayesian classifiers can outperform some linear classification algorithms on some text classification tasks.
Errata
In our survey of prior work ...
Hierarchical Bayesian models are directed probabilistic graphical models in which the directed acyclic graph is a tree (each node that is not the root node has only one immediate parent). In these models the ancestors a(i) of any node i do not contain any siblings (every pair of ancestors is also in an ancestor-descendant relationship).
The above statement, which appears in our paper, is incorrect. Hierarchical Bayesian models, including ours, are, generally speaking, DAGs.
Future Work
Comparison of performance with deep neural networks; evaluation on image processing tasks; other training algorithms: Gibbs sampling (MCMC), variational methods.
Going from 3 layers to higher numbers of layers (solving the "vanishing information" problem).
Integration with larger models, like sequential models (HMMs) and translation models (alignment models or sequence-to-sequence models).
Probabilistic embeddings.
A unified model covering Bayesian models and neural networks.
Thank you!
Cohan Sujay Carlos & Geetanjali Rakshit
[email protected]