Learning Non-Linear Functions for Text Classification
Cohan Sujay Carlos¹, Geetanjali Rakshit²
¹Aiaioo Labs, Bangalore, India
²IIT-B Monash Research Academy, Mumbai, India
December 20, 2016
Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 1/40
Outline
1 Background
2 3-Layer Bayesian Models: Product of Sums; Sum of Products
3 Linearity of “Product of Sums”
4 Non-Linearity of “Sum of Products”
5 Experimental Results
6 Conclusions
XOR
[Figure: the XOR problem. Points (0,0) and (1,1) belong to one class, (0,1) and (1,0) to the other; no single straight line separates the two classes.]
Motivation
Multilayer neural networks have been shown to be capable of learning non-linear boundaries.
Questions:
Is there something special about neural networks?
Can probabilistic classifiers also learn non-linear boundaries?
In particular, can multilayer Bayesian models do so, and under what conditions?
Bayesian Models = Directed PGMs
[Figure: a directed graphical model over nodes x1, x2, x3, x4.]

The joint probabilities factorise according to Equation 1.

P(x_1, x_2, ..., x_n) = ∏_{1 ≤ i ≤ n} P(x_i | x_{a(i)})   (1)

where a(i) means the ancestors of node i.
Naive Bayes
[Figure: the naive Bayes model, with class node C and feature nodes F.]

The naive Bayes classifier is a special case two layers deep.

P(F, C) = P(F|C) P(C) = (∏_{f ∈ F} P(f|C)) P(C)   (2)

The root node represents a class label from the set C = {c1, c2, ..., ck} and the leaf nodes the set of features F = {f1, f2, ..., fi}.
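Equation 2 translates directly into code. Below is a minimal sketch of a multinomial naive Bayes classifier; the class names and documents are invented for illustration, and log-probabilities are used to avoid underflow:

```python
import math
from collections import Counter

def train_nb(docs_by_class):
    """Estimate P(C) and the multinomial P(f|C) from tokenised documents."""
    priors, likelihoods = {}, {}
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    for c, docs in docs_by_class.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(tok for doc in docs for tok in doc)
        total = sum(counts.values())
        likelihoods[c] = {f: n / total for f, n in counts.items()}
    return priors, likelihoods

def log_joint(doc, c, priors, likelihoods):
    """log P(F, C) = log P(C) + sum over f of log P(f|C)  (Equation 2)."""
    return math.log(priors[c]) + sum(
        math.log(likelihoods[c].get(f, 1e-12)) for f in doc)

docs = {
    "politics": [["the", "united", "nations"], ["the", "united", "states"]],
    "sports": [["the", "world", "cup"]],
}
priors, lik = train_nb(docs)
# "united" only ever occurs under "politics", so that class should win.
print(log_joint(["the", "united"], "politics", priors, lik) >
      log_joint(["the", "united"], "sports", priors, lik))  # True
```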
3-Layer Bayesian Models
[Figure: a 3-layer model with class node C, a hidden layer H, and feature nodes F.]

In a 3-layer Bayesian model, there is also a set of hidden nodes H = {h1, h2, ..., hj}.

P(F, H, C) = P(F|H, C) P(H|C) P(C)   (3)
3-Layer Bayesian Models
[Figure: the same 3-layer model with a single class value c and hidden value h.]

From P(F, h, c) = P(F|h, c) P(h|c) P(c) we obtain Equations 4 and 5.

P(F, c) = Σ_h P(F|h, c) P(h|c) P(c)   (4)

P(F|c) = Σ_h P(F|h, c) P(h|c)   (5)
3-Layer Bayesian Models

[Figure: a single feature node f1 with hidden nodes h1 and h2 and class node c1.]

Starting from f instead of F, we get Equation 8.

P(f, h, c) = P(f|h, c) P(h|c) P(c)   (6)

P(f, c) = Σ_h P(f|h, c) P(h|c) P(c)   (7)

P(f|c) = Σ_h P(f|h, c) P(h|c)   (8)
Product of Sums
You can derive the likelihood as follows.

P(F|c) = ∏_f P(f|c)   (9)

Substituting Equation 8 into Equation 9, we get Equation 10 (restated later as Equation 13).

P(F|c) = ∏_f Σ_h P(f|h, c) P(h|c)   (10)
Sum of Products
You can also derive the likelihood as follows.

P(F|h, c) = ∏_f P(f|h, c)   (11)

Substituting Equation 11 into Equation 5, we get Equation 12 (restated later as Equation 14).

P(F|c) = Σ_h (∏_f P(f|h, c)) P(h|c)   (12)
Two Likelihood Equations
Product of Sums

P(F|c) = ∏_f Σ_h P(f|h, c) P(h|c)   (13)

Sum of Products

P(F|c) = Σ_h (∏_f P(f|h, c)) P(h|c)   (14)
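The two factorisations are not interchangeable: plugging the same conditional probability tables into Equations 13 and 14 generally yields different likelihoods. A small numerical sketch, with table values invented purely for illustration:

```python
import math

p_h = {"h1": 0.5, "h2": 0.5}   # P(h|c) for a single class c
p_f_given_h = {                 # P(f|h, c)
    ("f1", "h1"): 0.9, ("f1", "h2"): 0.1,
    ("f2", "h1"): 0.1, ("f2", "h2"): 0.9,
}
features = ["f1", "f2"]

# Equation 13: a product over f of a sum over h.
product_of_sums = math.prod(
    sum(p_f_given_h[(f, h)] * p_h[h] for h in p_h) for f in features)

# Equation 14: a sum over h of a product over f, weighted by P(h|c).
sum_of_products = sum(
    p_h[h] * math.prod(p_f_given_h[(f, h)] for f in features) for h in p_h)

print(product_of_sums, sum_of_products)  # 0.25 vs. 0.09
```

The sum-of-products form scores feature combinations that co-occur under the same hidden value differently from those that do not, which is the source of the non-linearity discussed next.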
Proof of Linearity
If a classifier can be proved equivalent to a multinomial naive Bayes classifier, then it can only learn linear decision boundaries.
Proof of Linearity
Consider a toy dataset consisting of two sentences, both belonging to the class Politics (P):
The United States and (TUSa)
The United Nations (TUN)
Proof of Linearity

One of the parameters a naive Bayes classifier would learn is P(The|Politics) = z.

[Figure: naive Bayes model with class P emitting token T; here z = 2/7.]

If a directed PGM learns the same likelihoods, which are products of these parameters, it can only learn linear decision boundaries.
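The value z = 2/7 is just a token count over the two sentences; a quick check in exact arithmetic:

```python
from fractions import Fraction
from collections import Counter

tokens = "The United States and".split() + "The United Nations".split()
counts = Counter(tokens)
z = Fraction(counts["The"], len(tokens))  # P(The|Politics) under naive Bayes
print(z)  # 2/7
```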
Proof of Linearity - Hard EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: a 3-layer model over class P, hidden nodes h1 and h2, and token T; this hard-EM solution assigns TUSa to h1 and TUN to h2.]

a = P(T|h1) = 1/4, b = P(T|h2) = 1/3, m = P(h1|P) = 4/7, n = P(h2|P) = 3/7

From Equation 13, z = am + bn = (1/4)(4/7) + (1/3)(3/7) = 2/7
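The arithmetic on this slide checks out in exact fractions:

```python
from fractions import Fraction as F

a, b = F(1, 4), F(1, 3)  # P(T|h1), P(T|h2) under this hard-EM assignment
m, n = F(4, 7), F(3, 7)  # P(h1|P), P(h2|P)
z = a * m + b * n        # Equation 13 for the single token T
print(z)  # 2/7
```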
Proof of Linearity - Hard EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: the same model; this hard-EM solution assigns both TUSa and TUN to h2.]

a = P(T|h1) = 0/0 (undefined: h1 receives no counts), b = P(T|h2) = 2/7, m = P(h1|P) = 0/7, n = P(h2|P) = 7/7

From Equation 13, z = am + bn = (2/7)(7/7) = 2/7 (the h1 term vanishes because m = 0/7 = 0)
Proof of Linearity - Soft EM
The parameters of a multinomial 3-layer Bayesian classifier:
[Figure: the same model; soft EM assigns TUSa with weight 4/5 to h1 and 1/5 to h2, and TUN with weight 1/5 to h1 and 4/5 to h2.]

a = P(T|h1) = (4/5 + 1/5)/3.8 = 1/3.8
b = P(T|h2) = (1/5 + 4/5)/3.2 = 1/3.2
m = P(h1|P) = (4/5)(4/7) + (1/5)(3/7) = 3.8/7
n = P(h2|P) = (1/5)(4/7) + (4/5)(3/7) = 3.2/7

From Equation 13, z = am + bn = (1/3.8)(3.8/7) + (1/3.2)(3.2/7) = 2/7
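The soft-EM parameters collapse to the same naive Bayes value; a floating-point check:

```python
import math

a, b = 1 / 3.8, 1 / 3.2  # P(T|h1), P(T|h2) under the soft assignment
m, n = 3.8 / 7, 3.2 / 7  # P(h1|P), P(h2|P)
z = a * m + b * n        # Equation 13
print(math.isclose(z, 2 / 7))  # True
```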
Shifted and Scaled XOR
[Figure: the XOR pattern shifted and scaled. The two classes occupy opposite diagonal corners of the square spanning (1,1) to (5,5): (1,5) and (5,1) form one class, and the corners on the main diagonal form the other.]
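To see how Equation 14 can carve out an XOR-shaped boundary, here is a sketch of a Gaussian sum-of-products classifier. The mixture components are placed by hand at the four corners rather than learned by EM, purely to illustrate the decision rule; all parameter values below are assumptions:

```python
import math

def gauss(x, mu, sigma=1.0):
    """Univariate Gaussian density, used as P(f|h, c)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

def likelihood(point, components):
    """Equation 14: P(F|c) = sum over h of P(h|c) * prod over f of P(f|h, c)."""
    return sum(
        weight * math.prod(gauss(x, mu) for x, mu in zip(point, means))
        for weight, means in components)

# One two-component mixture per class, at opposite diagonal corners.
diag = [(0.5, (1, 1)), (0.5, (5, 5))]   # class on the main diagonal
anti = [(0.5, (1, 5)), (0.5, (5, 1))]   # class on the anti-diagonal

for p in [(1, 1), (5, 5), (1, 5), (5, 1)]:
    label = "diag" if likelihood(p, diag) > likelihood(p, anti) else "anti"
    print(p, label)
```

Each hidden value captures one corner, so the product over features inside the sum can score the two diagonals differently; no product-of-sums model over the same features can do this.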
Sum of Products - Multinomial
Sum of Products - Gaussian
Accuracy on Reuters & 20 Ng
Classifier                  R8     R52    20Ng
Naive Bayes                 0.955  0.916  0.797
MaxEnt                      0.953  0.866  0.793
SVM                         0.946  0.864  0.711
Non-linear 3-Layer Bayes    0.964  0.921  0.808
Table: Accuracy on Reuters R8, R52 and 20 Newsgroups datasets.
Accuracy on Movie Reviews
Classifier                  Movie Reviews
Naive Bayes                 0.7964 ± 0.0136
MaxEnt                      0.8355 ± 0.0149
SVM                         0.7830 ± 0.0128
Non-linear 3-Layer Bayes    0.8438 ± 0.0110
Table: Accuracy on Large Movie Review dataset.
Accuracy on Movie Reviews
[Figure: accuracy vs. number of training reviews (in thousands, 5 to 35) for Naive Bayes, Non-Linear Hierarchical Bayes, MaxEnt, and SVM; the accuracy axis spans roughly 0.75 to 0.90.]
Accuracy on Movie Reviews
[Figure: accuracy vs. number of training reviews (in thousands, 5 to 35) for models with 1 to 5 hidden nodes per class; the accuracy axis spans roughly 0.80 to 0.90.]
On the Query Classification Dataset
Classifier                  Accuracy
Naive Bayes                 0.721 ± 0.018
MaxEnt                      0.667 ± 0.028
SVM                         0.735 ± 0.032
Non-Linear 3-Layer Bayes    0.711 ± 0.032
Table: Accuracy on query classification.
Accuracy on Artificial Dataset
Accuracy on Artificial Dataset
Shape  GNB    GHB
ring   0.664  0.949
dots   0.527  0.926
XOR    0.560  0.985
S      0.770  0.973
Table: Accuracy on the artificial dataset.
Conclusions
We have shown that generative classifiers with a hidden layer are capable of learning non-linear decision boundaries under the right conditions (independence assumptions), and therefore might be said to be capable of deep learning.
We have also shown experimentally that multinomial non-linear 3-layer Bayesian classifiers can outperform some linear classification algorithms on some text classification tasks.
Errata
In our survey of prior work ...
Hierarchical Bayesian models are directed probabilistic graphical models in which the directed acyclic graph is a tree (each node that is not the root node has only one immediate parent). In these models the ancestors a(i) of any node i do not contain any siblings (every pair of ancestors is also in an ancestor-descendant relationship).
The above statement, which appears in our paper, is incorrect. Hierarchical Bayesian models, including ours, are, generally speaking, DAGs.
Future Work
Comparison of performance with deep neural networks; evaluation on image processing tasks; other training algorithms: Gibbs sampling (MCMC), variational methods.
Going from 3 layers to higher numbers of layers (solving the "vanishing information" problem).
Integration with larger models, like sequential models (HMMs) and translation models (alignment models or sequence-to-sequence models).
Probabilistic embeddings.
A unified model covering Bayesian models and neural networks.
Thank you!
Cohan Sujay Carlos & Geetanjali Rakshit
[email protected]