40
Learning Non-Linear Functions for Text Classification Cohan Sujay Carlos 1 Geetanjali Rakshit 2 Aiaioo Labs, Bangalore, India IIT-B Monash Research Academy, Mumbai, India December 20, 2016 Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification

Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

  • Upload
    others

  • View
    17

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Learning Non-Linear Functions for Text

Classification

Cohan Sujay Carlos1 Geetanjali Rakshit2

Aiaioo Labs, Bangalore, India

IIT-B Monash Research Academy, Mumbai, India

December 20, 2016

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 1/40

Page 2: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 2/40

Page 3: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 3/40

Page 4: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

XOR

-

6

0 1

0

1

e

eu

u(0,0)

(1,1)(0,1)

(1,0)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 4/40

Page 5: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

XOR

-

6

0 1

0

1

e

eu

u(0,0)

(1,1)(0,1)

(1,0)

@@@@@@@@@@@@

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 5/40

Page 6: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

XOR

-

6

0 1

0

1

e

eu

u(0,0)

(1,1)(0,1)

(1,0)������������

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 6/40

Page 7: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Motivation

Multilayer neural networks have been shown to be capable oflearning non-linear boundaries.

Questions:

Is there something special about neural networks?

Can probabilistic classifiers also learn non-linearboundaries?

In particular, can multilayer Bayesian models do so andunder what conditions?

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 7/40

Page 8: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Bayesian Models = Directed PGMs

x1x2

x4 x3

jjjj-

6

-

The joint probabilities factorise according to Equation 1.

P(x1, x2 . . . xn) =

( ∏1≤i≤n

P(xi |xa(i))

)(1)

where a(i) means ancestors of node i .

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 8/40

Page 9: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Naive Bayes

C

F jj?

The naive Bayes classifier is a special case two layers deep.

P(F ,C ) = P(F |C )P(C ) =∏f ∈F

P(f |C )P(C ) (2)

The root node represents a class label from setC = {c1, c2 . . . ck} and the leaf nodes the set of featuresF = {f1, f2 . . . fi}.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 9/40

Page 10: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 10/40

Page 11: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

3-Layer Bayesian Models

C

H

F jjj

?

?

In a 3-layer Bayesian model, there is also a set of hidden nodesH = {h1, h2 . . . hj}.

P(F ,H ,C ) = P(F |H ,C )P(H |C )P(C ) (3)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 11/40

Page 12: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

3-Layer Bayesian Models

c

h

F jjj

?

?

From P(F , h, c) = P(F |h, c)P(h|c)P(c) we obtain 4 and 5.

P(F , c) =∑h

P(F |h, c)P(h|c)P(c) (4)

P(F |c) =∑h

P(F |h, c)P(h|c) (5)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 12/40

Page 13: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

3-Layer Bayesian Modelsc1

h1

f1 jjj

?

?h2

j���

@@@R

Starting from f instead of F , we get 8.

P(f , h, c) = P(f |h, c)P(h|c)P(c) (6)

P(f , c) =∑h

P(f |h, c)P(h|c)P(c) (7)

P(f |c) =∑h

P(f |h, c)P(h|c) (8)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 13/40

Page 14: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Product of Sums

You can derive the likelihood as follows.

P(F |c) =∏f

P(f |c) (9)

Substituting Equation 8 into Equation 9, we get Equation 13.

P(F |c) =∏f

∑h

P(f |h, c)P(h|c) (10)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 14/40

Page 15: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Sum of Products

You can also derive the likelihood as follows.

P(F |h, c) =∏f

P(f |h, c) (11)

Substituting Equation 11 into Equation 5, we get Equation 14.

P(F |c) =∑h

(∏f

P(f |h, c)

)P(h|c) (12)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 15/40

Page 16: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Two Likelihood Equations

Product of Sums

P(F |c) =∏f

∑h

P(f |h, c)P(h|c) (13)

Sum of Products

P(F |c) =∑h

(∏f

P(f |h, c)

)P(h|c) (14)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 16/40

Page 17: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 17/40

Page 18: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of Linearity

If any classifier can be proved to be equivalent to amultinomial naive Bayes classifier, then it can only learn lineardecision boundaries.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 18/40

Page 19: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of Linearity

Consider a toy dataset consisting of two sentences, bothbelonging to the class Politics (P):

The United States and (TUSa)

The United Nations (TUN)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 19/40

Page 20: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of LinearityOne of the parameters a naive Bayes classifier would learn isP(The|Politics) = z .

P

T

z = 27

TUSaTUNj

j

?

If a directed PGM learns the same likelihoods, which is aproduct of these parameters, it can only learn linear decisionboundaries.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 20/40

Page 21: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of Linearity - Hard EM

The parameters of a multinomial 3-layer Bayesian classifier:

P TUSaTUN

h1 h2

T

a = 14

b = 13

m = 47

n = 37

TUNTUSa

j

j j

j

@@@@@@R

���

���

@@@@@@R

��

����

From Equation 13, z = am + bn = 14

47

+ 13

37

= 27

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 21/40

Page 22: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of Linearity - Hard EM

The parameters of a multinomial 3-layer Bayesian classifier:

P TUSaTUN

h1 h2

T

a = 00

b = 27

m = 07

n = 77

TUSaTUN

j

j j

j

@@@@@@R

���

���

@@@@@@R

��

����

From Equation 13, z = am + bn = 00

07

+ 27

77

= 27

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 22/40

Page 23: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Proof of Linearity - Soft EM

The parameters of a multinomial 3-layer Bayesian classifier:

P TUSaTUN

h1 h2

T

a =45

+ 15

3.8b =

15

+ 45

3.2

m = 45

47

+ 15

37

= 3.87

n = 15

47

+ 45

37

= 3.27

TUSa 45

TUN 15

TUSa 15

TUN 45

j

j j

j

@@@@@@R

���

���

@@@@@@R

��

����

From Equation 13, z = am + bn = 13.8

3.87

+ 13.2

3.27

= 27

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 23/40

Page 24: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 24/40

Page 25: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Shifted and Scaled XOR

-

6

1 3 5

1

3

5

e

eu

u(0,0)

(5,5)(1,5)

(5,1)

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 25/40

Page 26: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Sum of Products - Multinomial

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 26/40

Page 27: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Sum of Products - Gaussian

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 27/40

Page 28: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 28/40

Page 29: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Reuters & 20 Ng

Classifier R8 R52 20NgNaive Bayes 0.955 0.916 0.797MaxEnt 0.953 0.866 0.793SVM 0.946 0.864 0.711Non-linear 3-Layer Bayes 0.964 0.921 0.808

Table: Accuracy on Reuters R8, R52 and 20 Newsgroups datasets.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 29/40

Page 30: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Movie Reviews

Classifier Movie ReviewsNaive Bayes 0.7964± 0.0136MaxEnt 0.8355± 0.0149SVM 0.7830± 0.0128Non-linear 3-Layer Bayes 0.8438 ±0.0110

Table: Accuracy on Large Movie Review dataset.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 30/40

Page 31: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Movie Reviews

5 10 15 20 25 30 35

0.75

0.8

0.85

0.9

1000s of training reviews

Acc

urac

y

Naive Bayes

Non-Linear Hierarchical Bayes

MaxEnt

SVM

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 31/40

Page 32: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Movie Reviews

5 10 15 20 25 30 35

0.8

0.85

0.9

1000s of training reviews

Acc

urac

y

1 hidden node per class

2 hidden nodes per class

3 hidden nodes per class

4 hidden nodes per class

5 hidden nodes per class

B

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 32/40

Page 33: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

On the Query Classification Dataset

Classifier AccuracyNaive Bayes 0.721± 0.018MaxEnt 0.667± 0.028SVM 0.735± 0.032Non-Linear 3-Layer Bayes 0.711± 0.032

Table: Accuracy on query classification.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 33/40

Page 34: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Artificial Dataset

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 34/40

Page 35: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Accuracy on Artificial Dataset

Shape GNB GHBring 0.664 0.949dots 0.527 0.926XOR 0.560 0.985S 0.770 0.973

Table: Accuracy on the artificial dataset.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 35/40

Page 36: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Outline

1 Background

2 3-Layer Bayesian ModelsProduct of SumsSum of Products

3 Linearity of “Product of Sums”

4 Non-Linearity of “Sum of Products”

5 Experimental Results

6 Conclusions

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 36/40

Page 37: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Conclusions

We have shown that generative classifiers with a hiddenlayer are capable of learning non-linear decisionboundaries under the right conditions (independenceassumptions), and therefore might be said to be capableof deep learning.

We have also shown experimentally that multinomialnon-linear 3-layer Bayesian classifiers can outperformsome linear classification algorithms on some textclassification tasks.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 37/40

Page 38: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Errata

In our survey of prior work ...

Hierarchical Bayesian models are directed probabilisticgraphical models in which the directed acyclic graph is a tree(each node that is not the root node has only one immediateparent). In these models the ancestors a(i) of any node i donot contain any siblings (every pair of ancestors is also in anancestor-descendent relationship).

The above statement which is found in our paper is incorrect.Hierarchical Bayesian models including ours are generallyspeaking DAGs.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 38/40

Page 39: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Future Work

Comparison of performance with deep neural networks;evaluation on image processing tasks; other trainingalgorithms - Gibbs’ sampling (MCMC), variationalmethods.

Going from 3 layers to higher numbers of layers (solvingthe “vanishing information” problem).

Integration with larger models, like sequential models(HMMs) and translation models (alignment models orsequence to sequence models).

Probabilistic embeddings.

Unified model covering Bayesian models and neuralnetworks.

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 39/40

Page 40: Learning Non-Linear Functions for Text Classification · Learning Non-Linear Functions for Text Classi cation Cohan Sujay Carlos1 Geetanjali Rakshit2 Aiaioo Labs, Bangalore, India

Thank you!

Cohan Sujay Carlos & Geetanjali [email protected]

Cohan Sujay Carlos, Geetanjali Rakshit — Learning Non-Linear Functions for Text Classification 40/40