Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi ([email protected]) Reza Soltanpour ([email protected]) Azadeh Shakeri ([email protected]) ECE Department, University of Tehran, Tehran, Iran.


Page 1: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Improving the Classification of Unknown Documents by Concept Graph

Morteza Mohagheghi ([email protected])

Reza Soltanpour ([email protected])

Azadeh Shakeri ([email protected])

ECE Department, University of Tehran, Tehran, Iran.

Page 2: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Agenda

Problem Definition
Introduction to Concept Graph
C.G. Aided Classification
A Sample Implementation
Assessment
Conclusion

Page 3: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Problem Definition

[Diagram] Training set → Feature selection → Representative words for each class (c1, c2, …, cn) → Automatic classification → Test set

Implicit assumption: Training set ~ Test set

Dependent on the training set

Page 4: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

An Overview of the Solution

[Diagram] Training set → Feature selection → Representative words for each class (c1, c2, …, cn) → Automatic classification → Test set

Concept Graph → Feature Enrichment

Our assumption: Training set ≠ Test set


Page 6: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Concept Graph: Definition

Definition: A weighted graph in which the nodes are terms and the edges are the semantic relationships between the terms

Application: keyword suggestion, query expansion

Representative Vector: The list of the words most related to a specific term in the concept graph

Example: representative vector of "Player"

Word          Weight
Coach         .0102
Playground    .0077
Football      .0069
Newspaper     .0056
Club          .0052
Team          .0046
…             …
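As a rough illustration of the definition, a concept graph can be held as a nested dictionary and a representative vector read off as the top-weighted neighbours of a term. The function name `representative_vector`, the parameter `k`, and the dictionary layout are assumptions made for this sketch; only the "Player" weights come from the slide.

```python
def representative_vector(graph, term, k=3):
    """Return the k neighbours of `term` with the highest edge weight.

    `graph` maps each term to a dict {neighbour: weight}. This layout is
    an assumption of the sketch, not something the slides specify.
    """
    neighbours = graph.get(term, {})
    ranked = sorted(neighbours.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Weights for "Player" as listed on the slide.
concept_graph = {
    "Player": {
        "Coach": 0.0102,
        "Playground": 0.0077,
        "Football": 0.0069,
        "Newspaper": 0.0056,
        "Club": 0.0052,
        "Team": 0.0046,
    },
}

top3 = representative_vector(concept_graph, "Player", k=3)
```

A term that is missing from the graph simply yields an empty vector, which keeps later steps free of special cases.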

Page 7: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Concept Graph: Construction method

NLP-based methods: accurate but costly

Statistical methods: language independent, computationally efficient

Recursive vector creation method: built on the basis of a rich corpus, e.g. Wikipedia
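The slides do not spell out the recursive vector creation method, so the sketch below swaps in a much simpler statistical stand-in: edge weights taken from document-level co-occurrence counts, normalised per term. It is meant only to show why a statistical construction needs nothing but tokenised text; `build_cooccurrence_graph` and the toy documents are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence_graph(documents):
    """Build {term: {neighbour: weight}} from tokenised documents.

    Weight = share of the term's co-occurrences that involve the neighbour.
    This is a plain co-occurrence heuristic, NOT the recursive vector
    creation method named on the slide.
    """
    pair_counts = defaultdict(Counter)
    for tokens in documents:
        # Count each unordered pair of distinct terms once per document.
        for a, b in combinations(sorted(set(tokens)), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1

    graph = {}
    for term, neighbours in pair_counts.items():
        total = sum(neighbours.values())
        graph[term] = {n: count / total for n, count in neighbours.items()}
    return graph


docs = [
    ["player", "coach", "team"],
    ["player", "coach", "club"],
    ["player", "newspaper"],
]
graph = build_cooccurrence_graph(docs)
```

Because the procedure only counts tokens, it has no language-specific component beyond tokenisation, which is the property the slide attributes to statistical methods.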



Page 10: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Concept Graph Aided method

1. Select the features from the training set (base set)
2. Select the top n features for each class
3. Create a rep. vector for each of those top n features
4. Extract the most frequent terms in the vectors
5. Normalize the step 4 terms and add them to the "base set"
6. Classify the documents from a new resource

Training Phase

Page 11: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

A Sample Implementation

Training set: Hamshahri: 1997-2002 (166,000 documents)

Concept Graph Resource: ISNA: 1997-2007 (500,000 documents)

Test set: Keyhan: 2007-2008 (3700 documents)

4 classes: Sports, Economy, Politics, Science

Page 12: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Step 1. Feature Selection

Mutual Information (MI): measures how much information the presence/absence of a term contributes to making the correct classification decision on c.
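A minimal sketch of how MI could be computed for one term and one class from a 2x2 table of document counts. The function name and the argument names (`n11`, `n10`, `n01`, `n00`) are assumptions for illustration; the slides do not show the implementation.

```python
import math


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between a term and a class.

    n11: documents that contain the term and belong to the class
    n10: documents that contain the term and do not belong to the class
    n01: documents that lack the term and belong to the class
    n00: documents that lack the term and do not belong to the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # an empty cell contributes nothing
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total


perfect = mutual_information(50, 0, 0, 50)       # term identifies the class exactly
independent = mutual_information(25, 25, 25, 25)  # term carries no information
```

A term that appears in exactly the documents of the class scores one full bit here, while a term spread evenly across classes scores zero, which is why ranking terms by MI surfaces class-specific vocabulary.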

Feature selection from the training set (Hamshahri: 1997-2002, 166,000 docs)

Selected features: one set per class (Sports features, Economy features, Politics features, Science features)

Page 13: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Step 2, 3. Rep. Vector Construction

Step 2: Select top 10 features for each class

Economy features: Price change, Rena, ChimyDaroo, Chokopars, carton, Document, Sepanta, DarooPakhsh, tire, Lamiran

Step 3: Extract the representative vector for each term (candidate words)

Term            Representative vector
Price change    Capital, Iran, National, Income
Rena            Income, Country, Capital, Iran
Document        Income, Country, Capital, Iran
Chokopars       Income, Country, Capital, Iran

Page 14: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Step 4. Refine the Rep Vectors

Step 4: Extract the most frequent words in the vectors

Indicator function:

$$I(t, \mathrm{vector}_f) = \begin{cases} 1 & \text{if } \mathrm{vector}_f \text{ contains } t \\ 0 & \text{otherwise} \end{cases}$$

Term frequency in vectors (tfv_t):

$$\mathit{tfv}_t = \sum_{f=1}^{10} I(t, \mathrm{vector}_f)$$

tfv    term
7      Capital
6      Iran
5      development
4      company
4      industrial
4      economic
3      strategy
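The tfv count amounts to asking, for each word, how many representative vectors contain it. The sketch below applies that to the four Economy vectors shown on the Step 2, 3 slide (only four of the ten are listed there, so the counts differ from the table above). The function name and the `min_count` threshold are hypothetical choices for illustration.

```python
from collections import Counter


def term_frequency_in_vectors(rep_vectors):
    """Count, for each word, the number of representative vectors containing it."""
    counts = Counter()
    for words in rep_vectors.values():
        counts.update(set(words))  # set(): a vector counts at most once per word
    return counts


# The four Economy representative vectors visible on the slide.
rep_vectors = {
    "Price change": ["Capital", "Iran", "National", "Income"],
    "Rena": ["Income", "Country", "Capital", "Iran"],
    "Document": ["Income", "Country", "Capital", "Iran"],
    "Chokopars": ["Income", "Country", "Capital", "Iran"],
}

tfv = term_frequency_in_vectors(rep_vectors)

# Hypothetical cut-off: keep words that occur in at least 3 vectors.
min_count = 3
candidates = sorted(word for word, count in tfv.items() if count >= min_count)
```

Words shared by many vectors are the ones most likely to describe the class as a whole rather than a single feature, which is what makes them candidates for enrichment.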

Page 15: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Step 5, 6. Feature Normalization, Classification

Multinomial Naive Bayes as the base:

in which P(tk|c) is the conditional probability of occurrence of term tk in class c

Step 5: Normalize the step 4 terms and add them to the "base set"

Step 6: Classify the documents from a new resource
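A compact sketch of a multinomial Naive Bayes classifier restricted to a feature vocabulary. Add-one smoothing, the function names, the toy data, and the rule that a document with no known feature is left unclassified (`None`) are all assumptions of this sketch; the slides do not state how these details were handled.

```python
import math
from collections import Counter


def train_multinomial_nb(labeled_docs, vocabulary):
    """Estimate log P(c) and log P(t|c) with add-one smoothing (an assumption)."""
    class_doc_counts = Counter()
    term_counts = {}
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        counts = term_counts.setdefault(label, Counter())
        counts.update(t for t in tokens if t in vocabulary)

    n_docs = sum(class_doc_counts.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_doc_counts.items()}
    log_cond = {}
    for c in class_doc_counts:
        denominator = sum(term_counts[c].values()) + len(vocabulary)
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / denominator) for t in vocabulary
        }
    return log_prior, log_cond


def classify(tokens, log_prior, log_cond, vocabulary):
    """Return the highest-scoring class, or None if no feature term occurs."""
    known = [t for t in tokens if t in vocabulary]
    if not known:
        return None  # assumption: such documents stay unclassified
    scores = {
        c: log_prior[c] + sum(log_cond[c][t] for t in known) for c in log_prior
    }
    return max(scores, key=scores.get)


training = [
    (["team", "coach", "match"], "Sports"),
    (["coach", "player", "team"], "Sports"),
    (["capital", "income", "company"], "Economy"),
    (["income", "price", "capital"], "Economy"),
]
vocabulary = {"team", "coach", "capital", "income"}
log_prior, log_cond = train_multinomial_nb(training, vocabulary)

sports_doc = classify(["coach", "team", "goal"], log_prior, log_cond, vocabulary)
economy_doc = classify(["income", "capital"], log_prior, log_cond, vocabulary)
unknown_doc = classify(["weather"], log_prior, log_cond, vocabulary)
```

Under this reading, enlarging the vocabulary with enriched terms gives documents from a new source more chances to contain at least one known feature.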

Page 16: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Assessment:

Performance:

                      Total recall   Total precision   Avg. recall   Avg. precision
Without enrichment    0.52           0.78              0.49          0.70
With enrichment       0.64           0.78              0.57          0.71

Recall: unclassified documents

Without enrichment    1219
With enrichment       680
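The reported figures can be cross-checked against the test-set size from the sample implementation (3700 Keyhan documents). The check below assumes that total precision is correct decisions divided by classified documents and total recall is correct decisions divided by all test documents; the slides do not define the two measures, so this reading is an assumption.

```python
TEST_SET_SIZE = 3700  # Keyhan test set size from the sample implementation


def implied_total_recall(unclassified, total_precision, n_docs=TEST_SET_SIZE):
    """Recall implied by precision and the number of unclassified documents.

    Assumes precision = correct / classified and recall = correct / n_docs.
    """
    classified = n_docs - unclassified
    correct = total_precision * classified
    return correct / n_docs


recall_without = implied_total_recall(unclassified=1219, total_precision=0.78)
recall_with = implied_total_recall(unclassified=680, total_precision=0.78)
```

Under that assumption the implied values round to the reported 0.52 and 0.64, which suggests that the recall gain comes from classifying more documents at unchanged precision.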

Page 17: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Assessment:

Performance comparison with a Persian classifier:

                   Total recall   Total precision
4-gram             0.78           0.68
With enrichment    0.64           0.78

Page 18: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Conclusion and future work

We proposed a classification method that:

is not dependent on the training set
improves the classification recall
has little impact on the performance
is somewhat language independent

Page 19: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Conclusion and future work

However, there are some subtleties:

The concept graph suggests very general words
The normalization phase must be done precisely
This version of the concept graph works only with single words (e.g. "economic development" is treated as two separate words)

Page 20: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Conclusion and future work

Future work:

Implementing the method using several classification and feature selection algorithms
Studying the negative impact of Farsi language problems on the method (we believe it is small)
Using a richer corpus (e.g. Farsi Wikipedia) for C.G. construction

Page 21: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Discussion & Questions

Page 22: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Basic Classification Algorithm

Finding the best class for a given document

Multinomial Naive Bayes as the base:

in which P(tk|c) is the conditional probability of occurrence of term tk in class c
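The decision rule itself is not reproduced in this transcript. The standard multinomial Naive Bayes rule, which matches the description of P(tk|c) above, is usually written as follows; the symbols d (the document), n_d (the number of tokens in d), C (the set of classes) and c_map (the selected class) are taken from the common textbook formulation and are assumptions here, not notation confirmed by the slides.

$$c_{map} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \; P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

In practice the product is usually evaluated as a sum of logarithms, $\log P(c) + \sum_{1 \le k \le n_d} \log P(t_k \mid c)$, to avoid floating-point underflow on long documents.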

Page 23: Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour

Feature selection: extracting the features

MI:
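The MI formula is not reproduced in this transcript. For a term t and a class c, mutual information is commonly written as below, where U is a random variable that is 1 if a document contains t and 0 otherwise, and C is 1 if the document belongs to c and 0 otherwise. These symbols follow the usual textbook formulation and are assumptions, not notation confirmed by the slides.

$$I(U; C) = \sum_{e_t \in \{1, 0\}} \sum_{e_c \in \{1, 0\}} P(U = e_t, C = e_c) \, \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t) \, P(C = e_c)}$$

The probabilities are typically estimated from document counts in the training set, so that each of the four terms corresponds to one cell of the term/class contingency table.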
