Improving the Classification of Unknown
Documents by Concept Graph
Morteza Mohagheghi
Reza Soltanpour
Azadeh Shakeri
ECE Department, University of Tehran, Tehran, Iran.
Agenda
Problem Definition
Introduction to Concept Graph
C.G. Aided Classification
A Sample Implementation
Assessment
Conclusion
Problem Definition
Training set → Feature selection → Representative words for each class (c1, c2, …, cn) → Automatic classification → Test set
Implicit assumption: Training set ~ Test set
The selected features are dependent on the training set.
An Overview of the Solution
Training set → Feature selection → Representative words for each class (c1, c2, …, cn) → Feature enrichment (using the Concept Graph) → Automatic classification → Test set
Our assumption: Training set ≠ Test set
Concept Graph: Definition
Definition: A weighted graph in which the nodes are terms and the edges are the semantic relationships between the terms
Application: keyword suggestion, query expansion
Representative Vector: The list of the words most related to a specific term in the concept graph
Representative vector for the term "Player" (related term, weight):
Coach .0102
Playground .0077
Football .0069
Newspaper .0056
Club .0052
Team .0046
… …
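As a minimal sketch of the representative vector defined above, the concept graph can be held as a map from each term to its weighted neighbours. The adjacency-map layout, the function name `representative_vector` and the parameter `k` are assumptions made for illustration; the slides do not say how the graph is stored. The weights are the ones listed for "Player".

```python
from typing import Dict, List, Tuple

# Hypothetical representation: term -> {related term: edge weight}.
ConceptGraph = Dict[str, Dict[str, float]]


def representative_vector(graph: ConceptGraph, term: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k terms most strongly related to `term`, heaviest edge first."""
    neighbours = graph.get(term, {})
    return sorted(neighbours.items(), key=lambda item: item[1], reverse=True)[:k]


# Weights taken from the "Player" example on the slide.
graph: ConceptGraph = {
    "Player": {
        "Coach": 0.0102, "Playground": 0.0077, "Football": 0.0069,
        "Newspaper": 0.0056, "Club": 0.0052, "Team": 0.0046,
    }
}
top3 = representative_vector(graph, "Player", k=3)
```

A term that is absent from the graph simply yields an empty vector in this sketch.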
Concept Graph: Construction method
NLP-based methods: accurate but costly
Statistical methods: language independent, computationally efficient
Recursive vector creation method: built on a rich corpus, e.g. Wikipedia
Concept Graph Aided method
1. Select the features from the training set (base set)
2. Select the top n features for each class
3. Create a rep. vector for each of those top n features
4. Extract the most frequent terms in the vectors
5. Normalize the terms from step 4 and add them to the "base set"
6. Classify the documents from a new resource
Training Phase
A Sample Implementation
Training set: Hamshahri: 1997-2002 (166,000 documents)
Concept Graph Resource: ISNA: 1997-2007 (500,000 documents)
Test set: Keyhan: 2007-2008 (3700 documents)
4 classes: Sports, Economy, Politics, Science
Step 1. Feature Selection
Mutual Information (MI): measures how much information the presence/absence of a term contributes to making the correct classification decision on c.
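The slide gives MI only in words. One common way to estimate it, shown here as a sketch rather than as the authors' implementation, uses the four document counts of a term/class contingency table. The function name and the count names `n11`, `n10`, `n01`, `n00` are assumptions introduced for this example.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """MI (in bits) between term presence and class membership.

    n11: documents that contain the term and belong to the class
    n10: documents that contain the term and do not belong to the class
    n01: documents that lack the term and belong to the class
    n00: documents that lack the term and do not belong to the class
    """
    n = n11 + n10 + n01 + n00
    # Each cell: (joint count, term-marginal count, class-marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # an empty cell contributes nothing (0 * log 0 is taken as 0)
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total
```

A term whose presence is independent of the class scores 0, and a term that perfectly predicts a balanced class scores 1 bit, so ranking terms by this value per class gives a feature list of the kind used in Step 1.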
Feature selection from the training set (Hamshahri: 1997-2002, 166,000 docs) produces the selected features for each class: SportsFeatures, EconomyFeatures, PoliticsFeatures, ScienceFeatures
Step 2, 3. Rep. Vector Construction
Step 2. Select the top 10 features for each class.
EconomyFeatures: Price change, Rena, ChimyDaroo, Chokopars, carton, Document, Sepanta, DarooPakhsh, tire, Lamiran
Step 3. Extract the representative vector for each term (candidate words):
Price change: Capital, Iran, National, Income, …
Rena: Income, Country, Capital, Iran, …
Document: Income, Country, Capital, Iran, …
Chokopars: Income, Country, Capital, Iran, …
…
Step 4. Refine the Rep Vectors
Step 4 extracts the most frequent words in the vectors.
Indicator function:
$$I(t, \mathrm{vector}_f) = \begin{cases} 1 & \text{if } \mathrm{vector}_f \text{ contains } t \\ 0 & \text{otherwise} \end{cases}$$
Term frequency in vectors (tfv_t):
$$\mathrm{tfv}_t = \sum_{f=1}^{10} I(t, \mathrm{vector}_f)$$
tfv  term
7    Capital
6    Iran
5    development
4    company
4    industrial
4    economic
3    strategy
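The count defined on this slide (the number of representative vectors that contain a term) can be sketched as follows. The function name is hypothetical, and the sample vectors contain only the words that are visible in the Step 2, 3 example, which is truncated on the slide, so the resulting counts are illustrative and do not reproduce the table above.

```python
from collections import Counter
from typing import Dict, List


def term_frequency_in_vectors(vectors: Dict[str, List[str]]) -> Counter:
    """Count, for every term, how many representative vectors contain it."""
    counts: Counter = Counter()
    for related_terms in vectors.values():
        # set(): a vector contributes at most 1 per term, matching the indicator I(t, vector_f).
        counts.update(set(related_terms))
    return counts


# Only the words shown on the slide; the real vectors are longer.
vectors = {
    "Price change": ["Capital", "Iran", "National", "Income"],
    "Rena": ["Income", "Country", "Capital", "Iran"],
    "Document": ["Income", "Country", "Capital", "Iran"],
    "Chokopars": ["Income", "Country", "Capital", "Iran"],
}
tfv = term_frequency_in_vectors(vectors)
```

Calling `tfv.most_common(n)` then yields the candidate words with the highest counts, which are the terms passed on to Step 5.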
Step 5, 6. Feature Normalization, Classification
Multinomial Naive Bayes as the base:
in which P(tk|c) is the conditional probability of occurrence of term tk in class c
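The formula itself is not preserved in this text. The standard multinomial Naive Bayes decision rule, which is consistent with the description of P(tk|c) above, is given here as an assumption about what the slide showed; the symbols $c_{\mathrm{map}}$, $C$ (the set of classes) and $n_d$ (the number of terms in document $d$) are introduced for this sketch.

```latex
c_{\mathrm{map}} = \arg\max_{c \in C} \left[ \log \hat{P}(c) + \sum_{1 \le k \le n_d} \log \hat{P}(t_k \mid c) \right]
```

Working with sums of logarithms rather than a product of probabilities avoids floating-point underflow on long documents.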
Step 5. Normalize the terms from step 4 and add them to the "base set".
Step 6. Classify the documents from a new resource.
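A minimal sketch of multinomial Naive Bayes restricted to a feature set is shown below. It is not the authors' implementation: the add-one smoothing, the function names `train` and `classify`, the toy documents, and the choice to return `None` for a document that contains none of the features are all assumptions made for illustration. In the method described here, `features` would be the base set after the step 4 terms have been added.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


def train(docs: List[Tuple[str, List[str]]], features: Set[str]):
    """Estimate log priors and add-one-smoothed log P(t|c) over `features`."""
    class_docs: Counter = Counter()
    term_counts: Dict[str, Counter] = defaultdict(Counter)
    for label, tokens in docs:
        class_docs[label] += 1
        term_counts[label].update(t for t in tokens if t in features)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    log_cond = {}
    for c in class_docs:
        total = sum(term_counts[c].values())
        log_cond[c] = {
            t: math.log((term_counts[c][t] + 1) / (total + len(features)))
            for t in features
        }
    return log_prior, log_cond


def classify(tokens: List[str], log_prior, log_cond, features: Set[str]) -> Optional[str]:
    """Return the highest-scoring class, or None if no token is a known feature."""
    known = [t for t in tokens if t in features]
    if not known:
        return None  # assumption: such a document is left unclassified
    scores = {c: log_prior[c] + sum(log_cond[c][t] for t in known) for c in log_prior}
    return max(scores, key=scores.get)


# Toy data, invented for this example.
features = {"football", "team", "coach", "capital", "income", "price"}
docs = [
    ("Sports", ["football", "team", "coach"]),
    ("Economy", ["capital", "income", "price"]),
]
log_prior, log_cond = train(docs, features)
label = classify(["team", "coach", "yesterday"], log_prior, log_cond, features)
```

Enlarging `features` gives a document from a new source more chances to contain at least one known term, which is one plausible reading of why enrichment reduces the number of unclassified documents.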
Assessment:
Performance:
                     Total recall  Total precision  Avg. recall  Avg. precision
Without enrichment   0.52          0.78             0.49         0.70
With enrichment      0.64          0.78             0.57         0.71

Recall: unclassified documents
Without enrichment   1219
With enrichment      680
Assessment:
Performance comparison with a Persian classifier:
                  Total recall  Total precision
4-gram            0.78          0.68
With enrichment   0.64          0.78
Conclusion and future work
We proposed a classification method that:
is not dependent on the training set
improves the classification recall
has little impact on the performance
is somewhat language independent
However, there are some subtleties:
The concept graph suggests very general words
The normalization phase must be done precisely
This version of the concept graph works only with single words (e.g. "economic development" is treated as two separate words)
Future work:
Implementing the method using several classification and feature selection algorithms
Studying the negative impact of Farsi language problems on the method (we believe this impact is small)
Using a richer corpus (e.g. Farsi Wikipedia) for C.G. construction
Discussion & Questions
Basic Classification Algorithm
Finding the best class for a given document
Multinomial Naive Bayes as the base:
in which P(tk|c) is the conditional probability of occurrence of term tk in class c
Feature selection: extracting the features
MI: