Incremental Context Mining for Adaptive Document Classification
Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Rey-Long Liu, Yun-Ling Lu
Outline
Motivation
Objective
Introduction
Overview of the approach
Incremental context mining for ACclassifier
Experiments
Conclusions
Personal Opinion
Review
Motivation
Adaptive document classification (ADC) adapts a DC system to the evolving contextual requirements of each document category, so that input documents can be classified based on their contexts of discussion.
Objective
1. CR terms should be mined by analyzing multiple documents from multiple categories.
2. Inappropriate features may introduce inefficiency and errors.
3. ADC may serve as the basis for supporting efficient and high-precision DC.
1. Introduction
ACclassifier (Adaptive Context-based Classifier) has two components:
1. An incremental context miner
2. A document classifier
Both components work on a given text hierarchy in which a node corresponds to a document category.
2. Overview of the approach
CR of College of Management
CR of Information Management, CR of Financial Management
CR of MIS, CR of DSS, CR of Management
3-1. An incremental context miner
College of Management
Information Management, Financial Management
MIS, DSS, Management
3-2. An incremental context miner
Information Management
MIS, DSS
Computer 5/20, Dos 10/20, EC 2/20
Manage 5/30, BtoB 3/20, Computer 10/40
Notebook 3/40, Computer 3/15
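The per-document counts on this slide are merged into per-category counts as documents arrive. A minimal sketch of that incremental bookkeeping follows. The class name `CategoryStats`, its methods, and the two example documents are hypothetical illustrations, not the authors' implementation or data.

```python
from collections import Counter


class CategoryStats:
    """Running term counts for one category (hypothetical helper)."""

    def __init__(self):
        self.term_counts = Counter()
        self.total_words = 0

    def add_document(self, term_counts, doc_length):
        # Only the new document is read; earlier documents are never revisited.
        self.term_counts.update(term_counts)
        self.total_words += doc_length

    def frequency(self, term):
        # Relative frequency of a term among all words seen in this category.
        if self.total_words == 0:
            return 0.0
        return self.term_counts[term] / self.total_words


# Illustrative counts only (assumed, not taken from the experiments).
mis = CategoryStats()
mis.add_document({"computer": 5, "dos": 10}, doc_length=20)
mis.add_document({"computer": 10, "notebook": 3}, doc_length=40)
print(mis.term_counts["computer"], mis.total_words)
```

Because each update touches only the incoming document, the cost of adding a document does not depend on how many documents were mined before it.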
3-3. CR
MIS
Computer 15/90, Dos 10/90
Manage 5/90, EC 3/90
BtoB 3/90, Notebook 3/90
CR: Contextual Requirement of the category
DSS
Computer 3/15, EC 10/15
Strength: the strength of a word w serving as a context word for the documents under category c
TFIDF (Term Frequency * Inverse Document Frequency)
3-4. TFIDF
3-5. TFIDF
Strength(w_computer, c_MIS) =
Strength(w_dos, c_MIS) =
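The strength formula itself did not survive on these slides, so the sketch below assumes a normalized-frequency form: the word's relative frequency in the category, divided by the sum of its relative frequencies over all sibling categories, times the number of siblings. This assumed form happens to reproduce S(computer)=0.909 and S(dos)=2 for MIS from the counts on the 3-3 slide (Computer 15/90 and Dos 10/90 in MIS, Computer 3/15 in DSS), but it is not confirmed as the authors' definition. The function name `strength` and the `sibling_freqs` layout are hypothetical.

```python
def strength(term, category, sibling_freqs):
    """Assumed form: own frequency / sum of sibling frequencies * number of siblings.

    sibling_freqs maps each sibling category to {term: relative frequency}.
    """
    own = sibling_freqs[category].get(term, 0.0)
    total = sum(freqs.get(term, 0.0) for freqs in sibling_freqs.values())
    if total == 0.0:
        return 0.0  # the term occurs in none of the sibling categories
    return own / total * len(sibling_freqs)


# Relative frequencies taken from the 3-3 slide.
freqs = {
    "MIS": {"computer": 15 / 90, "dos": 10 / 90},
    "DSS": {"computer": 3 / 15, "ec": 10 / 15},
}
print(round(strength("computer", "MIS", freqs), 3))
print(strength("dos", "MIS", freqs))
```

Under this assumption, a word that appears in only one sibling gets the maximum strength (the number of siblings), and a word spread evenly across siblings gets a strength of 1.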
3-6. The incremental context miner
Information Management
MIS, DSS
S(computer)=0.909
S(dos)=2
S(EC)=0.476
S(computer)=0.022
S(computer)>0.909: Electrical Engineering
3-7.An incremental context miner
4-1. DOA
Given a document d to be classified, the basic idea is to compute the degree of acceptance (DOA) of d for each category c. The DOA is computed from the strengths of d's distinct words on c.
4-2. Two phases of classifier
(1) The estimation of DOA for each category.
(2) The identification of the winner category.
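The second phase can be sketched as picking the category with the highest DOA once phase one has produced a value per category. This is a simplified assumption: the slides do not state the exact winner rule, and the function name `identify_winner` and the DOA values below are hypothetical.

```python
def identify_winner(doa_by_category):
    """Phase 2 (assumed rule): return the category with the highest DOA."""
    if not doa_by_category:
        return None  # nothing to choose from
    return max(doa_by_category, key=doa_by_category.get)


# Illustrative phase-1 output (assumed values).
doa = {"MIS": 0.9545, "DSS": 0.011}
print(identify_winner(doa))
```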
4-3. Estimation of DOA for each C
DOA of College of Management
DOA of Information Management, DOA of Financial Management
DOA of MIS, DOA of DSS, DOA of Management
4-4. DOA
Frequency: 5
D1: 5000
minSupport: 0.001
If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d.
4-5. Constraint I
New Di
Computer 20/40, DOS 10/40, Java 2/40
Mouse 3/40, Delphi 1/40
4-6. Constraint II
Information Management courses
Operating Systems: S(DOS)=2
Algorithms: S(DOS)=0.9982
Information Networks: S(DOS)=0.6
E-Commerce: S(DOS)=0.003
Data Structures: S(DOS)=1.112
4-7. Given a document to be classified
MIS, DSS, New Di
Computer 20/40, DOS 10/40
S(computer)=0.909
S(dos)=2
S(EC)=0.476
S(computer)=0.022
If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d.
4-8. DOA
Computer: 0.909 * 20/40 = 0.4545
DOS: 2 * 10/40 = 0.5
DOA_MIS of D_new = 0.4545 + 0.5 = 0.9545
DOA of College of Management
DOA of Information Management, DOA of Financial Management
DOA of MIS, DOA of DSS, DOA of Management
DOA of Management Practice
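The DOA arithmetic on this slide (0.909 * 20/40 + 2 * 10/40) can be sketched as a sum, over the document's words, of the word's strength on the category times its relative frequency in the document. The function name `degree_of_acceptance` and the dictionary layout are hypothetical, and the sketch covers only the weighted sum shown on the slide, not any further constraints the classifier may apply.

```python
def degree_of_acceptance(doc_counts, doc_length, strengths):
    """Sum of strength(word, category) * frequency(word, document)."""
    return sum(
        strengths.get(term, 0.0) * count / doc_length
        for term, count in doc_counts.items()
    )


# Values from the slides: strengths on MIS and the new document's counts.
mis_strengths = {"computer": 0.909, "dos": 2.0}
new_doc = {"computer": 20, "dos": 10}
doa_mis = degree_of_acceptance(new_doc, 40, mis_strengths)
print(round(doa_mis, 4))
```

Words with no recorded strength on the category contribute nothing, so only context words the miner has already found for that category can raise its DOA.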
4-9. Complete the DOA of all categories
4-9. The document classifier
5-1. correct classification
Built from the 1100 documents used for initial training.
5-2. correct classification
Baselines: allowed to use 5000 features in their feature sets.
5-3. correct classification
Using all training documents to build their feature set and classifiers.
5-4. Consider the test document entitled
“Setting up Email in DOS with today’s ISP using a dialup PPP TCP/IP connection”.
Baseline systems: “Software”, “Windows”, and “Operating Systems”
ACclassifier: “TCP/IP”, “connection”, “computer networking”, “userID”
5-5. cumulative training & testing time(sec.)
The time spent by ACclassifier grew more slowly once about 1400 training documents had been entered.
5-6. cumulative training & testing time(sec.)
The time spent by ACclassifier grew more slowly once about 1400 training documents had been entered.
6. Conclusions
1. Efficient mining of the contextual requirements for high-precision DC.
2. Incremental mining without reprocessing previous documents.
3. Evolutionary maintenance of the feature set.
4. Efficient and fault-tolerant hierarchical DC.
7. Personal Opinion
It’s acceptable on purity in hierarchy.
8. Review