Incremental Context Mining for Adaptive Document Classification




Advisor: Dr. Hsu
Graduate: Chien-Shing Chen
Authors: Rey-Long Liu, Yun-Ling Lu

Outline

Motivation
Objective
Introduction
Overview of the approach
Incremental context mining for ACclassifier
Experiments
Conclusions
Personal Opinion
Review

Motivation

Adaptive document classification (ADC) adapts a document classification (DC) system to the evolving contextual requirement (CR) of each document category, so that input documents can be classified according to their contexts of discussion.

Objective

1. CR terms should be mined by analyzing multiple documents from multiple categories.

2. Inappropriate features may introduce inefficiency and errors.

3. ADC may serve as the basis for efficient and high-precision DC.

1.Introduction

ACclassifier (Adaptive Context-based Classifier) has two components:
1. An incremental context miner
2. A document classifier

Both components work on a given text hierarchy in which a node corresponds to a document category.

2. Overview of the approach

CR of College of Management

CR of Information Management, CR of Financial Management

CR of MIS, CR of DSS, CR of Management

3-1. An incremental context miner

College of Management

Information Management, Financial Management

MIS, DSS, Management

3-2. An incremental context miner

Information Management

MIS, DSS

Computer 5/20, Dos 10/20, EC 2/20

Manage 5/30, BtoB 3/20

Computer 10/40, Notebook 3/40

Computer 3/15
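The per-document counts above can be folded into per-category totals one document at a time. The following is a minimal sketch, assuming that the documents of 20, 30 and 40 words belong to MIS (their lengths add up to the 90 words used in the MIS totals). The class `CategoryCounts` and its methods are hypothetical names, not part of ACclassifier, and only some of the words are included.

```python
from collections import Counter


class CategoryCounts:
    """Running word counts for one category (hypothetical helper)."""

    def __init__(self):
        self.word_counts = Counter()
        self.total_words = 0

    def add_document(self, word_counts, doc_length):
        # Fold one new document into the running totals;
        # earlier documents are never re-read.
        self.word_counts.update(word_counts)
        self.total_words += doc_length

    def frequency(self, word):
        # Relative frequency of the word within the category so far.
        if self.total_words == 0:
            return 0.0
        return self.word_counts[word] / self.total_words


# Assumed assignment of the slide's documents to MIS.
mis = CategoryCounts()
mis.add_document({"computer": 5, "dos": 10}, doc_length=20)
mis.add_document({"manage": 5}, doc_length=30)
mis.add_document({"computer": 10, "notebook": 3}, doc_length=40)

print(mis.word_counts["computer"], mis.total_words)  # 15 90
```

Keeping only running totals per category is one way to add documents without reprocessing earlier ones.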

3-3. CR

CR: Contextual Requirement of the category

MIS: Computer 15/90, Dos 10/90, Manage 5/90, EC 3/90, BtoB 3/90, Notebook 3/90

DSS: Computer 3/15, EC 10/15

Strength: the strength of w serving as a context word for the documents under c

TFIDF (Term Frequency * Inverse Document Frequency)

3-4. TFIDF

3-5. TFIDF

Strength(Wcomputer,CMIS)=

Strength(Wdos,CMIS)=
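The formulas on these slides did not survive extraction. The sketch below uses an assumed TFIDF-style form, Strength(w, c) = B * P(w|c) / (sum of P(w|ci) over c and its siblings), where B is the number of categories in the sibling group. This form is an inference from the numbers in the slides, not a quotation of the authors' definition: with the MIS and DSS counts it gives S(computer)=0.909 and S(dos)=2 for MIS. The remaining slide values are not checked here. The function name and data layout are hypothetical.

```python
def strength(word, category, counts):
    """Assumed form: B * P(w|c) / sum of P(w|c_i) over c and its siblings.

    counts maps each category in the sibling group to a pair
    (word_counts, total_words).
    """
    def p(cat):
        word_counts, total = counts[cat]
        return word_counts.get(word, 0) / total

    denom = sum(p(cat) for cat in counts)
    if denom == 0:
        # The word occurs in none of the sibling categories.
        return 0.0
    return len(counts) * p(category) / denom


# Category totals taken from the MIS and DSS counts in the slides.
siblings = {
    "MIS": ({"computer": 15, "dos": 10}, 90),
    "DSS": ({"computer": 3}, 15),
}

print(round(strength("computer", "MIS", siblings), 3))  # 0.909
print(strength("dos", "MIS", siblings))                 # 2.0
```

Under this assumed form, a word that appears in only one category of the group gets the maximum strength B, while a word spread evenly across the group gets strength 1.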

3-6. The incremental context miner

Information Management

MIS, DSS

S(computer)=0.909

S(dos)=2

S(EC)=0.476

S(computer)=0.022

S(computer)>0.909 (Electrical Engineering)

3-7.An incremental context miner

4-1. DOA

Given a document d to be classified, the basic idea is to compute its degree of acceptance (DOA) by each category c. The DOA is computed from the strengths of d's distinct words on c.

4-2. Two phases of classifier

(1) The estimation of DOA for each category.
(2) The identification of the winner category.

4-3. Estimation of DOA for each C

DOA of College of Management

DOA of Information Management, DOA of Financial Management

DOA of MIS, DOA of DSS, DOA of Management

4-4. DOA

Frequency: 5, D1: 5000, minSupport: 0.001

If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d.

4-5. Constraint I

New Di: Computer 20/40, DOS 10/40, Java 2/40, Mouse 3/40, Delphi 1/40

4-6. Constraint II

Information Management courses

Operating Systems: S(DOS)=2

Algorithms: S(DOS)=0.9982

Information Networks: S(DOS)=0.6

E-Commerce: S(DOS)=0.003

Data Structures: S(DOS)=1.112

4-7. Given a document to be classified

MIS DSS New Di

Computer 20/40, DOS 10/40

S(computer)=0.909

S(dos)=2

S(EC)=0.476

S(computer)=0.022

If w is a strong context word in c and occurs many times in d, c is more likely to “accept” d.

4-8. DOA

DOA_MIS of D_new

Computer: 0.909 * 20/40 = 0.4545

DOS: 2 * 10/40 = 0.5

DOA_MIS = 0.4545 + 0.5 = 0.9545
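The arithmetic above can be written as a sum, over the document's distinct words, of each word's strength on the category times its relative frequency in the document. This is a minimal sketch of that reading of the worked example. The function name is hypothetical, and treating words without a known strength as contributing zero is an assumption that the slides do not cover.

```python
def degree_of_acceptance(doc_counts, doc_length, strengths):
    """Sum strength(w, c) * frequency of w in d over d's distinct words.

    Words with no known strength in the category contribute nothing
    (an assumption; the slides do not show this case).
    """
    return sum(
        strengths.get(word, 0.0) * count / doc_length
        for word, count in doc_counts.items()
    )


mis_strengths = {"computer": 0.909, "dos": 2.0}  # values from the slides
d_new = {"computer": 20, "dos": 10}              # counts out of 40 words

doa_mis = degree_of_acceptance(d_new, 40, mis_strengths)
print(round(doa_mis, 4))  # 0.9545
```

A word that is both strong in the category and frequent in the document raises the DOA the most, which matches the statement that such a category is more likely to accept the document.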

DOA of College of Management

DOA of Information Management, DOA of Financial Management

DOA of MIS, DOA of DSS, DOA of Management

DOA of Management Practice

4-9. Complete the DOA of all categories

4-9. The document classifier

5-1. correct classification

Built from the 1100 documents used for initial training.

5-2. correct classification

Baseline: the baseline systems were allowed to use 5000 features in their feature sets.

5-3. correct classification

Using all training documents to build their feature set and classifiers.

5-4. Consider the test document entitled

“Setting up Email in DOS with today’s ISP using a dialup PPP TCP/IP connection”.

Baseline systems: “Software”, “Windows”, and “Operating Systems”

ACclassifier: “TCP/IP”, “connection”, “computer networking”, “userID”

5-5. cumulative training & testing time(sec.)

The time spent by ACclassifier grew more slowly once about 1400 training documents had been entered.

5-6. cumulative training & testing time(sec.)

The time spent by ACclassifier grew more slowly once about 1400 training documents had been entered.

6. Conclusions

1. Efficient mining of the contextual requirements for high-precision DC.
2. Incremental mining without reprocessing previous documents.
3. Evolutionary maintenance of the feature set.
4. Efficient and fault-tolerant hierarchical DC.

7.Personal Opinion

It’s acceptable on purity in hierarchy.

8.Review

Recommended