Classified with Keywords

POST IT FIRST

www.postitfirst.com

Classified With Key Words


AUTOMATIC CLUSTERING & CLASSIFICATION



CLUSTERING
Clustering is the process of partitioning a set of data into a set of meaningful subclasses. Every data item in a subclass shares a common trait.
It helps a user understand the natural grouping or structure in a data set.
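
As a rough illustration of partitioning data into subclasses, here is a minimal sketch of k-means on one-dimensional values. The slides do not name a clustering algorithm, so k-means is an assumption chosen for brevity; the function name `kmeans_1d`, the sample word counts, and the starting centers are all hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Partition numbers into clusters around the given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return clusters, centers


# Hypothetical document lengths (in words) that form two natural groups.
lengths = [900, 1100, 1000, 5200, 4800, 5000]
clusters, centers = kmeans_1d(lengths, centers=[0, 6000])
```

The items in each resulting cluster share a common trait (similar length), and no labels were needed to find the grouping.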

Categorization

Classification is a technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.



CLASSIFICATION
The goal of data classification is to organize and categorize data into distinct classes
  A model is first created based on the data distribution
  The model is then used to classify new data
  Given the model, a class can be predicted for new data

Classification Process
  Model Construction
  Model Evaluation
  Model Use
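
The three phases of the classification process can be sketched with a very small word-count classifier. This is only an illustration under stated assumptions: the function names (`build_model`, `evaluate`, `classify`), the two class labels, and the sample documents are hypothetical and do not come from the slides.

```python
from collections import Counter


def build_model(labeled_docs):
    """Model construction: build one word-count profile per class."""
    profiles = {}
    for text, label in labeled_docs:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles


def classify(profiles, text):
    """Model use: pick the class whose profile shares the most words with the text."""
    words = text.lower().split()
    # sorted() makes tie-breaking deterministic (alphabetical by label).
    return max(sorted(profiles),
               key=lambda label: sum(profiles[label][w] for w in words))


def evaluate(profiles, labeled_docs):
    """Model evaluation: fraction of held-out documents labeled correctly."""
    correct = sum(classify(profiles, text) == label for text, label in labeled_docs)
    return correct / len(labeled_docs)


# Hypothetical, already-classified documents.
training = [
    ("radar antenna signal", "electronics"),
    ("signal amplifier circuit", "electronics"),
    ("alloy corrosion steel", "materials"),
    ("steel fatigue alloy", "materials"),
]
held_out = [("antenna circuit", "electronics"), ("corrosion fatigue", "materials")]

model = build_model(training)                         # Model Construction
accuracy = evaluate(model, held_out)                  # Model Evaluation
prediction = classify(model, "steel alloy coating")   # Model Use
```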



Usually train-and-test
Exploit an existing collection in which documents have already been classified
  a portion used as the training set
  another portion used as a test set
    permits measurement of classifier effectiveness
    allows tuning of classifier parameters to yield maximum effectiveness

Single- vs. multi-label
  can 1 document be assigned to multiple categories?
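
The train-and-test idea can be sketched as follows: split an already-classified collection, then measure effectiveness on the test portion for several parameter settings. Everything here is a toy assumption for illustration: the "classifier" is just a word-count threshold, and the labels, counts, candidate thresholds, and function names are hypothetical.

```python
import random


def split_collection(docs, test_fraction=0.2, seed=0):
    """Shuffle a copy of the collection and return (training set, test set)."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def effectiveness(threshold, labeled):
    """Fraction of (word_count, label) pairs that a length threshold labels correctly."""
    hits = sum(("report" if count >= threshold else "memo") == label
               for count, label in labeled)
    return hits / len(labeled)


# Hypothetical collection of (word count, manually assigned label) pairs.
collection = [
    (800, "memo"), (900, "memo"), (1000, "memo"), (1200, "memo"), (1500, "memo"),
    (3000, "report"), (3500, "report"), (4000, "report"), (4500, "report"),
    (5000, "report"),
]

train, test = split_collection(collection)

# Measure effectiveness on the test portion for each candidate parameter value,
# then keep the value that scores highest.
scores = {t: effectiveness(t, test) for t in (500, 2000, 6000)}
best_threshold = max(scores, key=scores.get)
```

In a real system the training portion would be used to fit the model; the toy threshold rule has nothing to learn, so only the test portion is exercised here.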



Manual (a.k.a. Knowledge Engineering)
  typically, rule-based expert systems

Machine Learning
  Probabilistic (e.g., Naïve Bayesian)
  Decision Structures (e.g., Decision Trees)
  Profile-Based
    compare document to profile(s) of subject classes
    similarity rules similar to those employed in I.R.
  Support Vector Machines (SVM)



Assign to each document up to k terms drawn from a controlled vocabulary

Typically reduced to a multi-label classification problem

  each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
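
One way to picture the reduction to multi-label classification: run one binary classifier per keyword, then keep at most k keywords whose classifier accepts the document. The sketch below assumes each per-keyword classifier has already produced a score between 0 and 1; the function name `assign_keywords`, the 0.5 threshold, and the sample scores are hypothetical.

```python
def assign_keywords(scores, k, threshold=0.5):
    """Return up to k keywords whose per-keyword classifier score meets the threshold."""
    accepted = [(score, keyword) for keyword, score in scores.items()
                if score >= threshold]
    # Highest score first; ties broken alphabetically for a stable result.
    accepted.sort(key=lambda pair: (-pair[0], pair[1]))
    return [keyword for _, keyword in accepted[:k]]


# Hypothetical scores from four per-keyword binary classifiers for one document.
scores = {"radar": 0.91, "antennas": 0.74, "corrosion": 0.08, "signal processing": 0.66}
keywords = assign_keywords(scores, k=2)
```

Because k is only an upper bound, a document for which no classifier passes the threshold receives no keywords at all.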



Document Collection from DTIC
  10,000 documents
  previously classified manually

Taxonomy of
  25 broad subject fields, divided into a total of 251 narrower groups

Document lengths average 2705 ± 1464 words, 623 ± 274 significant unique terms.

Collection has 32457 significant unique terms



Document Size Distribution

[Figure: histogram of the number of documents (y-axis, 0 to 80) by words per document (x-axis, bins 0-1000, 1001-2000, 2001-3000, 3001-4000, 4001-5000, 5001-6000, 6001-7000, 7001-8000, 8001-)]



Binary Classifier
  Finds the plane with the largest margin to separate the two classes of training samples
  Subsequently classifies items based on which side of the plane they fall
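
The "which side of the plane" rule can be sketched as the sign of w·x + b. The code below does not train anything: the weight vector `w` and offset `b` are hypothetical values standing in for a plane that training would have found, and the margin formula 2/||w|| assumes the usual scaling in which the closest training samples satisfy |w·x + b| = 1.

```python
import math


def decision_value(w, b, x):
    """Signed score of x relative to the plane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def side_of_plane(w, b, x):
    """Classify x as +1 or -1 according to the side of the plane it falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical plane in two dimensions.
w, b = [3.0, 4.0], -5.0

label_far = side_of_plane(w, b, [3.0, 4.0])     # 9 + 16 - 5 = 20, positive side
label_origin = side_of_plane(w, b, [0.0, 0.0])  # -5, negative side
width = margin_width(w)                         # 2 / 5
```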



Thank You So Much, All My Dear Friends