www.postitfirst.com
CLUSTERING
Clustering is the process of partitioning a set of data into a set of meaningful subclasses (clusters).
Every item in a subclass shares a common trait.
It helps a user understand the natural grouping or structure in a data set.
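The partitioning idea can be sketched with k-means, which is only one of many clustering algorithms and is not named on the slide. The function name, the naive choice of the first k points as starting centroids, and the sample points are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Partition points (tuples of numbers) into k clusters.

    Returns a list with one cluster index per point.
    """
    # Assumption: use the first k points as initial centroids (naive but deterministic).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Hypothetical data: two visually obvious groups.
sample_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
cluster_labels = kmeans(sample_points, k=2)
```

Points that end up with the same index share the trait "closest to the same centroid", which is the common trait in this particular sketch.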
Categorization
Classification is a technique used to predict group membership for data instances. For example, you may wish to use classification to predict whether the weather on a particular day will be “sunny”,
“rainy” or “cloudy”.
CLASSIFICATION
The goal of data classification is to organize and categorize data into distinct classes
  A model is first created based on the data distribution
  The model is then used to classify new data
  Given the model, a class can be predicted for new data
Classification Process
  Model Construction
  Model Evaluation
  Model Use
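Model construction and model use can be illustrated with a nearest-centroid classifier, chosen here only because it is short; the slides do not prescribe it. The function names, the two features (humidity and cloud cover), and the training values are hypothetical.

```python
from collections import defaultdict


def build_model(samples):
    """Model construction: compute one mean feature vector (centroid) per class."""
    sums, counts = {}, defaultdict(int)
    for features, label in samples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(model, features):
    """Model use: return the class whose centroid is closest to the new instance."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training data: (humidity %, cloud cover %) -> weather label.
training = [
    ((30, 10), "sunny"), ((35, 20), "sunny"),
    ((90, 95), "rainy"), ((85, 90), "rainy"),
    ((60, 70), "cloudy"), ((55, 75), "cloudy"),
]
model = build_model(training)
prediction = predict(model, (88, 92))
```

Model evaluation would sit between the two steps: predictions on held-out labeled data are compared with the known classes before the model is used on new data.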
Usually train-and-test
Exploit an existing collection in which documents have already been classified
  a portion is used as the training set
  another portion is used as a test set
    permits measurement of classifier effectiveness
    allows tuning of classifier parameters to yield maximum effectiveness
Single- vs. multi-label
  can one document be assigned to multiple categories?
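A minimal sketch of the train-and-test procedure for the single-label case follows. The helper names, the 25% test fraction, the toy collection, and the rule-based stand-in classifier are assumptions; accuracy is used as one simple effectiveness measure among several possible ones.

```python
import random


def train_test_split(labeled_docs, test_fraction=0.25, seed=0):
    """Shuffle an already-classified collection and split it into two portions."""
    docs = list(labeled_docs)
    random.Random(seed).shuffle(docs)  # fixed seed keeps the split repeatable
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]  # (training set, test set)


def accuracy(classify, test_set):
    """Fraction of test documents whose predicted label matches the known label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(1 for doc, label in test_set if classify(doc) == label)
    return correct / len(test_set)


# Hypothetical pre-classified collection of (document text, label) pairs.
collection = [
    ("alpha report one", "A"), ("alpha report two", "A"),
    ("alpha memo", "A"), ("alpha note", "A"),
    ("beta report one", "B"), ("beta report two", "B"),
    ("beta memo", "B"), ("beta note", "B"),
]
training_set, test_set = train_test_split(collection)


def rule_classifier(doc):
    # Stand-in classifier; a learned classifier would be fit on training_set instead.
    return "A" if "alpha" in doc else "B"


effectiveness = accuracy(rule_classifier, test_set)
```

In the multi-label case each document carries a set of labels, so the comparison inside `accuracy` would have to be replaced by a set-based measure.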
Manual (a.k.a. Knowledge Engineering)
  typically, rule-based expert systems
Machine Learning
  Probabilistic (e.g., Naïve Bayesian)
  Decision Structures (e.g., Decision Trees)
  Profile-Based
    compare document to profile(s) of subject classes
    similarity rules similar to those employed in I.R.
  Support Vector Machines (SVM)
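As an illustration of the probabilistic family, here is a small multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is a sketch under stated assumptions: whitespace tokenization, the function names, and the four training documents are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and term occurrences per class."""
    class_doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_doc_counts, term_counts, vocabulary


def classify_naive_bayes(model, text):
    """Return the class with the highest log prior plus summed log term likelihoods."""
    class_doc_counts, term_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_doc_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            if term not in vocabulary:
                continue  # terms never seen in training are ignored
            # Add-one smoothing avoids zero probabilities for unseen (term, class) pairs.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training documents.
nb_model = train_naive_bayes([
    ("rain clouds storm", "rainy"),
    ("heavy rain umbrella", "rainy"),
    ("clear sky sunshine", "sunny"),
    ("bright sunshine warm", "sunny"),
])
nb_prediction = classify_naive_bayes(nb_model, "rain and storm")
```

Working in log space keeps the product of many small probabilities from underflowing on longer documents.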
Assign to each document up to k terms drawn from a controlled vocabulary
Typically reduced to a multi-label classification problem
  each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
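The reduction to multi-label classification can be sketched as one scorer per keyword, followed by keeping at most k of the accepted keywords. Everything named here is hypothetical: the cue-term scorer, the 0.5 acceptance threshold, and the three-keyword vocabulary stand in for whatever per-class classifier and controlled vocabulary are actually used.

```python
def cue_term_scorer(cue_terms):
    """Build a toy per-keyword scorer: fraction of the cue terms found in the document."""
    def score(document):
        words = set(document.lower().split())
        return len(words & cue_terms) / len(cue_terms)
    return score


def assign_keywords(document, keyword_scorers, k, threshold=0.5):
    """Treat each keyword as a class; keep at most k keywords scoring at or above threshold."""
    scores = {keyword: scorer(document) for keyword, scorer in keyword_scorers.items()}
    accepted = [keyword for keyword, s in scores.items() if s >= threshold]
    accepted.sort(key=lambda keyword: scores[keyword], reverse=True)
    return accepted[:k]


# Hypothetical controlled vocabulary with one scorer per keyword.
scorers = {
    "propulsion": cue_term_scorer({"rocket", "engine"}),
    "materials": cue_term_scorer({"alloy", "composite"}),
    "optics": cue_term_scorer({"laser", "lens"}),
}
keywords = assign_keywords("rocket engine with composite nozzle", scorers, k=2)
```

A document may receive fewer than k keywords, or none, when too few scorers clear the threshold, which matches the "up to k" wording.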
Document Collection from DTIC
  10,000 documents
  previously classified manually
Taxonomy of
  25 broad subject fields, divided into a total of 251 narrower groups
Document lengths average 2705 ± 1464 words, 623 ± 274 significant unique terms.
Collection has 32,457 significant unique terms
Document Size Distribution
[Figure: histogram of the number of documents (vertical axis, 0 to 80) by words per document (horizontal axis, 1000-word bins from 0-1000 up to 8001 and above)]
Binary Classifier
  Finds the plane with the largest margin to separate the two classes of training samples
  Subsequently classifies items based on which side of the plane they fall
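The classification step described above can be sketched as a sign test on w·x + b, where w and b describe the separating plane. This sketch does not perform the training that finds the maximum-margin plane; the values of w and b and the sample points are assumed for illustration, and the helper names are hypothetical.

```python
import math


def decision_value(w, b, x):
    """Value of w.x + b; its sign tells which side of the plane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Return +1 for points on the positive side of the plane (or on it), otherwise -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, samples):
    """Smallest signed distance from the labeled samples to the plane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * decision_value(w, b, x) / norm for x, label in samples)


# Assumed plane x1 + x2 - 3 = 0 and hypothetical labeled training samples.
w, b = (1.0, 1.0), -3.0
samples = [((0, 0), -1), ((1, 1), -1), ((2, 2), 1), ((3, 3), 1)]
margin = geometric_margin(w, b, samples)
side = classify(w, b, (4, 0))
```

Training would search over w and b for the plane that makes `geometric_margin` as large as possible while keeping it positive for every training sample.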