
Page 1: Vector space classification

VECTOR SPACE CLASSIFICATION

Page 2: Vector space classification

INTRODUCTION

Each document is a vector, with one component for each term. A training set is a set of documents, each labelled with its class.

In vector space classification, this training set corresponds to a labelled set of points. Documents in the same class form a contiguous region of the space, and regions of different classes do not overlap.

We define surfaces to delineate the classes in the space.

Page 3: Vector space classification

DOCUMENTS IN VECTOR SPACE

Government

Science

Arts

Page 4: Vector space classification

TEST DOCUMENT: OF WHAT CLASS?

Government

Science

Arts

Page 5: Vector space classification

TEST DOCUMENT = GOVERNMENT

Government

Science

Arts

Page 6: Vector space classification

ASIDE: 2D/3D GRAPHS CAN BE MISLEADING

Page 7: Vector space classification

VECTOR SPACE CLASSIFICATION METHODS

Rocchio classification: divides the vector space into regions centered on centroids (centers of mass). Simple and efficient, but classes should be approximately spherical, with similar radii.

kNN classification: no explicit training is required. Less efficient, but can handle non-spherical and more complex classes better than Rocchio.

Two-class classifiers:
One-of task – a document must be assigned to exactly one of several mutually exclusive classes.
Any-of task – a document can be assigned to any number of classes.

Page 8: Vector space classification

USING ROCCHIO FOR TEXT CLASSIFICATION

The relevance feedback method is adapted; relevance feedback is itself a two-class classification task: relevant vs. non-relevant.

Use standard tf-idf weighted vectors to represent text documents.

For each category, compute a prototype vector by summing the vectors of all the training documents in that category. Prototype = centroid of the members of the class.

Assign test documents to the category with the closest prototype vector, based on cosine similarity.
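The procedure above can be sketched in a few lines. This is a minimal illustration that assumes documents have already been converted to weighted vectors; the toy 2-D vectors, class names, and function names are made up for the example:

```python
import numpy as np

def train_rocchio(X, y):
    """Compute one centroid (prototype vector) per class from the training vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify_rocchio(x, centroids):
    """Assign x to the class whose centroid has the highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(centroids, key=lambda c: cos(x, centroids[c]))

# Toy 2-D vectors standing in for tf-idf document vectors
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array(["gov", "gov", "sci", "sci"])

cents = train_rocchio(X, y)
print(classify_rocchio(np.array([0.95, 0.15]), cents))  # gov
```

Real tf-idf vectors are very high-dimensional and sparse, but the training and classification steps are the same.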

Page 9: Vector space classification

ILLUSTRATION OF ROCCHIO TEXT CATEGORIZATION

Page 10: Vector space classification

DEFINITION OF CENTROID

μ(c) = (1 / |Dc|) · Σ_{d ∈ Dc} v(d)

where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

Properties:
Forms a simple generalization of the examples in each class.
Classification is based on similarity to the class prototype.
Does not guarantee that classification is consistent with the given training data.
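As a quick numeric check of this definition, using two made-up three-term document vectors for a single class c:

```python
import numpy as np

# Hypothetical tf-idf vectors for the documents of one class c
Dc = np.array([[2.0, 0.0, 1.0],
               [0.0, 2.0, 1.0]])

# mu(c) = (1/|Dc|) * sum of v(d) over d in Dc, i.e. the component-wise mean
mu_c = Dc.sum(axis=0) / len(Dc)
print(mu_c)  # [1. 1. 1.]
```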

Page 11: Vector space classification

ROCCHIO ANOMALY

Prototype models face problems with polymorphic categories.

Page 12: Vector space classification

ROCCHIO CLASSIFICATION

Forms a simple representation for each class: the centroid/prototype.

Classification is based on the distance from the centroid.

Cheap to train and to test documents.

Rarely preferred outside text classification, but used quite effectively for text classification; generally worse than Naïve Bayes.

Page 13: Vector space classification

K NEAREST NEIGHBOR CLASSIFICATION

To classify document d into class c:
Define the k-neighborhood N as the k nearest neighbors of d.
Count the number i of documents in N that belong to c.
Estimate P(c|d) as i/k.
Choose as class argmax_c P(c|d) (= the majority class).
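A minimal sketch of these steps, assuming cosine similarity over pre-computed document vectors; the vectors and class labels are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X, y, k=3):
    """Estimate P(c|d) as (neighbors of x in class c) / k and return the argmax."""
    # Cosine similarity of x to every stored training vector
    sims = (X @ x) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x))
    neighbors = y[np.argsort(sims)[-k:]]   # labels of the k most similar documents
    return Counter(neighbors).most_common(1)[0][0]   # majority class

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.1, 1.0], [0.0, 0.9]])
y = np.array(["gov", "gov", "gov", "sci", "sci"])

print(knn_classify(np.array([0.7, 0.2]), X, y, k=3))  # gov
```

Note that "training" is nothing more than storing X and y; all the work happens at test time.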

Page 14: Vector space classification

NEAREST-NEIGHBOR LEARNING ALGORITHM

Learning is just storing the representations of the training examples in D.

Testing instance x (1NN): compute the similarity between x and all examples in D, and assign x the category of the most similar example in D.

Does not explicitly compute a category representation.

Also called: case-based learning, memory-based learning, lazy learning.

Rationale of kNN: the contiguity hypothesis (documents in the same class form a contiguous region, and regions of different classes do not overlap).

Page 15: Vector space classification

K NEAREST NEIGHBOR

Using only the single closest example to determine the class is subject to errors due to:
a single atypical example;
noise in the category label of a single training example.

Page 16: Vector space classification

SEPARATION BY HYPERPLANES

A strong high-bias assumption is linear separability:
in 2D, we can separate the classes by a line;
in higher dimensions, we need a hyperplane.

A separating hyperplane can be found by linear programming (or by a perceptron). In 2D it can be expressed as ax + by = c.

Page 17: Vector space classification

LINEAR PROGRAMMING/PERCEPTRON

Find a, b, c such that
ax + by > c for red points
ax + by < c for blue points
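A toy perceptron sketch of this search for a, b, c. The points, learning rate, and epoch count are illustrative assumptions; on linearly separable data the update rule converges to some separating line, not necessarily an optimal one:

```python
def perceptron(points, labels, epochs=100, lr=0.1):
    """Find (a, b, c) with a*x + b*y > c for label +1 and a*x + b*y < c for label -1."""
    a = b = c = 0.0
    for _ in range(epochs):
        for (x, y), t in zip(points, labels):
            if t * (a * x + b * y - c) <= 0:   # misclassified: nudge the line
                a += lr * t * x
                b += lr * t * y
                c -= lr * t
    return a, b, c

red  = [(2.0, 2.0), (3.0, 1.5)]    # want a*x + b*y > c
blue = [(0.5, 0.5), (1.0, 0.2)]    # want a*x + b*y < c
a, b, c = perceptron(red + blue, [1, 1, -1, -1])

assert all(a * x + b * y > c for x, y in red)
assert all(a * x + b * y < c for x, y in blue)
```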

Page 18: Vector space classification

WHICH HYPERPLANE

In general, there are lots of possible solutions for a, b, c.

Page 19: Vector space classification

WHICH HYPERPLANE

Lots of possible solutions for a, b, c.

Some methods find a separating hyperplane, but not necessarily the optimal one (e.g. the perceptron).

Which points should influence optimality?
All points: linear/logistic regression, Naïve Bayes.
Only difficult points close to the boundary: support vector machines.

Page 20: Vector space classification

LINEAR CLASSIFIERS

Many common text classifiers are linear classifiers: Naïve Bayes, perceptron, Rocchio, logistic regression, support vector machines.

Despite this similarity, their performance differs noticeably. For separable problems there are infinitely many separating hyperplanes. Which one do you choose? What do you do for non-separable problems? Different training methods pick different hyperplanes.

Page 21: Vector space classification

ROCCHIO IS A LINEAR CLASSIFIER

Page 22: Vector space classification

A NON-LINEAR PROBLEM

A linear classifier like Naïve Bayes does badly on this task.

kNN does well (assuming enough training data).

Page 23: Vector space classification

HIGH-DIMENSIONAL DATA

Pictures like the one on the right are misleading.

Documents are zero along almost all axes.

Most document pairs are very far apart (i.e. not strictly orthogonal, but sharing only very few common words).

In classification terms: document sets are often separable. This is part of why linear classifiers are quite successful in this domain.

Page 24: Vector space classification

MORE THAN TWO CLASSES

Any-of (or multivalue) classification: classes are independent of each other; a document can belong to any number of classes. Decomposes into n binary problems. Quite common for documents.

One-of (multinomial, or polytomous) classification: classes are mutually exclusive; each document belongs to exactly one class.

Page 25: Vector space classification

SET OF BINARY CLASSIFIERS

Any-of: build a separator between each class and its complementary set. Given a test document, evaluate it for membership in each class, and apply the decision criterion of each classifier independently.

One-of: build a separator between each class and its complementary set. Given a test document, evaluate it for membership in each class, and assign the document to the class with
maximum score,
maximum confidence, or
maximum probability.
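A sketch of the two decision rules, assuming one per-class score has already been produced by a binary classifier for each class; the scores, class names, and 0.5 threshold are hypothetical:

```python
# scores[c] = confidence that the document belongs to class c,
# e.g. the output of one binary classifier per class (made-up numbers)
scores = {"gov": 0.8, "sci": 0.6, "arts": 0.1}

# Any-of: apply each classifier's decision criterion independently
any_of = [c for c, s in scores.items() if s > 0.5]   # assumed threshold

# One-of: assign the single class with the maximum score
one_of = max(scores, key=scores.get)

print(any_of)   # ['gov', 'sci']
print(one_of)   # gov
```

The same n binary separators serve both tasks; only the combination rule differs.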
