CSC 9010: Text Mining Applications: Document-Based Techniques
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789
©2012 Paula Matuszek




Page 1

CSC 9010: Text Mining Applications: Document-Based Techniques
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789

Page 2

Document Classification

Document classification
– Assign documents to pre-defined categories

Examples
– Process email into work, personal, junk
– Process documents from a newsgroup into “interesting”, “not interesting”, “spam and flames”
– Process transcripts of bugged phone calls into “relevant” and “irrelevant”

Issues
– Real-time?
– How many categories per document? Flat or hierarchical?
– Categories defined automatically or by hand?

Page 3

Document Classification

Usually
– relatively few categories
– well defined; a person could do the task easily
– categories don’t change quickly

Flat vs Hierarchy
– Simple classification is into mutually-exclusive document collections
– Richer classification is into a hierarchy with multiple inheritance
– broader and narrower categories
– documents can go in more than one place
– merges into search interfaces such as PubMed

Page 4

Classification -- Automatic

Statistical approaches: a set of “training” documents defines the categories
– Underlying representation of each document is derived from its text: the BOW and other features we discussed last time
– A classification model is trained using machine learning
– Individual documents are classified by applying the model

Requires relatively little effort to create categories
Accuracy is heavily dependent on the training examples
Typically limited to flat, mutually exclusive categories

Page 5

Classification: Manual

Natural language/linguistic techniques: categories are defined by people
– underlying representation of a document is typically a stream of tokens
– category description contains an ontology of terms and relations, plus pattern-matching rules
– individual documents are classified by pattern matching

Defining categories can be very time-consuming
Typically takes some experimentation to “get it right”
Can handle much more complex structures

Page 6

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Automatic Classification Framework

Documents → Preprocessing → Feature Extraction → Feature Filtering → Applying Classification Algorithms → Performance Measure

Page 7

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Preprocessing

• Preprocessing: transform documents into a representation suitable for the classification task
– Remove HTML or other tags
– Remove stop words
– Perform word stemming (remove suffixes)
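The steps above can be sketched in a few lines of Python. The stop-word list and suffix rules here are illustrative stand-ins, not a real stop list or a proper stemmer (a production system would use a full stop list and something like the Porter stemmer):

```python
import re

# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for true stemming

def preprocess(document):
    """Turn raw (possibly HTML) text into a list of stemmed content tokens."""
    text = re.sub(r"<[^>]+>", " ", document)       # strip HTML or other tags
    tokens = re.findall(r"[a-z]+", text.lower())   # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:                       # remove at most one suffix
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed
```

For example, `preprocess("<p>The cats are running in the garden</p>")` strips the tags and stop words and trims suffixes, leaving three content tokens.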

Page 8

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Feature Extraction

The most crucial decision you’ll make! Useful features depend on the task:
1. Topic
• words, phrases, ...?
2. Author
• stylistic features
3. Sentiment
• adjectives, ...?
4. Spam
• specialized vocabulary

Features must relate to the categories

Page 9

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Feature Filtering

• Feature selection: remove non-informative terms from documents
– improves classification effectiveness
– reduces computational complexity
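One simple, common way to do this filtering (a sketch; the thresholds are arbitrary illustrative choices) is by document frequency: drop terms that appear in too few documents to be reliable, or in so many that they carry little information:

```python
from collections import Counter

def filter_features(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms appearing in at least min_df documents but in no more
    than max_df_ratio of all documents. Very rare terms are unreliable;
    very common terms do not separate the classes."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))              # count documents, not occurrences
    return {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
```

With four small documents, a term like "cat" that occurs in three of the four is dropped as too common, while terms in exactly half the documents survive.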

Page 10

Evaluation

We need to know how well our classification system is performing
– Recall: % of documents in a class which are correctly classified as that class
– ri = (correctly classified as i) / (total which are actually i)
– Precision: % of documents classified into a class which are actually in that class
– pi = (correctly classified as i) / (total classified as i)

Page 11

Corpus

[Figure: Venn diagram over the corpus. The set of documents classified into the category overlaps the set of documents actually in the category; the overlap is the correctly categorized documents.]

Page 12

Combined Effectiveness

• Ideally, we want a measure that combines both precision and recall
• F1 = 2pr / (p + r)
• If we accept nothing, recall is 0, so F1 = 0
• If we accept everything, precision collapses toward the base rate, so F1 is poor
• For perfect precision and recall, F1 = 1
• If either precision or recall drops, so does F1
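These definitions translate directly into code; here is a minimal sketch computing per-class precision, recall, and F1 from parallel lists of predicted and actual labels:

```python
def precision_recall_f1(predicted, actual, label):
    """Per-class precision, recall, and F1 from parallel label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == label and a == label)
    classified_as = sum(1 for p in predicted if p == label)   # denominator of p_i
    actually = sum(1 for a in actual if a == label)           # denominator of r_i
    precision = tp / classified_as if classified_as else 0.0
    recall = tp / actually if actually else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, if three documents are labeled "spam" but only two of those are actually spam, and both real spam documents were caught, precision is 2/3, recall is 1, and F1 is 0.8.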

Page 13

Measuring Individual Features

If we have a large feature set, we may be interested in which features are actually useful
– Informative features: those which give us the biggest separation between two classes
– We can probably omit the least informative features without impacting performance
– Caution: correlation, not causation...

Page 14

Choice of Evaluation Measure

For many tasks, F1 gives the best overall measure
– sorting news stories
– deciding genre or author

But it depends on your domain
– spam filters: a false positive (good mail junked) is costly, so precision matters more
– flagging important email: a miss is costly, so recall matters more

Page 15

Evaluation: Overfitting

Training a model = predicting the classification for our training set, given the data in the set
Degrees of freedom: with 10 cases and 10 features, I can always predict perfectly
The model may capture chance variations in the set
This leads to overfitting: the model is too closely matched to the exact data set it’s been given
More likely with
– a large number of features
– small training sets

Page 16

Evaluation: Training and Test Sets

To avoid (or at least detect) overfitting, we always use separate training and test sets
The model is trained on one set of examples
Evaluation measures are calculated on a different set
The sets should be comparable, and each should be representative of the overall corpus
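A minimal split along these lines can be written as follows; the 25% test ratio and the fixed seed are illustrative choices, not prescriptions:

```python
import random

def train_test_split(examples, test_ratio=0.25, seed=42):
    """Shuffle and split labeled examples so the model is evaluated on
    documents it never saw during training. Shuffling first makes both
    halves representative of the corpus; seeding keeps the split
    reproducible."""
    rng = random.Random(seed)
    shuffled = examples[:]          # don't mutate the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Every example lands in exactly one of the two sets, which is what lets the test-set measurements detect overfitting.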

Page 17

Some Classification Methods

Common classification algorithms include
– nearest neighbor (KNN) methods
– decision trees
– naive Bayes classifiers
– linear classifiers (e.g., SVMs)

Page 18

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

K-Nearest-Neighbor Algorithm

• Principle: points (documents) that are close in the space belong to the same class

Page 19

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

K-Nearest-Neighbor Algorithm

• Measure the similarity between the test document and each neighbor
– count of words shared
– tf*idf variants
• Select the k nearest neighbors of the test document among the training examples
– more than 1 neighbor, to avoid the error of a single atypical training example
– k is typically 3 or 5
• Assign the test document to the class which contains most of the neighbors
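The steps above can be sketched as follows, using cosine similarity over raw word counts; this is one of several similarity choices (tf*idf weighting would slot in the same way), so treat it as an illustrative implementation rather than the definitive one:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_doc, training, k=3):
    """training: list of (tokens, label) pairs. Rank training documents
    by similarity to the test document, then vote among the top k."""
    test_bow = Counter(test_doc)
    ranked = sorted(training,
                    key=lambda ex: cosine(test_bow, Counter(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note the slide's point about classification time: every test document is compared against the whole training set, which is why "training" is cheap but classification is slow.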

Page 20

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of KNN Algorithm

• Advantages:
– Effective
– Can handle large, sparse vectors
– “Training time” is short
– Can be incremental
• Disadvantages:
– Classification time is long
– Difficult to find the optimal value of k

Page 21

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Decision Tree Algorithm

• Decision tree associated with documents:
– The root node contains all documents
– Each internal node is a subset of the documents, separated according to one attribute
– Each arc is labeled with a predicate which can be applied to the attribute at the parent
– Each leaf node is labeled with a class

Page 22

Example Decision Tree

[Figure: an example tree. The root tests how many times the text contains “Villanova” (0, 1, or >1); 0 occurrences leads to the leaf Irrelevant, and the other branches lead to a test for “Wildcats”, with leaves Sports article, Academic article, and General article.]

Page 23

Decision Trees for Text

Each node tests a single variable, so trees are not useful for very large, very sparse vectors such as the BOW
Features might instead include
– other document characteristics, like diversity
– counts for a small subset of terms
– the most frequent terms
– terms with high tf*idf
– terms from a domain-based ontology

Page 24

Creating a Decision Tree

At each node, choose the function which provides maximum separation
If all examples at a new node are one class, stop for that node
Recur on each mixed node
Stop when no choice improves separation, or when you reach a predefined level
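One standard way to make “maximum separation” concrete is information gain, the criterion used by ID3-style trees. The sketch below assumes a simple presence/absence split on one feature; it is an illustration of the criterion, not the slides' prescribed method:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label). The gain from splitting
    on whether the feature is present: parent entropy minus the
    size-weighted entropy of the two children."""
    labels = [lab for _, lab in examples]
    with_f = [lab for feats, lab in examples if feats.get(feature)]
    without = [lab for feats, lab in examples if not feats.get(feature)]
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in (with_f, without) if part)
    return entropy(labels) - remainder
```

At each node you would compute this gain for every candidate feature and split on the one with the highest value; a gain of 0 for every feature is the “no choice improves separation” stopping condition.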

Page 25

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of Decision Tree Algorithm

• Advantages:
– Easy to understand
– Easy to train
– Classification is fast
• Disadvantages:
– Training time is relatively expensive
– A document is only connected with one branch
– Once a mistake is made at a higher level, any subtree below it is wrong
– Not suited for very high dimensions

Page 26

Bayesian Methods

Based on probability; used widely in probabilistic learning and classification
Uses the prior probability of each category, given no information about an item
Categorization produces a posterior probability distribution over the possible categories, given a description of the item

Page 27

Naive Bayes

Bayes’ Theorem says we can determine the probability of an event C given another event x, based on
– the overall probability of event C
– the probability of event x given event C

P(C|x) = P(x|C) * P(C) / P(x)

Page 28

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Naive Bayes Algorithm

• Estimate the probability of each class for a document:
– Compute the posterior probability (Bayes’ rule)
– Assume word independence (the “naive” assumption)
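A minimal sketch of such a classifier, using log probabilities so the per-word products become sums, and add-one (Laplace) smoothing so unseen words do not zero out a class. The smoothing choice is an assumption of this sketch, not something the slides specify:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns log priors,
    per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify_naive_bayes(tokens, priors, word_counts, vocab):
    """Score log P(C) + sum of log P(w|C); word independence is what
    lets us simply add the per-word log probabilities."""
    best, best_score = None, -math.inf
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
        score = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom) for w in tokens)
        if score > best_score:
            best, best_score = c, score
    return best
```

Trained on a handful of "spam" and "ham" token lists, the classifier assigns a new document to whichever class gives it the higher posterior score.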

Page 29

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of Naive Bayes Algorithm

• Advantages:
– Works well on numeric and textual data
– Easy implementation and computation
– Has been effective in practice: the typical spam filter, for instance
• Disadvantages:
– The conditional independence assumption is in fact naive: it is usually violated by real-world data
– Performs poorly when features are highly correlated

Page 30

Linear Regression

Classic linear regression: predict the value of some variable based on a weighted sum of other variables
A very common statistical technique for prediction
e.g.: predict college GPA with a weighted sum of SAT verbal and quantitative scores, high school GPA, and a “high school quality” measure

Page 31

Linear Scoring Methods

A generalization of linear regression to much higher dimensionality
The goal is binary separation of instances into 2 classes
Best known is the SVM: support vector machine
– the classifier is a separating hyperplane
– the support vectors are the training points that define the plane

Page 32

Page 33

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Support Vector Machines

• Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH)

Page 34

SVMs

Advantages:
– Handle very large dimensionality
– Empirically, have been shown to work well with text classification
Disadvantages:
– Sensitive to noise, such as mislabeled training examples
– Binary only (but you can train multiple SVMs)
– Implementation is complex: the variety of implementation choices (similarity measure, kernel, etc.) can require extensive tuning

Page 35

Summary

Document classification is a common task
Manual rules provide outstanding results and allow complex structures, but are very expensive to implement
Automated methods use labeled cases to train a model
– Decision trees and decision rules are easy to understand, but require a good feature set tuned to the domain
– Nearest neighbor is simple to implement and quick to train, but slow to classify; it can handle incremental training cases
– Bayes is easy to implement and works well in some domains, but can have problems with highly correlated features
– SVMs are more complex to implement, but handle very large dimensionality well and have proven to be the best choice in many text domains