CSC 9010: Text Mining Applications: Document-Based Techniques
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789
©2012 Paula Matuszek




Page 1

CSC 9010: Text Mining Applications: Document-Based Techniques
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu
Paula.Matuszek@gmail.com
(610) 647-9789

Page 2

Document Classification

Document classification
– Assign documents to pre-defined categories

Examples
– Process email into work, personal, junk
– Process documents from a newsgroup into “interesting”, “not interesting”, “spam and flames”
– Process transcripts of bugged phone calls into “relevant” and “irrelevant”

Issues
– Real-time?
– How many categories per document? Flat or hierarchical?
– Categories defined automatically or by hand?

Page 3

Document Classification

Usually
– relatively few categories
– well defined; a person could do the task easily
– categories don’t change quickly

Flat vs Hierarchy
– Simple classification is into mutually-exclusive document collections
– Richer classification is into a hierarchy with multiple inheritance
– broader and narrower categories
– documents can go in more than one place
– merges into search interfaces such as PubMed

Page 4

Classification -- Automatic

Statistical approaches: a set of “training” documents defines the categories
– Underlying representation of each document is derived from its text: the BOW and other features we discussed last time
– A classification model is trained using machine learning
– Individual documents are classified by applying the model

Requires relatively little effort to create categories
Accuracy is heavily dependent on the training examples
Typically limited to flat, mutually exclusive categories

Page 5

Classification: Manual

Natural language/linguistic techniques: categories are defined by people
– underlying representation of a document is typically a stream of tokens
– category description contains an ontology of terms and relations, plus pattern-matching rules
– individual documents are classified by pattern matching

Defining categories can be very time-consuming
Typically takes some experimentation to “get it right”
Can handle much more complex structures

Page 6

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Automatic Classification Framework

Documents → Preprocessing → Feature Extraction → Feature Filtering → Applying Classification Algorithms → Performance Measure

Page 7

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Preprocessing

• Preprocessing: transform documents into a representation suitable for the classification task
– Remove HTML or other tags
– Remove stop words
– Perform word stemming (remove suffixes)
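The steps above can be sketched in a few lines of Python. The stop-word list and suffix rules here are illustrative stand-ins, not a real stop list or a proper stemmer (a production system would use a full stop list and something like the Porter stemmer):

```python
import re

# Illustrative stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are", "for"}
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for true stemming

def preprocess(document):
    """Turn raw (possibly HTML) text into a list of stemmed content tokens."""
    text = re.sub(r"<[^>]+>", " ", document)       # strip HTML or other tags
    tokens = re.findall(r"[a-z]+", text.lower())   # lowercase word tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:                       # remove at most one suffix
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed
```

For example, `preprocess("<p>The cats are running in the garden</p>")` strips the tags and stop words and trims suffixes, leaving three content tokens.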

Page 8

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Feature Extraction

The most crucial decision you’ll make! Useful features depend on the task:
1. Topic
• words, phrases, ...?
2. Author
• stylistic features
3. Sentiment
• adjectives, ...?
4. Spam
• specialized vocabulary

Features must relate to the categories

Page 9

Based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm

Feature Filtering

• Feature selection: remove non-informative terms from documents
– improves classification effectiveness
– reduces computational complexity
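One simple, common way to do this filtering (a sketch; the thresholds are arbitrary illustrative choices) is by document frequency: drop terms that appear in too few documents to be reliable, or in so many that they carry little information:

```python
from collections import Counter

def filter_features(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms appearing in at least min_df documents but in no more
    than max_df_ratio of all documents. Very rare terms are unreliable;
    very common terms do not separate the classes."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))              # count documents, not occurrences
    return {t for t, c in df.items()
            if c >= min_df and c / n_docs <= max_df_ratio}
```

With four small documents, a term like "cat" that occurs in three of the four is dropped as too common, while terms in exactly half the documents survive.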

Page 10

Evaluation

We need to know how well our classification system is performing
– Recall: % of documents in a class which are correctly classified as that class
– ri = (correctly classified as i) / (total which are actually i)
– Precision: % of documents classified into a class which are actually in that class
– pi = (correctly classified as i) / (total classified as i)

Page 11

Corpus

[Figure: Venn diagram over the corpus. The set of documents classified into the category overlaps the set of documents actually in the category; the overlap is the correctly categorized documents.]

Page 12

Combined Effectiveness

• Ideally, we want a measure that combines both precision and recall
• F1 = 2pr / (p + r)
• If we accept nothing, recall is 0, so F1 = 0
• If we accept everything, precision collapses toward the base rate, so F1 is poor
• For perfect precision and recall, F1 = 1
• If either precision or recall drops, so does F1
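These definitions translate directly into code; here is a minimal sketch computing per-class precision, recall, and F1 from parallel lists of predicted and actual labels:

```python
def precision_recall_f1(predicted, actual, label):
    """Per-class precision, recall, and F1 from parallel label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == label and a == label)
    classified_as = sum(1 for p in predicted if p == label)   # denominator of p_i
    actually = sum(1 for a in actual if a == label)           # denominator of r_i
    precision = tp / classified_as if classified_as else 0.0
    recall = tp / actually if actually else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For instance, if three documents are labeled "spam" but only two of those are actually spam, and both real spam documents were caught, precision is 2/3, recall is 1, and F1 is 0.8.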

Page 13

Measuring Individual Features

If we have a large feature set, we may be interested in which features are actually useful
– Informative features: those which give us the biggest separation between two classes
– We can probably omit the least informative features without impacting performance
– Caution: correlation, not causation...

Page 14

Choice of Evaluation Measure

For many tasks, F1 gives the best overall measure
– sorting news stories
– deciding genre or author

But it depends on your domain
– spam filters: a false positive (good mail junked) is costly, so precision matters more
– flagging important email: a miss is costly, so recall matters more

Page 15

Evaluation: Overfitting

Training a model = predicting the classification for our training set, given the data in the set
Degrees of freedom: with 10 cases and 10 features, I can always predict perfectly
The model may capture chance variations in the set
This leads to overfitting: the model is too closely matched to the exact data set it’s been given
More likely with
– a large number of features
– small training sets

Page 16

Evaluation: Training and Test Sets

To avoid (or at least detect) overfitting, we always use separate training and test sets
The model is trained on one set of examples
Evaluation measures are calculated on a different set
The sets should be comparable, and each should be representative of the overall corpus
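A minimal split along these lines can be written as follows; the 25% test ratio and the fixed seed are illustrative choices, not prescriptions:

```python
import random

def train_test_split(examples, test_ratio=0.25, seed=42):
    """Shuffle and split labeled examples so the model is evaluated on
    documents it never saw during training. Shuffling first makes both
    halves representative of the corpus; seeding keeps the split
    reproducible."""
    rng = random.Random(seed)
    shuffled = examples[:]          # don't mutate the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]
```

Every example lands in exactly one of the two sets, which is what lets the test-set measurements detect overfitting.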

Page 17

Some Classification Methods

Common classification algorithms include
– nearest neighbor (KNN) methods
– decision trees
– naive Bayes classifiers
– linear classifiers (e.g., SVMs)

Page 18

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

K-Nearest-Neighbor Algorithm

• Principle: points (documents) that are close in the space belong to the same class

Page 19

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

K-Nearest-Neighbor Algorithm

• Measure the similarity between the test document and each neighbor
– count of words shared
– tf*idf variants
• Select the k nearest neighbors of the test document among the training examples
– more than 1 neighbor, to avoid the error of a single atypical training example
– k is typically 3 or 5
• Assign the test document to the class which contains most of the neighbors
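The steps above can be sketched as follows, using cosine similarity over raw word counts; this is one of several similarity choices (tf*idf weighting would slot in the same way), so treat it as an illustrative implementation rather than the definitive one:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_doc, training, k=3):
    """training: list of (tokens, label) pairs. Rank training documents
    by similarity to the test document, then vote among the top k."""
    test_bow = Counter(test_doc)
    ranked = sorted(training,
                    key=lambda ex: cosine(test_bow, Counter(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Note the slide's point about classification time: every test document is compared against the whole training set, which is why "training" is cheap but classification is slow.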

Page 20

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of KNN Algorithm

• Advantages:
– Effective
– Can handle large, sparse vectors
– “Training time” is short
– Can be incremental
• Disadvantages:
– Classification time is long
– Difficult to find the optimal value of k

Page 21

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Decision Tree Algorithm

• Decision tree associated with documents:
– The root node contains all documents
– Each internal node is a subset of the documents, separated according to one attribute
– Each arc is labeled with a predicate which can be applied to the attribute at the parent
– Each leaf node is labeled with a class

Page 22

Example Decision Tree

[Figure: an example tree. The root tests how many times the text contains “Villanova” (0, 1, or >1); 0 occurrences leads to the leaf Irrelevant, and the other branches lead to a test for “Wildcats”, with leaves Sports article, Academic article, and General article.]

Page 23

Decision Trees for Text

Each node tests a single variable, so trees are not useful for very large, very sparse vectors such as the BOW
Features might instead include
– other document characteristics, like diversity
– counts for a small subset of terms
– the most frequent terms
– terms with high tf*idf
– terms from a domain-based ontology

Page 24

Creating a Decision Tree

At each node, choose the function which provides maximum separation
If all examples at a new node are one class, stop for that node
Recur on each mixed node
Stop when no choice improves separation, or when you reach a predefined level
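One standard way to make “maximum separation” concrete is information gain, the criterion used by ID3-style trees. The sketch below assumes a simple presence/absence split on one feature; it is an illustration of the criterion, not the slides' prescribed method:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label). The gain from splitting
    on whether the feature is present: parent entropy minus the
    size-weighted entropy of the two children."""
    labels = [lab for _, lab in examples]
    with_f = [lab for feats, lab in examples if feats.get(feature)]
    without = [lab for feats, lab in examples if not feats.get(feature)]
    remainder = sum(len(part) / len(examples) * entropy(part)
                    for part in (with_f, without) if part)
    return entropy(labels) - remainder
```

At each node you would compute this gain for every candidate feature and split on the one with the highest value; a gain of 0 for every feature is the “no choice improves separation” stopping condition.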

Page 25

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of Decision Tree Algorithm

• Advantages:
– Easy to understand
– Easy to train
– Classification is fast
• Disadvantages:
– Training time is relatively expensive
– A document is only connected with one branch
– Once a mistake is made at a higher level, any subtree below it is wrong
– Not suited for very high dimensions

Page 26

Bayesian Methods

Based on probability; used widely in probabilistic learning and classification
Uses the prior probability of each category, given no information about an item
Categorization produces a posterior probability distribution over the possible categories, given a description of the item

Page 27

Naive Bayes

Bayes’ Theorem says we can determine the probability of an event C given another event x, based on
– the overall probability of event C
– the probability of event x given event C

P(C|x) = P(x|C) * P(C) / P(x)

Page 28

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Naive Bayes Algorithm

• Estimate the probability of each class for a document:
– Compute the posterior probability (Bayes’ rule)
– Assume word independence (the “naive” assumption)
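A minimal sketch of such a classifier, using log probabilities so the per-word products become sums, and add-one (Laplace) smoothing so unseen words do not zero out a class. The smoothing choice is an assumption of this sketch, not something the slides specify:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns log priors,
    per-class word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {c: math.log(n / total) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def classify_naive_bayes(tokens, priors, word_counts, vocab):
    """Score log P(C) + sum of log P(w|C); word independence is what
    lets us simply add the per-word log probabilities."""
    best, best_score = None, -math.inf
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)   # add-one smoothing
        score = priors[c] + sum(
            math.log((word_counts[c][w] + 1) / denom) for w in tokens)
        if score > best_score:
            best, best_score = c, score
    return best
```

Trained on a handful of "spam" and "ham" token lists, the classifier assigns a new document to whichever class gives it the higher posterior score.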

Page 29

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Analysis of Naive Bayes Algorithm

• Advantages:
– Works well on numeric and textual data
– Easy implementation and computation
– Has been effective in practice: the typical spam filter, for instance
• Disadvantages:
– The conditional independence assumption is in fact naive: it is usually violated by real-world data
– Performs poorly when features are highly correlated

Page 30

Linear Regression

Classic linear regression: predict the value of some variable based on a weighted sum of other variables
A very common statistical technique for prediction
e.g.: predict college GPA with a weighted sum of SAT verbal and quantitative scores, high school GPA, and a “high school quality” measure

Page 31

Linear Scoring Methods

A generalization of linear regression to much higher dimensionality
The goal is binary separation of instances into 2 classes
Best known is the SVM: support vector machine
– the classifier is a separating hyperplane
– the support vectors are the training points that define the plane

Page 32

Page 33

Based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt

Support Vector Machines

• Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH)

Page 34

SVMs

Advantages:
– Handle very large dimensionality
– Empirically, have been shown to work well with text classification
Disadvantages:
– Sensitive to noise, such as mislabeled training examples
– Binary only (but you can train multiple SVMs)
– Implementation is complex: the variety of implementation choices (similarity measure, kernel, etc.) can require extensive tuning

Page 35

Summary

Document classification is a common task
Manual rules provide outstanding results and allow complex structures, but are very expensive to implement
Automated methods use labeled cases to train a model
– Decision trees and decision rules are easy to understand, but require a good feature set tuned to the domain
– Nearest neighbor is simple to implement and quick to train, but slow to classify; it can handle incremental training cases
– Bayes is easy to implement and works well in some domains, but can have problems with highly correlated features
– SVMs are more complex to implement, but handle very large dimensionality well and have proven to be the best choice in many text domains