76
Text Categorization PengBo 10/31/2010

Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Embed Size (px)

Citation preview

Page 1: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Text Categorization

PengBo10/31/2010

Page 2: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

本次课大纲

Text Categorization Problem definition Build a Classifier

Naïve Bayes Classifier K-Nearest Neighbor Classifier

Evaluation

Page 3: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Definition

Given: 实例 instance, xX, where X is the instance

language or instance space. Issue: how to represent text documents.

固定的类别集合 categories: C = {c1, c2,…, cn}

Determine: The category of x : c(x)C,

where c(x) is a categorization function 分类函数 We want to know how to build categorization

functions (“classifiers 分类器” ).

Page 4: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Text Categorization Examples

Assign labels to each document or web-page: Labels are most often topics such as Yahoo-categories

e.g., "finance," "sports," "news>world>asia>business" Labels may be genres

e.g., "editorials" "movie-reviews" "news“ Labels may be opinion

e.g., “like”, “hate”, “neutral” Labels may be domain-specific binary

e.g., "interesting-to-me" : "not-interesting-to-me” e.g., “spam” : “not-spam” e.g., “contains adult language” :“doesn’t”

Page 5: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Classification Methods

人工分类 Manual classification Used by Yahoo!, Looksmart, about.com, ODP, Medline Accurate but expensive to scale

自动文本分类 Automatic document classification 基于规则: Hand-coded rule-based systems

Spam mail filter,… 有监督的学习: Supervised learning of a document-

label assignment function No free lunch: requires 人工标注的训练集 hand-

classified training data Note that many commercial systems use a

mixture of methods

Page 6: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Think about it…

How to represent text documents and categories? Vectors & Regions String & Language (models)

How to build categorization functions ? Closeness/Similarity to regions Probability to generate the

string/language model

Page 7: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

K-Nearest Neighbors

Page 8: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Classes in a Vector Space

Government

Science

Arts

Page 9: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Classification Using Vector Spaces

Each training doc a point (vector) labeled by its topic (= class)

Hypothesis: docs of the same class form a contiguous region of space

We define surfaces to delineate classes in space

Page 10: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Test Document = Government

Government

Science

Arts

Similarityhypothesistrue ingeneral?

Page 11: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

k Nearest Neighbor Classification

To classify document d into class c Define k-neighborhood N as k nearest

neighbors of d Count number of documents i in N that belong

to c Estimate P(c|d) as i/k Choose as class argmaxc P(c|d) [ = majority

class]

Page 12: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Example: k=6 (6NN)

Government

Science

Arts

P(science| )?

Page 13: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Nearest-Neighbor Learning Algorithm

Learning is just storing the representations of the training examples in D.

Testing instance x: Compute similarity between x and all examples

in D. Assign x the category of the most similar

example in D. Does not explicitly compute a generalization

or category prototypes. Also called:

Case-based learning Memory-based learning Lazy learning

Page 14: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Why K?

Using only the closest example to determine the categorization is subject to errors due to: A single atypical example. Noise (i.e. error) in the category label of a single

training example. More robust alternative is to find the k

most-similar examples and return the majority category of these k examples.

Value of k is typically odd to avoid ties; 3 and 5 are most common.

Page 15: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

kNN decision boundaries

Government

Science

Arts

Boundaries are in principle arbitrary surfaces – but usually polyhedra

Page 16: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Similarity Metrics

Nearest neighbor method depends on a similarity (or distance) metric. Simplest for continuous m-dimensional instance

space is Euclidean distance. Simplest for m-dimensional binary instance

space is Hamming distance (number of feature values that differ).

For text, cosine similarity of tf.idf weighted vectors is typically most effective.

Page 17: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Illustration of 3 Nearest Neighbor for Text Vector Space

Page 18: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Nearest Neighbor with Inverted Index

Naively finding nearest neighbors requires a linear search through |D| documents in collection

But determining k nearest neighbors is the same as determining the top-k best retrievals using the test document as a query to a database of training documents.

Use standard vector space inverted index methods to find the k nearest neighbors.

Testing Time: O(B|Vt|) where B is the average number of training documents in which a test-document word appears. Typically B << |D|

Page 19: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

kNN: Discussion

No training necessary No feature selection necessary Scales well with large number of classes

Don’t need to train n classifiers for n classes Classes can influence each other

Small changes to one class can have ripple effect

Scores can be hard to convert to probabilities

Page 20: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Naïve Bayes

Page 21: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Bayes Classifiers

Task: Classify a new instance D based on a tuple of attribute values into one of the classes cj C nxxxD ,,, 21

),,,|(argmax 21 njCc

MAP xxxcPcj

),,,(

)()|,,,(argmax

21

21

n

jjn

Cc xxxP

cPcxxxP

j

)()|,,,(argmax 21 jjnCc

cPcxxxPj

Page 22: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Naïve Bayes Assumption

P(cj) Can be estimated from the frequency of classes

in the training examples. P(x1,x2,…,xn|cj)

O(|X|n•|C|) parameters Could only be estimated if a very, very large

number of training examples was available.

Page 23: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Flu

X1 X2 X5X3 X4

feversinus coughrunnynose muscle-ache

Conditional Independence Assumption

Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).

features detect term presence and are independent of each other given the class:)|()|()|()|,,( 52151 CXPCXPCXPCXXP

Page 24: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Learning the Model

First attempt: maximum likelihood estimates simply use the frequencies in the data

)(

),()|(ˆ

j

jiiji cCN

cCxXNcxP

C

X1 X2 X5X3 X4 X6

N

cCNcP jj

)()(ˆ

Page 25: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

What if we have seen no training cases where patient had no flu and muscle aches?

Zero probabilities cannot be conditioned away, no matter the other evidence!

Problem with Max Likelihood

0)(

),()|(ˆ 5

5

nfCN

nfCtXNnfCtXP

i ic cxPcP )|(ˆ)(ˆmaxarg

Flu

X1 X2 X5X3 X4

feversinus coughrunnynose muscle-ache

)|()|()|()|,,( 52151 CXPCXPCXPCXXP

Page 26: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Smoothing to Eliminate Zeros

kcCN

cCxXNcxP

j

jiiji

)(

1),()|(ˆ

# of values of Xi

Add one smooth (Laplace smoothing) As a uniform prior (each attribute occurs once for

each class) that is then updated as evidence from the training data comes in.

Page 27: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Document Generative Model

“Love is patient, love is kind. “ Basic : bag of words a binary independence model

Multivariate binomial generation feature Xi is term Value Xi = 1 or 0, indicate term Xi present in doc

or not a multinomial unigram language model

Multinomial generation feature Xi is term position Value of Xi = term at position i position independent

Love is

patient

kind

Page 28: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Bernoulli Naive Bayes Classifiers

Multivariate binomial Model One feature Xw for each word in

dictionary Value Xw = true in document d if w

appears in d Naive Bayes assumption:

Given the document’s topic, appearance of one word in the document tells us nothing about chances that another word appears

Love is

patient

kind

Page 29: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Multinomial Naive Bayes Classifiers II

Multinomial = Class conditional unigram One feature Xi for each word pos in

document feature’s values are all words in

dictionary Value of Xi is the word in position i Naïve Bayes assumption:

Given the document’s topic, word in one position in the document tells us nothing about words in other positions

)|text""()|our""()(argmax

)|()(argmax

1j

j

jnjjCc

ijij

CcNB

cxPcxPcP

cxPcPc

Page 30: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Still too many possibilities Assume that classification is independent of

the positions of the words Second assumption:

Word appearance does not depend on position

Just have one multinomial feature predicting for all words

Use same parameters for each position

Multinomial Naive Bayes Classifiers

)|()|( cwXPcwXP ji for all positions i,j, word w, and class c

Page 31: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Parameter estimation

fraction of documents of topic cj

in which word w appears

Binomial model:

Multinomial model:

)|(ˆjw ctXP

fraction of times in which word w appears

across all documents of topic cj

)|(ˆji cwXP

Page 32: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Naive Bayes algorithm (Multinomial model)

Page 33: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Naive Bayes algorithm (Bernoulli model)

Page 34: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

NB Example

c(5)=?

Page 35: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

NB Example

c(5)=?

Page 36: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Multinomial NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) = China

Page 37: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

NB Example

c(5)=?

Page 38: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Bernoulli NB Classifier

Feature likelihood estimate

Posterior

Result: c(5) <> China

Page 39: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Classification

Multinomial vs Multivariate binomial?

Multinomial is in general better

Page 40: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Classification Evaluation

Page 41: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Let’s think about it…

How to do evaluation experiments for classifiers? Dataset Measures Generalization Performance

Training setTraining set

Test setTest set

RecallRecall PrecisionPrecision

F1F1 AccuracyAccuracy

Page 42: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Most (over)used data set 21578 documents 9603 training, 3299 test articles (ModApte

split) 118 categories

An article can be in more than one category Learn 118 binary category distinctions

Average document: about 90 types, 200 tokens

Average number of classes assigned 1.24 for docs with at least one category

Only about 10 out of 118 categories are largeCommon categories(#train, #test)

Classic Reuters Data Set

• Earn (2877, 1087) • Acquisitions (1650, 179)• Money-fx (538, 179)• Grain (433, 149)• Crude (389, 189)

• Trade (369,119)• Interest (347, 131)• Ship (197, 89)• Wheat (212, 71)• Corn (182, 56)

Page 43: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Measuring ClassificationFigures of Merit

Accuracy of classification Main evaluation criterion in academia

Speed of training statistical classifier Some methods are very cheap; some very

costly Speed of classification (docs/hour)

No big differences for most algorithms Exceptions: kNN, complex preprocessing

requirements Effort in creating training set/hand-built

classifier human hours/topic

Page 44: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Measuring ClassificationFigures of Merit

In the real world, economic measures: Your choices are:

Do no classification That has a cost (hard to compute)

Do it all manually Has an easy to compute cost if doing it like that now

Do it all with an automatic classifier Mistakes have a cost

Do it with a combination of automatic classification and manual review of uncertain/difficult/”new” cases

Commonly the last method is most cost efficient and is adopted

Page 45: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Per class evaluation measures

Recall: Fraction of docs in class i classified correctly:

Precision: Fraction of docs assigned class i that are actually about class i:

“Correct rate”: (1- error rate) Fraction of docs classified correctly:

j

ij

ii

c

c

j

ji

ii

c

c

jij

i

iii

c

c

A B C

A

B

C

Actual Class

Predictedclass

Page 46: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Measuring Classification

Overall error rate Not a good measure for small classes. Why?

Precision/recall for classification decisions F1 measure: 1/F1 = ½ (1/P + 1/R) Correct estimate of size of category

Why is this different? Stability over time / category drift Utility

Costs of false positives / false negatives may be different

For example, cost = tp-0.5fp

Page 47: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Generalization Performance

Generalization Performance Results can vary based on sampling error

due to different training and test sets. Average results over multiple training and

test sets (splits of the overall data) for the best results.

Ideally, test and training sets are independent on each trial.

But this would require too much labeled data.But this would require too much labeled data.

Page 48: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Good practice department

N-Fold Cross-Validation Partition data into N equal-sized disjoint

segments. Run N trials, each time using a different

segment of the data for testing, and training on the remaining N1 segments.

This way, at least test-sets are independent.

Report average classification accuracy over the N trials.

Typically, N = 10.

Page 49: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Good practice department II

Learning Curves Would like to know how performance varies with

the number of training instances. Learning curves plot classification accuracy on

independent test data (Y axis) versus number of training examples (X axis).

One can do both the above and produce learning curves averaged over multiple trials from cross-validation

Page 50: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

how to combine multiple measures?

If we have more than one class, how do we combine multiple performance measures into one quantity?

Macroaveraging: Compute performance for each class, then

average. Microaveraging:

Collect decisions for all classes, compute contingency table, evaluate.

Page 51: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Micro- vs. Macro-Averaging: Example

Truth: yes

Truth: no

Classifier: yes

10 10

Classifier: no

10 970

Truth: yes

Truth: no

Classifier: yes

90 10

Classifier: no

10 890

Truth: yes

Truth: no

Classifier: yes

100 20

Classifier: no

20 1860

Class 1 Class 2 Micro.Av. Table

Macroaveraged precision: (0.5 + 0.9)/2 = 0.7 Microaveraged precision: 100/120 = .83 Why this difference?

Page 52: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Exercise

Federalist papers 1787-1788 年间由

Hamilton, Jay and Madison 用笔名发表的 77 篇短文,来劝说NY 支持 US Constitution

其中 12 篇 papers 的作者存在争议

谁是作者? 谁是作者?

Page 53: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

53

Author identification

In 1964 Mosteller and Wallace* solved the problem Mosteller, Frederick and Wallace, David L.

1964. Inference and Disputed Authorship: The Federalist.

It’s a Text Catergorization Problem They identified 70 function words as good

candidates for authorship analysis Using statistical inference they concluded

the author was Madison

Feature Selection

Classifier

Page 54: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Function words for Author Identification

Page 55: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Function Words for Author Identification

Page 56: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Summary

Definition The category of x:

c(x)C K-Nearest Neighbor Naïve Bayes

Bayesian Methods Bernoulli NB classifier Multinomial NB classifier

Categorization Evaluation

Training data/Test data Over-fitting & Generalize

)|text""()|our""()(argmax

)|()(argmax

1j

j

jnjjCc

ijij

CcNB

cxPcxPcP

cxPcPc

Page 57: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Thank You!

Q&A

Page 58: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Readings

[1] IIR Ch13, Ch14.2 [2] Y. Yang and X. Liu, "A re-examination of

text categorization methods," presented at Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.

Page 59: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Bernoulli trial

Bernoulli trial is an experiment whose outcome is random and can be either of two possible outcomes, "success" and "failure".

Page 60: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Binomial Distribution

binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Page 61: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Multinomial Distribution

multinomial distribution is a generalization of the binomial distribution.

each trial results in one of some fixed finite number k of possible outcomes, with probabilities p1, ..., pk,

there are n independent trials.

We can use a random variable Xi to indicate the number of times outcome number i was observed over the n trials.

Page 62: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Bayes’ Rule

priorprior

posteriorposterior likelihoodlikelihood

Page 63: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Use Bayes Rule to Gamble

someone draws an envelope at random and offers to sell it to you. How much should you pay?

before deciding, you are allowed to see one bead drawn from the envelope. Suppose it’s red: How much should you pay?

Page 64: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Prosecutor's fallacy

You win the lottery jackpot. You are then charged with having cheated, for instance with having bribed lottery officials.

At the trial, the prosecutor points out that winning the lottery without cheating is extremely unlikely, and that therefore your being innocent must be comparably unlikely.

Page 65: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Maximum a posteriori Hypothesis

)|(argmax DhPhHh

MAP

)(

)()|(argmax

DP

hPhDP

Hh

)()|(argmax hPhDPHh

As P(D) isconstant

Page 66: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only

need to consider the P(D|h) term:)|(argmax hDPh

HhML

Page 67: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Likelihood

a likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed

Given a parameterized family of probability density functions

Where θ is the parameter, the likelihood function is

where x is the observed outcome of an experiment.

when f(x | θ) is viewed as a function of x with θ fixed, it is a probability density function,

when viewed as a function of θ with x fixed, it is a likelihood function.

)|( xfx

)|()|( xfxL

Page 68: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Reuters Text Categorization data set (Reuters-21578) document

<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">

<DATE> 2-MAR-1987 16:51:43.42</DATE>

<TOPICS><D>livestock</D><D>hog</D></TOPICS>

<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>

<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off

tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC.

Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said.

A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

&#3;</BODY></TEXT></REUTERS>

Page 69: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

New Reuters: RCV1: 810,000 docs

Top topics in Reuters RCV1

Page 70: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

北大天网 : 中文网页分类 通过动员不同专业的几十个学生,人工选取形成了一个基

于层次模型的大规模中文网页样本集。 包括 12,336 个训练网页实例和 3,269 个测试网页实例,分

布在 12 个大类 , 共计 733 个类别中,每个类别平均有 17个训练实例和 4.5 个测试实例

天网免费提供网页样本集给有兴趣的同行,燕穹产品号:YQ-WEBENCH-V0.8

中文信息检索论坛 www.cwirf.org 全国搜索引擎和网上信息挖掘学术研讨会 SEWM 上进行分

类评测

Page 71: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

北大天网 : 中文网页分类

类别编号

类别名称 类别数 训练样本数

测试样本数

1 人文与艺术 24 419 110

2 新闻与媒体 7 125 19

3 商业与经济 48 839 214

4 娱乐与休闲 88 1510 374

5 计算机与因特网 58 925 238

6 教育 18 286 85

7 各国风情 53 891 235

8 自然科学 113 1892 514

9 政府与政治 18 288 84

10 社会科学 104 1765 479

11 医疗与健康 136 2295 616

12 社会与文化 66 1101 301

共计 733 12336 3269

 

Page 72: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Concept Drift

Categories change over time Example: “president of the united states”

1999: clinton is great feature 2002: clinton is bad feature

One measure of a text classification system is how well it protects against concept drift.

Feature selection: can be bad in protecting against concept drift

Page 73: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier
Page 74: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier
Page 75: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Think about it…

Describe the process, can you tell what is it?

How do you do it? Why we talk about it here, or what does it

mean to Information Overloading?

鹰 肉鸡

Page 76: Text Categorization PengBo 10/31/2010. 本次课大纲 Text Categorization Problem definition Build a Classifier Naïve Bayes Classifier K-Nearest Neighbor Classifier

Recall: Vector Space Representation

Each document is a vector, one component for each term (= word).

Normalize to unit length. High-dimensional vector space:

Terms are axes 10,000+ dimensions, or even 100,000+ Docs are vectors in this space