ApMl (All Purpose Machine Learning) Toolkit
David W. Miller and Helen Howell
Semantic Web Final Project, Spring 2002
Department of Computer Science, University of Georgia
www.cs.uga.edu/~miller/SemWeb
www.cs.uga.edu/~helen/SemWeb/SemWeb.html


What Has Been Done

• Extensive research into the effectiveness of machine learning algorithms has been performed
  – Train system on an expert-created taxonomy with expert-specified documents


What We Did

• Train system on a domain-specific taxonomy
  – E.g., CNN's Sports pages

• Test system's ability to correctly classify documents from a second, yet similar taxonomy
  – E.g., Yahoo! Sports pages


Automatic Text Classification via Statistical Methods

Text Categorization is the problem of assigning predefined categories to free text documents.

Statistical Learning Methods used in ApMl

• Bayes Method

• Rocchio Method (most popular)

• K-Nearest Neighbor Classification

• Probabilistic Indexing


A Probabilistic Generative Model

• Define a probabilistic generative model for documents with classes.

[Figure: "Bag-of-words" representation used by the Bayes model. Example document: "Reinforcement Learning: a Survey. This paper surveys the field of reinforcement learning from a computer science perspective." Word counts shown: a 35, block 1, computer 12, field 4, leg 1, machine 7, of 44, paper 3, perspective 2, rate 1, reinforcement 5, science 9, survey 2, the 56, this 11, underrated 1, …]

Automatic Text Classification through Machine Learning, McCallum et al.
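The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration, assuming a simple lowercase alphabetic tokenizer (the slides do not say how ApMl or rainbow tokenizes text); the function name `bag_of_words` is hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them.

    Word order is discarded: only the count of each word survives.
    The tokenizer here is an assumption, not the one used by rainbow.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = ("Reinforcement Learning: a Survey. This paper surveys the field of "
       "reinforcement learning from a computer science perspective.")
counts = bag_of_words(doc)
print(counts["reinforcement"], counts["learning"], counts["survey"])
```

Note that "survey" and "surveys" are counted as different words here; a real system would typically add stemming and stop-word removal.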


Bayes Method

Pick the most probable class, given the evidence:

$$c = \arg\max_{c_j} \Pr(c_j \mid d)$$

$c_j$ - a class (like “Planning”)

$d$ - a document (like “language intelligence proof...”)

Bayes Rule:

$$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$$

$\Pr(c_j \mid d)$ - Probability that category $c_j$ should be assigned to document $d$

Automatic Text Classification through Machine Learning, McCallum et al.


Bayes Rule

$$\Pr(c_j \mid d) = \frac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$$

$\Pr(c_j \mid d)$ - Probability that document $d$ belongs to category $c_j$

$\Pr(d)$ - Probability that a randomly picked document has the same attributes as $d$

$\Pr(c_j)$ - Probability that a randomly picked document belongs to category $c_j$

$\Pr(d \mid c_j)$ - Probability of observing document $d$ given that it belongs to category $c_j$
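A small worked example may make the rule concrete. The numbers below are invented purely for illustration (they are not ApMl results), and Pr(d) is obtained by summing Pr(c) Pr(d | c) over the two hypothetical categories.

```python
def posterior(prior, likelihood, evidence):
    """Bayes rule: Pr(c | d) = Pr(c) * Pr(d | c) / Pr(d)."""
    return prior * likelihood / evidence

# Hypothetical values, chosen only to illustrate the arithmetic.
priors = {"sports": 0.5, "health": 0.5}          # Pr(c)
likelihoods = {"sports": 0.08, "health": 0.02}   # Pr(d | c)

# Pr(d) by the law of total probability over the categories.
evidence = sum(priors[c] * likelihoods[c] for c in priors)

posteriors = {c: posterior(priors[c], likelihoods[c], evidence) for c in priors}
best = max(posteriors, key=posteriors.get)
print(best, posteriors)
```

Because Pr(d) is the same for every category, it only rescales the scores; the arg max can be found from Pr(c) Pr(d | c) alone.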


Bayes Method

• Generates conditional probabilities of particular words occurring in a document given it belongs to a particular category.

• A larger vocabulary generates better probability estimates

• Each category is given a threshold p, which it uses to judge whether a document belongs in that category.

• A document may fall into one category, more than one, or none.
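The points above can be sketched as a small multinomial naive Bayes classifier. This is a minimal sketch, not rainbow's implementation: Laplace (add-one) smoothing, the single shared threshold, and the names `train`, `posteriors`, and `assign` are all assumptions made for the example, and the training documents are toy data.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (tokens, category) pairs."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, cat in labeled_docs:
        class_docs[cat] += 1
        word_counts.setdefault(cat, Counter()).update(tokens)
        vocab.update(tokens)
    total = sum(class_docs.values())
    priors = {c: n / total for c, n in class_docs.items()}
    return priors, word_counts, vocab

def posteriors(tokens, priors, word_counts, vocab):
    """Pr(c | d) per category, with Laplace-smoothed Pr(word | c)."""
    log_scores = {}
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)
        score = math.log(priors[c])
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / denom)
        log_scores[c] = score
    # Normalise in log space; the normaliser plays the role of Pr(d).
    m = max(log_scores.values())
    z = sum(math.exp(s - m) for s in log_scores.values())
    return {c: math.exp(s - m) / z for c, s in log_scores.items()}

def assign(tokens, model, threshold):
    """Return every category whose posterior reaches the threshold p."""
    post = posteriors(tokens, *model)
    return sorted(c for c, p in post.items() if p >= threshold)

# Toy training data (hypothetical).
training = [
    ("touchdown quarterback football".split(), "football"),
    ("serve volley tennis".split(), "tennis"),
    ("putt birdie golf".split(), "golf"),
]
model = train(training)
print(assign("quarterback touchdown".split(), model, threshold=0.5))
```

Raising the threshold can leave a document unassigned, and lowering it can place the same document in several categories, which matches the one, several, or none behaviour described above.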


Rocchio Method

• Each document d is represented as a vector within a given vector space V:

$$\vec{d} = \left(d^{(1)}, \ldots, d^{(|F|)}\right)$$

• Documents with similar content have similar vectors

• Each dimension of the vector space represents a word selected via a feature selection process


Rocchio Method

• Values of d(i) for a document d are calculated as a combination of the statistics TF(w,d) and DF(w)

• TF(w,d) (Term Frequency) is the number of times word w occurs in a document d.

• DF(w) (Document Frequency) is the number of documents in which the word w occurs at least once.


Rocchio Method

• The inverse document frequency is calculated as

$$IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$$

  where |D| is the total number of documents

• Value of d(i) of feature wi for a document d is calculated as the product

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

• d(i) is called the weight of the word wi in the document d.
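The weighting above can be sketched as follows. This is an illustrative implementation only: the natural logarithm is an assumption (the slides do not give the base), the feature set is simply every word seen, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter

def document_frequency(docs):
    """DF(w): number of documents in which w occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df

def tfidf_vector(tokens, df, n_docs, features):
    """d(i) = TF(w_i, d) * IDF(w_i), with IDF(w) = log(|D| / DF(w))."""
    tf = Counter(tokens)
    return [tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
            for w in features]

# Toy collection (hypothetical).
docs = [
    ["golf", "putt", "golf"],
    ["tennis", "serve"],
    ["golf", "tennis", "score"],
]
df = document_frequency(docs)
features = sorted(df)  # golf, putt, score, serve, tennis
vec = tfidf_vector(docs[0], df, len(docs), features)
print(dict(zip(features, vec)))
```

A word that appears in every document gets IDF = log(1) = 0, so it contributes nothing to any document vector.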


Rocchio Method

• Based on word weight heuristics, the word wi is an important indexing term for a document d if it occurs frequently in that document

• However, words that occur frequently in many documents spanning many categories are rated as less important
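The slides describe how documents are weighted but not how a category is chosen. In the usual formulation of the Rocchio classifier, each category gets a prototype vector and a new document is assigned to the category whose prototype is most similar by cosine similarity. The sketch below assumes the simplest variant, where the prototype is just the mean of the category's training vectors (no negative-example term); the weight vectors and names are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def train_prototypes(labeled_vectors):
    """labeled_vectors: list of (weight_vector, category) pairs."""
    by_class = {}
    for vec, cat in labeled_vectors:
        by_class.setdefault(cat, []).append(vec)
    return {cat: centroid(vecs) for cat, vecs in by_class.items()}

def classify(vec, prototypes):
    """Pick the category whose prototype is most similar to vec."""
    return max(prototypes, key=lambda cat: cosine(vec, prototypes[cat]))

# Hypothetical weight vectors over the features (golf, putt, serve, tennis).
training = [
    ([2.0, 1.0, 0.0, 0.0], "golf"),
    ([1.0, 2.0, 0.0, 0.0], "golf"),
    ([0.0, 0.0, 1.0, 2.0], "tennis"),
    ([0.0, 0.0, 2.0, 1.0], "tennis"),
]
prototypes = train_prototypes(training)
label = classify([1.0, 0.0, 0.0, 0.5], prototypes)
print(label)
```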


K-Nearest Neighbor

• Features

– All instances correspond to points in an n-dimensional Euclidean space

– Classification is delayed till a new instance arrives

– Classification done by comparing feature vectors of the different points

– Target function may be discrete or real-valued

K-Nearest Neighbor Learning, Dipanjan Chakraborty


K-Nearest Neighbor

• An arbitrary instance is represented by (a1(x), a2(x), a3(x), ..., an(x))
  – ai(x) denotes features

• Euclidean distance between two instances

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left(a_r(x_i) - a_r(x_j)\right)^2}$$

• Find the k-nearest neighbors whose distance from your test case falls within a threshold p.

• If x of those k-nearest neighbors are in category ci, then assign the test case to ci, else it is unmatched.

K-Nearest Neighbor Learning, Dipanjan Chakraborty
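The procedure above can be sketched as follows. This is one reading of the slide, not the rainbow implementation: "x of those k-nearest neighbors" is interpreted as a minimum vote count (the hypothetical parameter `min_votes`), and an unmatched test case is returned as `None`. The training points are toy data.

```python
import math
from collections import Counter

def euclidean(a, b):
    """d(xi, xj) = sqrt(sum over r of (ar(xi) - ar(xj))^2)."""
    return math.sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))

def knn_classify(query, training, k, p, min_votes):
    """training: list of (feature_vector, category) pairs.

    Takes the k closest training points, drops any farther than p,
    and assigns the majority category only if it has at least
    min_votes supporters. Returns None when the case is unmatched.
    """
    candidates = sorted((euclidean(query, vec), cat) for vec, cat in training)
    neighbours = [cat for dist, cat in candidates[:k] if dist <= p]
    if not neighbours:
        return None
    cat, votes = Counter(neighbours).most_common(1)[0]
    return cat if votes >= min_votes else None

# Toy two-dimensional training set (hypothetical).
training = [
    ([0.0, 0.0], "golf"),
    ([0.1, 0.0], "golf"),
    ([0.0, 0.2], "golf"),
    ([1.0, 1.0], "tennis"),
    ([1.1, 1.0], "tennis"),
]
print(knn_classify([0.05, 0.05], training, k=3, p=0.5, min_votes=2))
print(knn_classify([5.0, 5.0], training, k=3, p=0.5, min_votes=2))
```

No model is built ahead of time: all of the work happens when a new instance arrives, which is the "classification is delayed" property listed among the features.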


Probabilistic Indexing

• Goal is to estimate P(C|si, dm)

– Probability that assignment of term si to the document dm is correct

• Once terms have been identified, assign Form Of Occurrence (FOC)
  – Certainty that term is correctly identified
  – Significance of term


Probabilistic Indexing Cont.

• If term t appears in document d and a term descriptor from t to s exists, where s is an indexing term, then generate a descriptor indicator

• Set of generated term descriptors can be evaluated and a probability calculated that document d lies in class c


ApMl Toolkit

• Built on top of and extends existing toolkits
  – rainbow (CMU) – Machine Learning
  – wget (GNU) – Web Crawler

• 4 Machine Learning Algorithms and 2 Classification Committees

• Web Crawler and Document Retrieval

• Automated Testing


Machine Learning Components

• 4 Machine Learning Algorithms (rainbow)
  – Naïve Bayes, Rocchio, KNN, Probabilistic Indexing

• 2 Classification Committees (ApMl)
  – Weight assigned for overall accuracy
  – Weights assigned for accuracy within each class of taxonomy
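The slides do not say how the committees combine their members' decisions. One plausible reading is a weighted vote: in the first committee each classifier's vote counts as much as its overall accuracy, and in the second it counts as much as that classifier's accuracy for the category it is voting for. The sketch below assumes exactly that; every name and number in it is hypothetical and none of the values are ApMl results.

```python
from collections import defaultdict

def committee_overall(predictions, overall_accuracy):
    """predictions: {classifier: predicted category}.

    Each vote is weighted by the classifier's overall accuracy.
    """
    scores = defaultdict(float)
    for clf, cat in predictions.items():
        scores[cat] += overall_accuracy[clf]
    return max(scores, key=scores.get)

def committee_per_class(predictions, class_accuracy):
    """Each vote is weighted by the classifier's accuracy on the
    category it predicted."""
    scores = defaultdict(float)
    for clf, cat in predictions.items():
        scores[cat] += class_accuracy[clf].get(cat, 0.0)
    return max(scores, key=scores.get)

# Hypothetical predictions and weights, for illustration only.
predictions = {"bayes": "golf", "rocchio": "tennis",
               "knn": "tennis", "prob": "golf"}
overall = {"bayes": 0.8, "rocchio": 0.7, "knn": 0.3, "prob": 0.6}
per_class = {
    "bayes":   {"golf": 0.6, "tennis": 0.9},
    "rocchio": {"golf": 0.7, "tennis": 0.9},
    "knn":     {"golf": 0.2, "tennis": 0.5},
    "prob":    {"golf": 0.5, "tennis": 0.7},
}
print(committee_overall(predictions, overall))
print(committee_per_class(predictions, per_class))
```

With these invented weights the two committees disagree, which shows why per-class weighting can matter when a classifier is strong overall but weak on a particular category.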


Document Retrieval

• Web Crawler and Document Retrieval
  – Specify starting URL
  – Specify recursion depth
  – Allow multiple domain spanning
  – Specify excluded domains
  – Store all retrieved pages into a single directory (ApMl)
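Each of the options above corresponds to a standard GNU wget flag. The sketch below builds such a command line in Python; how ApMl actually invokes wget is not shown in the slides, so the function name, the defaults, and the example URL are assumptions.

```python
def build_wget_command(start_url, depth, span_domains=False,
                       excluded_domains=(), output_dir="pages"):
    """Build a wget argument list for a recursive crawl.

    Hypothetical helper: it maps the crawler options onto wget flags.
    """
    cmd = [
        "wget",
        "--recursive",                        # follow links from the starting URL
        f"--level={depth}",                   # recursion depth
        "--no-directories",                   # do not recreate the site hierarchy
        f"--directory-prefix={output_dir}",   # store everything in one directory
    ]
    if span_domains:
        cmd.append("--span-hosts")            # allow the crawl to leave the start domain
    if excluded_domains:
        cmd.append("--exclude-domains=" + ",".join(excluded_domains))
    cmd.append(start_url)
    return cmd

cmd = build_wget_command("http://sports.yahoo.com", depth=2,
                         span_domains=True,
                         excluded_domains=["ads.example.com"])
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to perform the crawl; that call is left out here so the example has no network side effects.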


Automated Testing

• Choose Algorithms to Test

• Choose Test Directory

• Specify Number of Tests

• All results are placed into a persistent window for evaluation


Effectiveness: Contingency Table

                 Truth
                 Yes   No
System   Yes      a     b
         No       c     d

Machine Learning for Text Classification, David D. Lewis, AT&T Labs


Effectiveness Measures

• precision = a/(a+b)
  – Documents classified correctly vs. all classified as a particular category

• recall = a/(a+c)
  – Documents classified correctly vs. all that should have been classified in a category

• accuracy = (a+d)/(a+b+c+d)
  – All documents classified correctly as positive or negative in a category vs. all classified

                 Truth
                 Yes   No
System   Yes      a     b
         No       c     d

Machine Learning for Text Classification, David D. Lewis, AT&T Labs
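The three measures can be computed directly from the contingency counts a, b, c, and d. The sketch below derives the counts for one category from lists of true and system-assigned labels; the labels are made up for illustration, and returning 0.0 for an empty denominator is a convention chosen here, not something the slides specify.

```python
def contingency(truth, predicted, category):
    """Return (a, b, c, d) for one category.

    a: system yes, truth yes    b: system yes, truth no
    c: system no,  truth yes    d: system no,  truth no
    """
    a = b = c = d = 0
    for t, s in zip(truth, predicted):
        if s == category and t == category:
            a += 1
        elif s == category:
            b += 1
        elif t == category:
            c += 1
        else:
            d += 1
    return a, b, c, d

def precision(a, b):
    return a / (a + b) if (a + b) else 0.0

def recall(a, c):
    return a / (a + c) if (a + c) else 0.0

def accuracy(a, b, c, d):
    total = a + b + c + d
    return (a + d) / total if total else 0.0

# Hypothetical labels, for illustration only.
truth     = ["golf", "golf",   "tennis", "nba",  "golf"]
predicted = ["golf", "tennis", "tennis", "golf", "golf"]
a, b, c, d = contingency(truth, predicted, "golf")
print(a, b, c, d, precision(a, b), recall(a, c), accuracy(a, b, c, d))
```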


Test Plan

• Choose two areas and selected subcategories
  – Sports
    • Football
    • Tennis
    • Golf
    • NBA
  – Health
    • Children
    • Men
    • Women


Test Plan Continued

• Sports Web Sites
  – www.sportsillustrated.com
  – sports.yahoo.com
  – www.usatoday.com/sports/sfront.htm

• Health Web Sites
  – www.patient.co.uk
  – www.cdc.gov/health
  – www.bbc.co.uk/health


Test Plan Continued

• Train the system on pages from one taxonomy from one domain and test on another taxonomy for the same area

• Determine contingency tables for each category

• Compute effectiveness using precision, recall, and accuracy


Sports Test Results

[Figure: bar chart "ApMl Test Results". Precision and Recall (y-axis 0.00 to 0.90) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Health Test Results

[Figure: bar chart "ApMl Test Results". Precision and Recall (y-axis 0.00 to 0.35) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Comparison of Precision

[Figure: bar chart "ApMl Test Results". Sports Precision and Health Precision (y-axis 0 to 0.9) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Comparison of Recall

[Figure: bar chart "ApMl Test Results". Sports Recall and Health Recall (y-axis 0 to 0.8) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Comparison of Sports Additional Levels

[Figure: bar chart "ApMl Test Results". Sports Precision (50), Sports Recall (50), Sports Precision (200), and Sports Recall (200) (y-axis 0 to 0.9) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Comparison of Health Additional Levels

[Figure: bar chart "ApMl Test Results". Health Precision (30), Health Recall (30), Health Precision (60), and Health Recall (60) (y-axis 0 to 0.4) for Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Comparison of Accuracy

[Figure: bar chart "ApMl Test Results". Accuracy (y-axis 0 to 1) for Sports (50), Sports (200), Health (30), and Health (60), across Bayes, K Nearest, Rocchio, Prob, Com 1, and Com 2.]


Trends of Results

• K Nearest Neighbor effectiveness was significantly lower than other algorithms
  – continuously categorize the same

• The class of Health was much more difficult for the algorithms to correctly categorize
  – children’s health a non-gender class

• No improvement in our results with additional training


Conclusions

• Results of automatic text categorization are subjective

• Trends can occur because of various factors

• Heterogeneous taxonomies can be used for automatic classification with acceptable efficiencies

• More research needed


Resources

1. Dipanjan Chakraborty. “K-Nearest Neighbor Learning.” A PowerPoint presentation.

2. Norbert Fuhr and Ulrich Pfeifer. “Combining Model-Oriented and Description-Oriented Approaches for Probabilistic Indexing.” Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 46-56. ACM, New York, 1991.

3. Thorsten Joachims. “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization.” Technical Report, CMU, March 1996.

4. Fabrizio Sebastiani. “Machine Learning in Automated Text Categorization.” ACM Computing Surveys, 34(1):1-47, 2002.

5. Amit Sheth et al. “Semantic Web Content Management for Enterprises and the Web.” In submission to IEEE Internet Computing.