Automated categorization of document content: the

1

GammaWare Technology

June 2002

Yiftach Ravid, VP R&D

GammaSite Inc.

yiftach@GammaSite.com

2

Overview

- The challenge

- Taxonomies

- Classification

- Categorization

- Focused Crawler

- Q&A

3

The challenge: Generate Structured Taxonomies of text repositories

Generate a structured taxonomy of huge text repositories

[Diagram labels: sources (XML, Word, Domino, Web, Catalogues, Forms, Mail, Internal DB); Structured Data and Unstructured Data; Business, Relevant Content; Information, Application, Services]

4

Taxonomy

5

What is a Taxonomy

Taxonomy: from the Greek taxis (arrangement or division) and nomos (law)

The science of classification according to a pre-determined system

The best-known use of taxonomy is in biology: taxonomies of animals and plants

6

Web Taxonomy

Best-known use of taxonomies: Web portals or directories, in which Internet sites are classified into hierarchical topics

General:
• Yahoo! http://www.yahoo.com/
• Open Directory http://www.dmoz.org/
• LookSmart http://www.looksmart.com/r?country=uk

Topical:
• Business.Com http://www.business.com/
• HealthWeb http://www.healthweb.org/
• Education Planet http://www.educationplanet.com/

7

Taxonomy - Sample

8

Taxonomy vs. Thesaurus

| Criteria | Taxonomy | Thesaurus |
|---|---|---|
| Focus | Documents and their organization | Terms used in the organization |
| Usage | Classification of documents: documents are classified into categories/terms | Indexing of documents: terms are attached to documents |
| Retrieval | Mainly browsing | Keyword queries |
| Size | Restricted to the necessary terms | Very large (terms may be added freely) |

9

Classification

10

What is a Classifier

Concept (topic, subject): an abstract or generic idea generalized from particular instances [Merriam Webster]

Classifier: a function on a concept (category) and on an object (document)
• Returns a number between 0 and 1 called the confidence rate
• Confidence rate: a measure of the confidence that the object (document) belongs to (should be classified under) the concept (category)
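The definition above can be sketched in a few lines of Python. This is only an illustration of the "function of (category, document) returning a confidence rate" idea, not GammaWare's classifier: the keyword lists and the overlap-based score are assumptions made up for the example.

```python
from typing import Callable

# A classifier maps (category, document) to a confidence rate in [0, 1].
Classifier = Callable[[str, str], float]

# Hypothetical keyword lists, invented for illustration only.
KEYWORDS = {
    "tea": {"tea", "herbal", "brew", "leaf"},
    "pets": {"dog", "cat", "puppy", "kitten"},
}


def keyword_classifier(category: str, document: str) -> float:
    """Return the fraction of the category's keywords found in the document."""
    keywords = KEYWORDS.get(category, set())
    if not keywords:
        return 0.0  # unknown category: no confidence at all
    words = set(document.lower().split())
    return len(keywords & words) / len(keywords)


# The function satisfies the Classifier signature defined above.
classifier: Classifier = keyword_classifier
confidence = classifier("tea", "Herbal tea specialist")
```

A real classifier would be learned from examples rather than hand-written, but it would expose the same interface: one confidence rate per (category, document) pair.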

11

Methods for Automatic Classification

Rule based
• Pre-defined set of rules
• Advantage: incorporates prior knowledge
• Disadvantages: extreme reliance on man-made rules; costly in terms of man-hours

Linguistics
• Use of morphology, syntax and semantics
• Not multilingual; demands many training examples

Machine Learning

12

What is Machine Learning

Machine Learning is the study of computer algorithms that automatically improve performance through “experience”

13

Sample for Machine Learning

DOGS CATS

14

Discriminating Features

Q1: Who is this person?

Q2: What are the most discriminating features?

15

Discriminating Features

Answer: Lips Eyes

16

Discriminating Features

The “Margaret Thatcher effect”

17

Supervised Inductive Learning

A process where:

A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)

Using the training set, the learning algorithm generates a classifier

The quality of the classifier is measured via its ability to perform well on novel instances (a test set)

18

Supervised Inductive Learning Example

[Figure: a training set and a test set; test instances are marked as errors or correct]

19

Evaluating a Classifier

Category Classifier

20

Recall and Precision

Use a confusion matrix to count (G/B = classified Good/Bad, Y/N = true label Yes/No):

| Classified | True label: Yes | True label: No | Total |
|---|---|---|---|
| Good | 70 (GY) | 50 (GN) | 120 |
| Bad | 30 (BY) | 150 (BN) | 180 |
| Total | 100 | 200 | 300 |

Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58

Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70

F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64

Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73
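The four measures on this slide can be computed directly from the confusion-matrix counts. The sketch below uses the slide's cell names (GY, GN, BY, BN); the function name and the dictionary layout are choices made for this example.

```python
def classifier_metrics(gy: int, gn: int, by: int, bn: int) -> dict[str, float]:
    """Compute evaluation measures from confusion-matrix counts.

    gy: classified Good, true label Yes    gn: classified Good, true label No
    by: classified Bad,  true label Yes    bn: classified Bad,  true label No
    """
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    # Harmonic mean of precision and recall, written in terms of counts.
    f_measure = 2 * gy / (2 * gy + gn + by)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Counts taken from the slide's confusion matrix.
metrics = classifier_metrics(gy=70, gn=50, by=30, bn=150)
```

Note that accuracy counts the 150 correctly rejected documents, so it can look good even when precision is modest. That is one reason precision and recall are reported separately.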

21

Supervised Statistical Machine Learning

A Supervised Inductive Learning method that is based on statistics obtained from the training set

Benefits:
• Generality and flexibility: successfully applied across a broad spectrum of problems
• Multilingual
• Low labor costs
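As one concrete instance of a statistical learner, the sketch below trains a small multinomial Naive Bayes model on labeled examples and applies it to unseen text. The slides do not say which statistical method GammaWare uses, so Naive Bayes is a stand-in chosen for brevity, and the tiny training set is invented.

```python
import math
from collections import Counter

# Invented training set: (text, label) pairs.
TRAINING_SET = [
    ("dogs bark and fetch sticks", "dogs"),
    ("a puppy barks at the door", "dogs"),
    ("cats purr and chase mice", "cats"),
    ("a kitten purrs on the sofa", "cats"),
]


def train(examples):
    """Collect per-label word counts and document counts from labeled text."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts = model
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_SET)
label = predict(model, "the dog will bark")
```

Nothing in the code depends on the language of the documents: the statistics are gathered from whatever tokens the training set contains, which is the sense in which such methods are multilingual.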

22

How to Classify documents

Pre-defined fields (structured data)
• Author
• Title
• Date

Content (unstructured data)
• From title, main text, emphasized text
• All words
• All 2-word sequences, all 3-word sequences, etc.
• Phrases, synonyms, etc.
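The "all words, all 2 words, all 3 words" features can be read as word n-grams. The sketch below assumes that reading (consecutive word sequences, split on whitespace); the function name and the `max_n` parameter are made up for the example.

```python
def word_ngrams(text: str, max_n: int = 3) -> list[str]:
    """Return all word sequences of length 1..max_n, shortest first."""
    words = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        # Slide a window of n consecutive words across the text.
        for start in range(len(words) - n + 1):
            features.append(" ".join(words[start:start + n]))
    return features


features = word_ngrams("Herbal tea specialist", max_n=2)
```

Longer sequences capture phrases such as "herbal tea" that single words miss, at the cost of a much larger feature set.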

23

Getting Started

24

GammaWare Work Flow

Requirements

Design the Taxonomy

Seeding Process

Check Seed

Train Classifiers

Catalogue Documents

Improve Classifiers

Ready

25

Requirements

Initial parameters and decisions:

Level of percolation, which affects:
• Recall
• Precision

Multi label
• Maximum number of categories into which a document can be classified

Types of training documents
• Full text, keywords
• Different types per category

List of stop words
• Common words in the language used and also in the topic
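Two of these decisions, the multi-label limit and the stop-word list, can be sketched as follows. The `min_confidence` cut-off is an assumption added to show how a single parameter can trade recall against precision; it is not claimed to be GammaWare's "level of percolation". The stop-word list is likewise invented.

```python
# Hypothetical stop-word list: common words in the language and in the topic.
STOP_WORDS = {"the", "a", "of", "and"}


def remove_stop_words(text: str, stop_words: set[str] = STOP_WORDS) -> list[str]:
    """Drop stop words before the text is used for training or classification."""
    return [word for word in text.lower().split() if word not in stop_words]


def assign_labels(
    confidences: dict[str, float], max_labels: int, min_confidence: float
) -> list[str]:
    """Keep the most confident categories, up to the multi-label limit."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    accepted = [category for category, conf in ranked if conf >= min_confidence]
    return accepted[:max_labels]


labels = assign_labels(
    {"tea": 0.9, "health": 0.6, "pets": 0.1}, max_labels=2, min_confidence=0.5
)
```

Raising `min_confidence` or lowering `max_labels` assigns fewer categories per document, which tends to raise precision and lower recall.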

26

Taxonomy

A taxonomy is constructed according to:
• User/business needs: who will be using the taxonomy
• Data: content of the documents for classification

A good taxonomy:
• Requires critical attention to both the definition and application of categories and their labels
• Is simple and intuitive

How: Using the Expert Tool

27

Seeding process

Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents of the same type that the user seeks to catalog

An average of 3-6 relevant documents per category

Seeds can be either “positive seeds” or “negative seeds” for each category

For better results, training documents should have a structure similar to that of the documents for classification

How: Using the Expert Tool

28

Check Seed

Check seed: Classify the seeds into the taxonomy

Output: An HTML page (browsed by the Expert tool)

For each category shows the cataloging results for all the relevant seeds.

Why: Help in locating seeding problems:

Seeds that are multi labeled

Problems in taxonomy structure

How: Using the GammaWare Manager

29

Train Classifiers

Train: Train classifiers for all categories

Output: A classifier file (gcl extension) for each category

Why: The classifiers are used for categorization.

How: Using the GammaWare Manager

30

Classify Documents

Categorization: Catalogue documents into a Taxonomy

Output: A table in a database

Why: This is why we are here.

How: Using the GammaWare Manager

31

Improve Classifiers

Methods to improve classification results using the Expert Tool:

Re-design the taxonomy

Seed problems
• More examples
• Add new seeds: drag and drop documents from the classification view
• Negative “seeds”

Modify Categorization and Train parameters

32

Categorization

33

Hierarchical Categorization

Goal: Classify a document into the appropriate sub-topic(s) in the taxonomy

Difficulties:
• Many sub-topics
• A document may fall into several sub-topics
• Classifiers are not perfect
• Must control “Recall” and “Precision” according to the client’s needs

34

Hierarchical Categorization

Divide and Conquer solution:
• Solve the problem level by level
• At each level, decompose the problem into several smaller classification sub-problems

Note: ignoring interactions between sub-problems can yield poor results

Patent Pending on Categorization
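A naive version of the level-by-level idea can be sketched as a recursive descent through the taxonomy. This is not the patent-pending GammaWare method: the taxonomy, the word-overlap classifier and the threshold are all invented, and the sketch treats every sub-problem independently, which is exactly the simplification the note above warns about.

```python
# Toy taxonomy (hypothetical): each node maps to its child categories.
TAXONOMY = {
    "root": ["health", "business"],
    "health": ["herbal tea", "fitness"],
    "business": ["finance"],
}


def toy_classify(category: str, document: str) -> float:
    """Stand-in classifier: confidence 1.0 if any category word occurs."""
    words = set(document.lower().split())
    return 1.0 if words & set(category.split()) else 0.0


def categorize(document, classify, node="root", threshold=0.5):
    """Descend level by level and return the deepest categories reached."""
    results = []
    for child in TAXONOMY.get(node, []):
        # Each level is a small sub-problem over the children of one node.
        if classify(child, document) >= threshold:
            deeper = categorize(document, classify, child, threshold)
            # Stop at this child if none of its own children accepts.
            results.extend(deeper if deeper else [child])
    return results


topics = categorize("health benefits of herbal tea", toy_classify)
```

Because a document is only tested against the children of categories that already accepted it, a mistake near the root hides every sub-topic below it. The threshold at each level is therefore one place where recall and precision can be controlled.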

35

Focused Crawler

36

Topic Specific Crawling

Goal: retrieve all documents that are relevant to a specific topic of interest

Hyper-linked networks (Intranet, Internet). Two options:
• Crawl the network, then apply classification schemes to filter relevant documents.
• Using classification schemes, crawl the network while teaching the crawler to imitate (intelligent) human surfing strategies.

37

Simple Crawling

Crawling: The process of retrieving documents from the net

[Diagram: crawl spreading out from a Starting Document]

The network is huge:
• Storage
• Network
• Time

Good for general-purpose search engines

38

Link Classifier

Focused Crawling via Link Classifiers

Link classifier: Decision according to the context of the link

[Diagram: the link classifier analyzes the context of each link. A link with the context “My brother new born child” is judged irrelevant; a link with the context “Herbal tea specialist” is relevant, so its URL is retrieved.]
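The link-classifier idea can be sketched on a toy in-memory "web". Everything here is invented for illustration: the page names, the link texts, and the keyword test that stands in for a trained link classifier. No real network access is involved.

```python
from collections import deque

# Hypothetical web: page -> list of (link context, target page).
WEB = {
    "start": [
        ("Herbal tea specialist", "tea-shop"),
        ("My brother's newborn child", "family"),
    ],
    "tea-shop": [("Green tea brewing guide", "brewing")],
    "family": [("Herbal tea recipes", "hidden-tea")],
    "brewing": [],
    "hidden-tea": [],
}

TOPIC_WORDS = {"tea", "herbal"}  # assumed topic vocabulary


def link_is_relevant(context: str) -> bool:
    """Stand-in link classifier: decide from the words around the link."""
    return bool(TOPIC_WORDS & set(context.lower().split()))


def focused_crawl(start: str) -> list[str]:
    """Breadth-first crawl that follows only links judged relevant."""
    visited = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for context, target in WEB.get(page, []):
            if target not in visited and link_is_relevant(context):
                visited.append(target)
                queue.append(target)
    return visited


pages = focused_crawl("start")
```

The crawl never fetches the "family" page, which saves storage, network and time, but it also never discovers the relevant page linked only from there. That is the recall cost of deciding from link context alone.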

39

Focused Crawler – The Learning Process

Crawler Classifier: Checks if the document is good for crawling

[Diagram: the Link Classifier selects the link “Herbal tea specialist”; the content of the link is retrieved and passed to the Crawler Classifier; an acknowledgment is sent to the “link classifier” (Learning Process)]

40

GammaWare API

41

Architecture - Basic

[Diagram labels: Customer Client, GammaWare API, CORBA, GammaWare Proxy, Proxy Client, GammaWare Software, ODBC, Relational Database, File System, GW File System. Content sources: Document Management, Web, File System, Notes/Domino, Outlook.]

42

Multiple Servers

Scalability and Availability

[Diagram labels: Client, GammaWare Proxy, GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4, Database]

43

Q & A