2
Overview
- The challenge
- Taxonomies
- Classification
- Focused Crawler
- Q&A
- Categorization
3
The challenge: Generate Structured Taxonomies of text repositories
Generate a structured taxonomy of huge text repositories
[Diagram labels: XML, Word, Domino, Web, Catalogues, Forms; Structured Data, Unstructured Data; Internal DB; Business, Relevant Content; Information, Application, Services]
4
Taxonomy
5
What is a Taxonomy
Taxonomy: Taxis = arrangement or division; Nomos = law
The science of classification according to a pre-determined system
The best-known use of taxonomy is in biology: taxonomies of animals and plants
6
Web Taxonomy
Best-known use of taxonomies: Web portals or directories, where Internet sites are classified into hierarchical topics
General:
• Yahoo! http://www.yahoo.com/
• Open Directory http://www.dmoz.org/
• LookSmart http://www.looksmart.com/r?country=uk
Topical:
• Business.Com http://www.business.com/
• HealthWeb http://www.healthweb.org/
• Education Planet http://www.educationplanet.com/
7
Taxonomy - Sample
8
Taxonomy vs. Thesaurus
| Criteria | Taxonomy | Thesaurus |
|---|---|---|
| Focus | Documents and their organization | Terms used in the organization |
| Usage | Classification of documents (classified into categories/terms) | Indexing documents (terms are attached to documents) |
| Retrieval | Mainly browsing | Keyword queries |
| Size | Restricted to the necessary terms | Very large (terms may be added freely) |
9
Classification
10
What is a Classifier
Concept (topic, subject): an abstract or generic idea generalized from particular instances [Merriam-Webster]
Classifier: a function of a concept (category) and an object (document) that returns a number between 0 and 1, called the confidence rate
Confidence rate: a measure of the confidence that the object (document) belongs to (should be classified under) the concept (category)
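The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Category` type, the `confidence` function and the keyword-overlap scoring are hypothetical and are not GammaWare's classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    """A concept, described here (as an assumption) by a set of keywords."""
    name: str
    keywords: frozenset


def confidence(category: Category, document: str) -> float:
    """Return a confidence rate in [0, 1] that `document` belongs to `category`.

    Illustrative scoring only: the fraction of the category's keywords
    that appear in the document.
    """
    if not category.keywords:
        return 0.0
    words = set(document.lower().split())
    hits = len(category.keywords & words)
    return hits / len(category.keywords)


tea = Category("Herbal tea", frozenset({"tea", "herbal", "infusion", "chamomile"}))
score = confidence(tea, "Chamomile is a popular herbal tea")
print(score)  # 3 of the 4 keywords appear, so this prints 0.75
```

The point is the shape of the function (category and document in, a number between 0 and 1 out), not the scoring rule, which a trained classifier would replace.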
11
Methods for Automatic Classification
Rule based: a pre-defined set of rules
Advantage:
• incorporates prior knowledge
Disadvantages:
• extreme reliance on man-made rules
• costly in terms of man-hours
Linguistics: use of morphology, syntax and semantics
• not multilingual
• demands many training examples
Machine Learning
12
What is Machine Learning
Machine Learning is the study of computer algorithms that automatically improve performance through “experience”
13
Sample for Machine Learning
DOGS CATS
14
Discriminating Features
Q1: Who is this person?
Q2: What are the most discriminating features?
15
Discriminating Features
Answer: Lips Eyes
16
Discriminating Features
The “Margaret Thatcher effect”
17
Supervised Inductive Learning
A process where:
A learning algorithm is provided with a set of labeled instances, positive and negative examples (a training set)
Using the training set, the learning algorithm generates a classifier
The quality of the classifier is measured via its ability to perform well on novel instances (a test set)
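The three steps above (labeled training set, generated classifier, evaluation on a test set) can be sketched as follows. The word-weight learner and the tiny dogs-versus-cats data are assumptions made for illustration; they are not the algorithm used in the presentation.

```python
from collections import Counter


def train(training_set):
    """Learn per-word weights from labeled instances (text, is_positive)."""
    pos, neg = Counter(), Counter()
    for text, is_positive in training_set:
        (pos if is_positive else neg).update(text.lower().split())
    # A weight above zero means the word was seen more often in positive examples.
    return {w: pos[w] - neg[w] for w in set(pos) | set(neg)}


def classify(weights, text):
    """Predict True (positive) when the summed word weights are above zero."""
    return sum(weights.get(w, 0) for w in text.lower().split()) > 0


def accuracy(weights, test_set):
    """Fraction of novel instances that the classifier labels correctly."""
    correct = sum(classify(weights, text) == label for text, label in test_set)
    return correct / len(test_set)


# Hypothetical data: True = "dogs", False = "cats".
training_set = [
    ("dog barks loudly", True),
    ("puppy fetches the ball", True),
    ("cat purrs softly", False),
    ("kitten chases the mouse", False),
]
test_set = [
    ("the dog fetches", True),
    ("the cat purrs", False),
    ("a puppy barks", True),
    ("a kitten sleeps", False),
]

weights = train(training_set)
test_accuracy = accuracy(weights, test_set)
print(test_accuracy)
```

The test set contains only sentences that never appear in the training set, which is what "novel instances" requires.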
18
Supervised Inductive Learning Example
[Figure: a training set and a test set; test results are marked as errors or correct]
19
Evaluating a Classifier
Category Classifier
20
Recall and Precision
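Use a confusion matrix to count:
| Classified | True label: Yes | True label: No | Total |
|---|---|---|---|
| Good | 70 | 50 | 120 |
| Bad | 30 | 150 | 180 |
| Total | 100 | 200 | 300 |
GY = classified Good, true label Yes; GN = classified Good, true label No; BY = classified Bad, true label Yes; BN = classified Bad, true label No
Precision (P) = GY / (GY + GN) = 70 / (70+50) = 0.58
Recall (R) = GY / (GY + BY) = 70 / (70+30) = 0.70
F-measure (F) = 2/(1/P + 1/R) = 2*GY/(GY+GN+GY+BY) = 2*70/(120+100) = 0.64
Accuracy (A) = (GY+BN)/(GY+GN+BY+BN) = 220 / 300 = 0.73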
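As a check on the arithmetic, the four measures can be computed directly from the confusion-matrix counts. The function name `metrics` and its argument names are illustrative; GY, GN, BY and BN are the Good/Bad by Yes/No cells of the matrix. Note that F is 140/220 = 0.636..., which rounds to 0.64.

```python
def metrics(gy, gn, by, bn):
    """Compute precision, recall, F-measure and accuracy from the four cells.

    gy: classified Good, true label Yes    gn: classified Good, true label No
    by: classified Bad,  true label Yes    bn: classified Bad,  true label No
    """
    precision = gy / (gy + gn)
    recall = gy / (gy + by)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (gy + bn) / (gy + gn + by + bn)
    return precision, recall, f_measure, accuracy


p, r, f, a = metrics(gy=70, gn=50, by=30, bn=150)
print(f"P={p:.2f} R={r:.2f} F={f:.2f} A={a:.2f}")
# prints: P=0.58 R=0.70 F=0.64 A=0.73
```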
21
Supervised Statistical Machine Learning
A Supervised Inductive Learning method that is based on statistics obtained from the training set
Benefits:
Generality and flexibility
• Successfully applied across a broad spectrum of problems
Multilingual
Low labor costs
22
How to Classify documents
Pre-defined fields (structured data): Author, Title, Date
Content (unstructured data):
• From title, main text, emphasized text
• All words
• All word pairs, all word triples, etc.
• Phrases, synonyms, etc.
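A minimal sketch of turning a document into features along these lines, assuming that "all 2 words, all 3 words" means runs of two and three consecutive words. The document layout (a dict with `author`, `title`, `date` and `text` keys), the sample values and both function names are hypothetical.

```python
def word_ngrams(text, n):
    """Return every run of n consecutive words in the text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def extract_features(doc, max_n=3):
    """Combine the structured fields with word n-grams from the content."""
    features = [f"{field}={doc[field]}" for field in ("author", "title", "date")]
    for n in range(1, max_n + 1):
        features.extend(word_ngrams(doc["text"], n))
    return features


# Hypothetical document; every value here is made up for the example.
doc = {
    "author": "A. Author",
    "title": "Herbal tea",
    "date": "2001-01-01",
    "text": "Herbal tea is soothing",
}
features = extract_features(doc)
print(features)
```

Phrases and synonyms would need linguistic resources beyond this sketch, so they are left out.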
23
Getting Started
24
GammaWare Work Flow
Requirements
Design the Taxonomy
Seeding Process
Check Seed
Train Classifiers
Catalogue Documents
Improve Classifiers
Ready
25
Requirements
Initial parameters and decisions:
Level of percolation, which affects:
• Recall
• Precision
Multi-label:
• Maximum number of categories into which a document can be classified
Types of training documents:
• Full text, keywords
• Different types per category
List of stop words:
• Common words in the language used and also in the topic
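How these parameters might interact can be sketched as below. The slides do not define "level of percolation"; this example assumes it behaves like a confidence threshold, which is a guess. The names `STOP_WORDS`, `remove_stop_words` and `assign_categories`, and all scores, are hypothetical.

```python
# Hypothetical stop-word list: common words in the language plus one word
# ("tea") that is common across the whole topic and so does not discriminate.
STOP_WORDS = {"the", "a", "of", "and", "tea"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words before the text is used for training or classification."""
    return [w for w in text.lower().split() if w not in stop_words]


def assign_categories(scores, threshold, max_labels):
    """Keep the highest-scoring categories at or above the threshold.

    threshold:  assumed stand-in for the level of percolation.
    max_labels: the multi-label limit on categories per document.
    """
    passing = [(name, s) for name, s in scores.items() if s >= threshold]
    passing.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in passing[:max_labels]]


# Hypothetical confidence rates for one document.
scores = {"Health": 0.82, "Food": 0.64, "Business": 0.41, "Sports": 0.05}
strict = assign_categories(scores, threshold=0.6, max_labels=1)
loose = assign_categories(scores, threshold=0.4, max_labels=3)
print(strict, loose)
```

A higher threshold admits fewer categories, which tends to raise precision at the cost of recall; a lower one does the opposite.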
26
Taxonomy
A taxonomy is constructed according to:
User/business needs
• who will be using the taxonomy
Data
• content of the documents to be classified
A good taxonomy:
• requires critical attention to both the definition and application of categories and their labels
• is simple and intuitive
How: Using the Expert Tool
27
Seeding process
Seeding process: each category within the taxonomy needs to be given a few examples of relevant documents, of the same type that the user seeks to catalog
An average of 3-6 relevant documents per category
Seeds can be either “positive seeds” or “negative seeds” for each category
For better results, training documents should have a structure similar to that of the documents to be classified
How: Using the Expert Tool
28
Check Seed
Check seed: Classify the seeds into the taxonomy
Output: An HTML page (browsed by the Expert tool)
For each category, it shows the cataloging results for all the relevant seeds.
Why: Help in locating seeding problems:
Seeds that are multi labeled
Problems in taxonomy structure
How: Using the GammaWare Manager
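The idea behind checking seeds can be sketched as follows: classify every seed against all categories and report any seed that does not land in exactly its own category. This is an illustration only; the keyword-overlap `confidence` function, the categories, the seeds and the report format are all hypothetical, and the real tool produces an HTML page instead.

```python
def confidence(keywords, text):
    """Illustrative scorer: fraction of category keywords found in the text."""
    words = set(text.lower().split())
    return len(keywords & words) / len(keywords)


# Hypothetical categories (described by keywords) and their seed documents.
CATEGORIES = {
    "Herbal tea": {"herbal", "tea"},
    "Coffee": {"coffee", "espresso"},
}
SEEDS = {
    "Herbal tea": ["herbal tea for sleep", "green tea and coffee compared"],
    "Coffee": ["espresso coffee machines"],
}


def check_seeds(categories, seeds, threshold=0.5):
    """Map each problematic seed to the list of categories that accepted it."""
    report = {}
    for category, docs in seeds.items():
        for doc in docs:
            labels = [name for name, keywords in categories.items()
                      if confidence(keywords, doc) >= threshold]
            if labels != [category]:
                report[doc] = labels  # multi-labeled, mislabeled or unlabeled
    return report


problems = check_seeds(CATEGORIES, SEEDS)
print(problems)
```

Here the second "Herbal tea" seed is accepted by both categories, which is the kind of multi-labeled seed the check is meant to surface.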
29
Train Classifiers
Train: Train classifiers for all categories
Output: A classifier file (gcl extension) for each category
Why: The classifiers are used for categorization.
How: Using the GammaWare Manager
30
Classify Documents
Categorization: Catalogue documents into a Taxonomy
Output: A table in a database
Why: This is why we are here.
How: Using the GammaWare Manager
31
Improve Classifiers
Methods to improve classification results using the Expert Tool.
Re-design the taxonomy
Seed problems:
• More examples
• Add new seeds
• Drag and drop documents from the classification view
• Negative “seeds”
Modify Categorization and Train parameters
32
Categorization
33
Hierarchical Categorization
Goal: Classify a document into the appropriate sub-topic(s) in the taxonomy
Difficulties:
• Many sub-topics
• A document may fall into several sub-topics
• Classifiers are not perfect
• Must control “Recall” and “Precision” according to the client’s needs
34
Hierarchical Categorization
Divide and Conquer solution:
• Solve the problem level by level
• At each level, decompose the problem into several smaller classification sub-problems
Note: ignoring interactions between sub-problems can yield poor results
Patent Pending on Categorization
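A minimal sketch of level-by-level categorization, assuming a taxonomy stored as nested dictionaries and one classifier per node. The keyword-overlap scorer, the `TAXONOMY` contents and the `categorize` function are hypothetical, and this is not the patent-pending method referred to above.

```python
def confidence(keywords, text):
    """Illustrative scorer: fraction of node keywords found in the text."""
    words = set(text.lower().split())
    return len(keywords & words) / len(keywords)


# Hypothetical taxonomy: name -> (keywords, children).
TAXONOMY = {
    "Health": ({"health", "remedy"}, {
        "Herbal tea": ({"herbal", "tea"}, {}),
        "Fitness": ({"exercise", "gym"}, {}),
    }),
    "Business": ({"market", "profit"}, {}),
}


def categorize(text, level, threshold=0.5, path=()):
    """Return the deepest taxonomy paths whose classifiers accept the text."""
    results = []
    for name, (keywords, children) in level.items():
        if confidence(keywords, text) < threshold:
            continue  # rejected here, so the whole subtree is skipped
        here = path + (name,)
        deeper = categorize(text, children, threshold, here)
        # Prefer sub-topics; fall back to this node if no child accepts.
        results.extend(deeper if deeper else [here])
    return results


paths = categorize("a herbal tea remedy for better health", TAXONOMY)
print(paths)
```

Each level is decided independently in this sketch, so it is exactly the kind of solution the note warns about: an error at a top-level node hides every sub-topic beneath it.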
35
Focused Crawler
36
Topic Specific Crawling
Goal: retrieve all documents that are relevant to a specific topic of interest
Hyper-linked networks (Intranet, Internet)
Two options:
• Crawl the network, then apply classification schemes to filter relevant documents.
• Use classification schemes while crawling the network, teaching the crawler to imitate (intelligent) human surfing strategies.
37
Simple Crawling
Crawling: The process of retrieving documents from the net
Starting Document
The Network is huge
Storage
Network
Time
Good for general-purpose search engines
38
Link Classifier
Focused Crawling via Link Classifiers
Link classifier: Decision according to the context of the link
Analyze the context of the link:
• “My brother's newborn child”: link is irrelevant
• “Herbal tea specialist”: retrieve the URL
39
Focused Crawler – The Learning Process
Crawler Classifier: Checks if the document is good for crawling
Link Classifier
Herbal tea specialist
Retrieve the content of the link
Send acknowledgment to the “link classifier” - Learning Process
Crawler Classifier
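The learning loop can be sketched over a small in-memory "network" (no real HTTP requests). Everything here is an assumption made for illustration: the `WEB` dictionary, the keyword test standing in for the crawler classifier, and the word-weight update standing in for the acknowledgment sent to the link classifier.

```python
from collections import deque

# Hypothetical network: url -> (document text, [(link text, target url)]).
WEB = {
    "start": ("tea portal", [("herbal tea specialist", "shop"),
                             ("my brother's newborn child", "family")]),
    "shop": ("herbal tea blends and infusions", [("tea brewing guide", "guide")]),
    "family": ("photos of the baby", []),
    "guide": ("how to brew herbal tea", []),
}


def is_relevant(text, topic_words):
    """Crawler classifier stand-in: does the document mention the topic?"""
    return bool(topic_words & set(text.split()))


def focused_crawl(start, topic_words):
    """Follow only links whose context scores above zero, learning as we go."""
    link_weights = {w: 1 for w in topic_words}  # link classifier state
    queue, seen, retrieved = deque([(start, [])]), {start}, []
    while queue:
        url, link_words = queue.popleft()
        text, links = WEB[url]
        relevant = is_relevant(text, topic_words)
        if relevant:
            retrieved.append(url)
        # Acknowledgment: reward or penalize the words of the link we followed.
        for w in link_words:
            link_weights[w] = link_weights.get(w, 0) + (1 if relevant else -1)
        for link_text, target in links:
            words = link_text.split()
            score = sum(link_weights.get(w, 0) for w in words)
            if target not in seen and score > 0:
                seen.add(target)
                queue.append((target, words))
    return retrieved, link_weights


retrieved, link_weights = focused_crawl("start", {"tea"})
print(retrieved)
```

In this run the "family" page is never fetched because its link text scores zero, while words from links that led to relevant pages gain weight for later decisions.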
40
GammaWare API
41
Architecture - Basic
[Architecture diagram labels: Customer Client, GammaWare API, CORBA, GammaWare Proxy, Proxy Client, GammaWare Software, ODBC, Relational Database, File System, GW File System. Document sources: Document Management, Web, File System, Notes/Domino, Outlook]
42
Multiple Servers
Scalability and Availability
[Diagram labels: Client, GammaWare Proxy, GammaWare Server, GammaWare Server 2, GammaWare Server 3, GammaWare Server 4, Database]
43
Q & A