FAO of the UN
Library and Documentation
Systems Division
ECDL 2003, Trondheim
August 2003
Automatic multi-label subject indexing in a multilingual environment
Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy
Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany
ECDL 2003: Trondheim, Norway, 18th August 2003
Agenda
• Introduction
  – Subject Indexing
• Automatic Indexing
  – Document representation model
  – Integration of background knowledge
• Evaluation
  – Test document set
  – Results
• Outlook
• Questions and Discussion
Subject Indexing
• “Subject indexing is the act of describing a document (or any information resource) in terms of its subject content”
• Purpose: facilitate high-precision retrieval of references on a particular subject
Full text search
• Retrieval based only on word occurrences in the text often leads to low-precision results
Subject Indexing at the FAO: Controlled Vocabulary
RICE word tree
• BT cereals
• BT plant products
• UF paddy
• RT oryza
• RT rice flour
• RT rice straw
INDIA word tree
• BT south asia
• BT asia
• NT andhra pradesh
• NT arunachal pradesh
• NT assam
• NT bihar
• …
[Diagram: Resources -> Professional Indexer -> Metadata record]
Metadata record
• Title: Indian rice production
• Author: …
• Subject: Rice flour, …
• Geographic Cov.: Bihar …
Multilingual! Multiple labels!
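One way to hold such word trees in a program is a mapping from each descriptor to its relations. The sketch below is illustrative only: the dictionary layout and the helper `preferred_term` are assumptions made for this example, not part of AGROVOC or of any FAO tool. BT, NT, UF and RT are the usual thesaurus abbreviations for broader term, narrower term, used for, and related term.

```python
# Toy excerpt of the two word trees shown on the slide (layout is assumed).
THESAURUS = {
    "rice": {
        "BT": ["cereals", "plant products"],
        "UF": ["paddy"],
        "RT": ["oryza", "rice flour", "rice straw"],
    },
    "india": {
        "BT": ["south asia", "asia"],
        "NT": ["andhra pradesh", "arunachal pradesh", "assam", "bihar"],
    },
}


def preferred_term(word):
    """Return the descriptor for a word, resolving UF (non-preferred) entries.

    Returns None when the word is not covered by this toy excerpt.
    """
    word = word.lower()
    if word in THESAURUS:
        return word
    for descriptor, relations in THESAURUS.items():
        if word in relations.get("UF", []):
            return descriptor
    return None


print(preferred_term("Paddy"))        # descriptor that "paddy" points to
print(THESAURUS["india"]["NT"][:2])   # first two narrower terms of "india"
```

A professional indexer working with such a vocabulary would record the descriptor rather than the non-preferred synonym, which is what `preferred_term` mimics.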
Subject Indexing at the FAO
Large amounts of information (rapidly growing!)
• Over 400,000 web pages
• Numerous repositories of online publications
• Bibliographical databases
Professional indexing
• Labor intensive
• Expensive
• Information grows faster than it can be professionally indexed
Need for automatic help in indexing and classification
Automatic Text Categorization
[Diagram: documents -> Automatic Classifier; documents -> Human Indexer]
Automatic Text Categorization
[Diagram: Pre-classified documents -> Representation method -> Document word vector -> Support Vector Machines (SVM) -> Automatic Classifier <- document]
Automatic Text Categorization: Word Vector Representation
Document: The rice production … India … farmers grow … water irrigation … produce rice flour and … new production lines …
Word vector (after word stemming)
Term        Count
The         1
Rice        2
Produc      3
India       1
Farmer      1
Grow        1
Water       1
Irrigation  1
Flour       1
And         1
New         1
Line        1
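The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the system used in the talk: the `STEMS` table is a hypothetical stand-in that covers only the words of the example document, where a real pipeline would apply a proper stemming algorithm, and the elisions in the example text are left out.

```python
import re
from collections import Counter

# Hypothetical stem table for the example only; a real system would use a
# stemming algorithm instead of a hand-written lookup.
STEMS = {
    "production": "produc",
    "produce": "produc",
    "farmers": "farmer",
    "lines": "line",
}


def word_vector(text):
    """Lower-case the text, split it into words, stem them and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(STEMS.get(token, token) for token in tokens)


doc = ("The rice production India farmers grow water irrigation "
       "produce rice flour and new production lines")
vector = word_vector(doc)
for term, count in vector.items():
    print(term, count)
```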
Automatic Text Categorization: Word Vector Processing
Word vector (stemmed)
Term        Count
The         1
Rice        2
Produc      3
India       1
Farmer      1
Grow        1
Water       1
Irrigation  1
Flour       1
And         1
New         1
Line        1
Word vector after stopword removal
Term        Count
Rice        2
Produc      3
India       1
Farmer      1
Grow        1
Water       1
Irrigation  1
Flour       1
Line        1
Word vector after pruning
Term        Count
Rice        2
Produc      3
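The two processing steps can be sketched as simple filters. The slide does not state the stopword list or the pruning criterion, so both are assumptions here: `STOPWORDS` contains just the three terms that disappear in the example, and `prune` uses a threshold on the count inside the document, chosen only because it reproduces the example. Pruning could equally be defined over the whole collection.

```python
# Assumed stopword list: exactly the terms dropped in the slide's example.
STOPWORDS = {"the", "and", "new"}


def remove_stopwords(vector, stopwords=STOPWORDS):
    """Drop terms that carry little subject information."""
    return {term: count for term, count in vector.items()
            if term not in stopwords}


def prune(vector, min_count=2):
    """Keep only terms that occur at least min_count times (assumed rule)."""
    return {term: count for term, count in vector.items()
            if count >= min_count}


stemmed = {"the": 1, "rice": 2, "produc": 3, "india": 1, "farmer": 1,
           "grow": 1, "water": 1, "irrigation": 1, "flour": 1, "and": 1,
           "new": 1, "line": 1}
filtered = remove_stopwords(stemmed)
pruned = prune(filtered)
print(filtered)
print(pruned)
```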
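Automatic Text Categorization: Bag of Words Representation
             Rice  Produc  India  …
Document 1   2     3       1      …
Document 2   0     5       0      …
Document 3   10    1       0      …
Each row is the word vector of one document (row 1 is the word vector of document 1).
Weighting of word vectors with term frequency-inverse document frequency (tf-idf):
\[ \mathrm{tfidf}(x_i, t) = \mathrm{tf}(x_i, t) \cdot \log \frac{|D|}{\mathrm{df}(t)} \]
• tf(x_i, t): frequency of word t in document x_i
• |D|: number of documents
• df(t): number of documents in which word t occurs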
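As a worked example, the weighting can be applied to the three documents and three terms visible in the table. This is a sketch under assumptions: only the columns shown (Rice, Produc, India) are used, the natural logarithm is assumed because the slide does not state a base, and the function name `tfidf_matrix` is made up for this illustration.

```python
import math


def tfidf_matrix(counts):
    """Weight a document-term count matrix with tf * log(|D| / df(t))."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents in which term j occurs at least once.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    return [
        [row[j] * math.log(n_docs / df[j]) if df[j] else 0.0
         for j in range(n_terms)]
        for row in counts
    ]


TERMS = ["rice", "produc", "india"]
COUNTS = [
    [2, 3, 1],    # document 1
    [0, 5, 0],    # document 2
    [10, 1, 0],   # document 3
]
weights = tfidf_matrix(COUNTS)
for row in weights:
    print([round(w, 3) for w in row])
```

"Produc" occurs in all three documents, so its weight is zero everywhere, while "India" occurs in only one document and receives the largest idf factor.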
Automatic Text Categorization: Integration of Background Knowledge
AGROVOC as ontology
Background knowledge is represented in the form of an ontology O:
• Set of concepts C
• Concept hierarchy ≤C
• Lexicon Lex
[Diagram: excerpt of the concept hierarchy with the concepts Root, Plant products, Cereals, Rice, Rice flour, Asia, India and China; one link is labelled "related"]
Lexicon entries for the concept Rice: EN: Rice, paddy; FR: Riz; ES: Arroz
Automatic Text Categorization: Integration of Background Knowledge
Word vector with ontology integration
Integration strategy: Add
Parameter: maximum integration depth = 1
Word vector before integration
Term                  Count
Rice                  2
Produc                3
Word vector after integration (concepts added!)
Term                  Count
Rice                  2
Produc                3
Rice (concept)        2
Cereals (concept)     2
Rice flour (concept)  2
Other strategies:
• Replace
• Only (the document is represented only by its concepts: language independent!)
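The three strategies can be sketched on the slide's example. Everything below is a reading of the slide under stated assumptions rather than the authors' implementation: `LEXICON` and `NEIGHBOURS` are a toy ontology fragment, concept names carry a `C:` prefix to keep them apart from words, the depth parameter is taken to mean the number of links followed from a matched concept, and "replace" is taken to mean that a word is dropped once a concept has been found for it.

```python
from collections import Counter

# Toy ontology fragment (assumed for illustration, not real AGROVOC data).
LEXICON = {"rice": "C:Rice"}                      # stemmed word -> concept
NEIGHBOURS = {                                    # links followed per depth
    "C:Rice": ["C:Cereals", "C:Rice flour"],
    "C:Cereals": ["C:Plant products"],
}


def concept_counts(words, max_depth):
    """Count matched concepts and concepts within max_depth links of them."""
    concepts = Counter()
    for term, count in words.items():
        concept = LEXICON.get(term)
        if concept is None:
            continue
        concepts[concept] += count
        seen = {concept}
        frontier = [concept]
        for _ in range(max_depth):
            next_frontier = []
            for current in frontier:
                for neighbour in NEIGHBOURS.get(current, []):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        concepts[neighbour] += count
                        next_frontier.append(neighbour)
            frontier = next_frontier
    return concepts


def integrate(words, strategy="add", max_depth=1):
    """Combine a word vector with ontology concepts (add, replace or only)."""
    concepts = concept_counts(words, max_depth)
    if strategy == "only":
        return dict(concepts)
    if strategy not in ("add", "replace"):
        raise ValueError("unknown strategy: " + strategy)
    result = dict(words)
    if strategy == "replace":
        for term in words:
            if term in LEXICON:
                del result[term]
    result.update(concepts)
    return result


vector = {"rice": 2, "produc": 3}
print(integrate(vector, "add", max_depth=1))
print(integrate(vector, "replace", max_depth=1))
print(integrate(vector, "only", max_depth=1))
```

With the "only" strategy no word of the source language remains in the vector, which is what makes the representation language independent.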
Automatic Text Categorization: Binary Support Vector Machines
[Diagram: document word vectors of class c and of class ĉ, separated by the maximum margin hyperplane]
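Once a hyperplane has been learned, a document is assigned to a class according to the side of the hyperplane on which its word vector lies. The sketch below shows only this decision step and how several binary classifiers could yield multiple labels; it does not train an SVM. It assumes one binary classifier per class, and the weights, biases and class names in `CLASSIFIERS` are invented for illustration.

```python
def decision_value(weights, bias, vector):
    """Signed score w.x + b of a linear classifier for a sparse word vector."""
    return sum(weights.get(term, 0.0) * value
               for term, value in vector.items()) + bias


def assign_labels(classifiers, vector):
    """Return every class whose binary classifier scores the vector above 0."""
    return sorted(label for label, (weights, bias) in classifiers.items()
                  if decision_value(weights, bias, vector) > 0)


# Hypothetical hyperplanes (weights, bias); in practice these come from
# training one binary SVM per class on pre-classified documents.
CLASSIFIERS = {
    "rice": ({"rice": 1.0, "flour": 0.5}, -1.0),
    "irrigation": ({"water": 0.8, "irrigation": 1.2}, -1.5),
    "forestry": ({"tree": 1.0, "timber": 1.0}, -1.0),
}

doc = {"rice": 2, "produc": 3, "water": 1, "irrigation": 1}
print(assign_labels(CLASSIFIERS, doc))
```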
Evaluation
[Diagram: Training documents -> bag of words representation, training of SVM -> Support Vector Machines <- Test documents]
Goal: to achieve the best possible approximation!
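Evaluation: Performance measures
Judgements for class c_i
                            Expert judgements
                            YES     NO
Classifier judgements YES   TP_i    FP_i
Classifier judgements NO    FN_i    TN_i
\[ \mathrm{Precision}_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \]
\[ \mathrm{Recall}_{micro} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \]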
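The two micro-averaged measures can be computed directly from the per-class counts. In this sketch the per-class numbers are invented for illustration, and returning 0.0 for an empty denominator is a convention chosen here, not something stated in the talk.

```python
def micro_precision(counts):
    """Sum of TP_i over all classes divided by the sum of TP_i + FP_i."""
    tp = sum(c["TP"] for c in counts)
    fp = sum(c["FP"] for c in counts)
    return tp / (tp + fp) if tp + fp else 0.0


def micro_recall(counts):
    """Sum of TP_i over all classes divided by the sum of TP_i + FN_i."""
    tp = sum(c["TP"] for c in counts)
    fn = sum(c["FN"] for c in counts)
    return tp / (tp + fn) if tp + fn else 0.0


# Invented counts for two classes, one entry per class c_i.
PER_CLASS = [
    {"TP": 8, "FP": 2, "FN": 2, "TN": 88},
    {"TP": 2, "FP": 3, "FN": 8, "TN": 87},
]
print(round(micro_precision(PER_CLASS), 3))
print(round(micro_recall(PER_CLASS), 3))
```

Because the counts are pooled before dividing, classes with many documents weigh more heavily in the micro average than rare classes.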
The test document set
FAO library catalogue
• Journals
• Proceedings
• Articles
• Many other resources
In 3 languages
• English
• French
• Spanish
Indexed with keywords from AGROVOC, a multilingual thesaurus (> 16000 classes)
Requirement for test set: > 50 documents per class
Evaluation: 3 evaluation settings
• Single-label vs. multi-label classification
• Language recognition (single-label case; the only label is the language of the document)
• Integration of background knowledge for the single-label case
Evaluation: Results
Single-label vs. multi-label classification
[Figure: breakeven value (y-axis, 0.4 to 0.7) against number of training examples (x-axis: 5, 10, 20, 30, 40, 50); series: single_en, Multi_en, Multi_fr, Multi_es]
Evaluation: Results
Integration of background knowledge
[Figure: precision (y-axis, 0.0 to 0.8) against ontology integration (x-axis: depth 0, 1, 2, with the strategies add, replace and only); series: 10 training examples, 50 training examples; a reference value (no integration) is marked]
• English document set
• Single-label case only
Evaluation: Conclusion
• Support vector machines behave robustly across different languages
• Results are comparatively good, given the inconsistency among human indexers
• Ontology integration offers promising possibilities for the future
Outlook
Representing a document’s word vector only by its concepts found in the ontology gives
• a language-independent document representation
• a language-independent text classifier
Possibility to
• train the SVM in one language only
• classify documents in any language covered by the multilingual ontology
• classify multilingual documents
Further investigation is necessary on
• performance loss in the case of total concept representation
• performance with other document sets
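The idea of a language-independent representation can be illustrated with a small multilingual lexicon. This is a sketch under assumptions: the entries for Rice (rice, paddy, riz, arroz) follow the lexicon example shown earlier in the talk, while the entries for India and the sample sentences are invented here, and the function name `concept_vector` is hypothetical.

```python
import re
from collections import Counter

# Multilingual lexicon: word in any language -> concept.
# The "india"/"inde" entries are assumed for illustration.
LEXICON = {
    "rice": "Rice", "paddy": "Rice", "riz": "Rice", "arroz": "Rice",
    "india": "India", "inde": "India",
}


def concept_vector(text):
    """Represent a text only by the concepts its words map to."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(LEXICON[token] for token in tokens if token in LEXICON)


english = concept_vector("Rice production in India")
french = concept_vector("Production de riz en Inde")
spanish = concept_vector("Producción de arroz en la India")
print(english == french == spanish)
```

Since all three sentences map to the same concept vector, a classifier trained on vectors from one language could in principle be applied to the others, at the cost of discarding every word that has no lexicon entry.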
References
• More on automatic classification: http://www.aifb.uni-karlsruhe.de/WBS/aho/
• More on knowledge management: http://www.fzi.de/wim/index.html
• More on ontologies and ontology engineering: http://kaon.semanticweb.org
• More on FAO
  – AGROVOC online: http://www.fao.org/agrovoc
  – Waicent Portal: http://www.fao.org/waicent/index_en.asp