FAO of the UN Library and Documentation Systems Division ECDL 2003 Trondheim August 03 Automatic multi-label subject indexing in a multilingual environment: Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy; Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany. ECDL 2003: Trondheim, Norway, 18th August 2003


Page 1: FAO of the UN Library and Documentation Systems Division ECDL 2003 Trondheim August 03 Automatic multi-label subject indexing in a multilingual environment

FAO of the UN, Library and Documentation Systems Division

Automatic multi-label subject indexing in a multilingual environment

Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy

Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany

ECDL 2003: Trondheim, Norway, 18th August 2003

Page 2: Agenda

• Introduction
  – Subject Indexing
• Automatic Indexing
  – Document representation model
  – Integration of background knowledge
• Evaluation
  – Test document set
  – Results
• Outlook
• Questions and Discussion

Page 3: Subject Indexing

Subject Indexing
• “Subject indexing is the act of describing a document (or any information resource) in terms of its subject content”
• Purpose: facilitate high-precision retrieval of references on a particular subject

Full text search
• Retrieval based only on word occurrences in the text often leads to low-precision results

Page 4: Subject Indexing at the FAO

Controlled Vocabulary

RICE Word Tree
• BT cereals
• BT plant products
• UF paddy
• RT oryza
• RT rice flour
• RT rice straw

INDIA Word Tree
• BT south asia
• BT asia
• NT andhra pradesh
• NT arunachal pradesh
• NT assam
• NT bihar
• …

Resources

Professional Indexer

Metadata record
Title: Indian rice production
Author: …
Subject: Rice flour, …
Geographic Cov.: Bihar …

Multilingual!

Multiple labels!
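The word trees on this slide can be held in a small data structure. The sketch below is only an illustration: the class name `ThesaurusEntry`, its field names, and the helper `preferred_term` are hypothetical and not part of AGROVOC or any FAO tool. It reads BT, NT, UF and RT in their usual thesaurus sense (broader term, narrower term, used for, related term) and copies the entries from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ThesaurusEntry:
    """One controlled-vocabulary term with the relation types shown on the slide."""
    term: str
    broader: List[str] = field(default_factory=list)   # BT
    narrower: List[str] = field(default_factory=list)  # NT
    used_for: List[str] = field(default_factory=list)  # UF (non-preferred synonyms)
    related: List[str] = field(default_factory=list)   # RT


# Entries copied from the slide; the INDIA list is truncated there ("…") and stays truncated here.
RICE = ThesaurusEntry(
    term="rice",
    broader=["cereals", "plant products"],
    used_for=["paddy"],
    related=["oryza", "rice flour", "rice straw"],
)
INDIA = ThesaurusEntry(
    term="india",
    broader=["south asia", "asia"],
    narrower=["andhra pradesh", "arunachal pradesh", "assam", "bihar"],
)


def preferred_term(word: str, entries: List[ThesaurusEntry]) -> Optional[str]:
    """Return the preferred term for a word, following UF synonyms; None if unknown."""
    lowered = word.lower()
    for entry in entries:
        if lowered == entry.term or lowered in entry.used_for:
            return entry.term
    return None


vocabulary = [RICE, INDIA]
```

With this structure, `preferred_term("Paddy", vocabulary)` resolves the synonym to `"rice"`, which is the kind of normalisation a controlled vocabulary gives an indexer.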

Page 5: Subject Indexing at the FAO

Large amounts of information
• Over 400,000 web pages
• Numerous repositories of online publications
• Bibliographical databases
Rapidly growing!

Professional Indexing
• Labor intensive
• Expensive
• Information grows faster than professional indexers can process it

Need for automatic help in indexing and classification


Page 7: Automatic Text Categorization

[Diagram labels: documents, Human Indexer; documents, Automatic Classifier]

Page 8: Automatic Text Categorization

[Diagram labels, in reading order: Pre-classified documents; Representation method; Document word vector; Support Vector Machines (SVM); Automatic Classifier; document]

Page 9: Automatic Text Categorization: Word Vector Representation

Document: "The rice production … India … farmers grow … water irrigation … produce rice flour and … new production lines …"

Word stemming

Word Vector
Word         Count
The          1
Rice         2
Produc       3
India        1
Farmer       1
Grow         1
Water        1
Irrigation   1
Flour        1
And          1
New          1
Line         1
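The step from document text to a stemmed word vector can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the lookup table `TOY_STEMS` is a hypothetical stand-in for a real stemming algorithm and covers only the words of the slide's example, and the function names are assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for a real stemmer: only the words of the slide's example.
TOY_STEMS = {
    "production": "produc",
    "produce": "produc",
    "farmers": "farmer",
    "lines": "line",
}


def stem(word: str) -> str:
    """Return the toy stem of a word, or the word itself if no rule applies."""
    return TOY_STEMS.get(word, word)


def word_vector(text: str) -> Counter:
    """Lowercase the text, split it into alphabetic tokens, stem them, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(token) for token in tokens)


SLIDE_TEXT = (
    "The rice production … India … farmers grow … water irrigation … "
    "produce rice flour and … new production lines …"
)
vector = word_vector(SLIDE_TEXT)
```

For the slide's text this yields the same counts as the slide, for example 2 for `rice` and 3 for `produc`.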

Page 10: Automatic Text Categorization: Word Vector Processing

Word         Word vector   After stopword removal   After pruning
The          1             -                        -
Rice         2             2                        2
Produc       3             3                        3
India        1             1                        -
Farmer       1             1                        -
Grow         1             1                        -
Water        1             1                        -
Irrigation   1             1                        -
Flour        1             1                        -
And          1             -                        -
New          1             -                        -
Line         1             1                        -
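The two processing steps can be sketched as simple filters over the word vector. The stopword list below contains only the three words the slide drops, and the pruning threshold of 2 is an assumption chosen so that the result matches the slide; the slide does not state the actual list or threshold.

```python
from collections import Counter

# Assumption: only the words the slide removes; a real stopword list is much longer.
STOPWORDS = {"the", "and", "new"}


def remove_stopwords(vector: Counter, stopwords=STOPWORDS) -> Counter:
    """Drop every word that appears in the stopword list."""
    return Counter({w: c for w, c in vector.items() if w not in stopwords})


def prune(vector: Counter, min_count: int = 2) -> Counter:
    """Keep only words that occur at least min_count times (threshold is assumed)."""
    return Counter({w: c for w, c in vector.items() if c >= min_count})


vector = Counter({
    "the": 1, "rice": 2, "produc": 3, "india": 1, "farmer": 1, "grow": 1,
    "water": 1, "irrigation": 1, "flour": 1, "and": 1, "new": 1, "line": 1,
})
filtered = remove_stopwords(vector)
pruned = prune(filtered)
```

Here `filtered` keeps nine words and `pruned` keeps only `rice` and `produc`, as on the slide.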

Page 11: Automatic Text Categorization: Bag of Words Representation

             Rice   Produc   India   …
Document 1   2      3        1       …
Document 2   0      5        0       …
Document 3   10     1        0       …

Each row is the word vector of one document, e.g. the first row is the word vector of document 1.

Weighting of word vectors with term frequency, inverse document frequency (tf-idf):

$$\mathrm{tfidf}(x_i, t) = \mathrm{tf}(x_i, t) \cdot \log\frac{|D|}{\mathrm{df}(t)}$$

|D|: number of documents
df(t): number of documents in which word t occurs
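The weighting can be tried on the three visible columns of the table. This is a sketch under stated assumptions: it reads tf(x_i, t) as the raw count of word t in document x_i, uses the natural logarithm (the slide does not give the base), and ignores the columns the slide elides with "…". The function names are hypothetical.

```python
import math

TERMS = ["rice", "produc", "india"]
# Rows are documents 1 to 3, columns follow TERMS (values from the slide).
COUNTS = [
    [2, 3, 1],
    [0, 5, 0],
    [10, 1, 0],
]


def document_frequency(counts):
    """For each term, count the documents in which it occurs at least once."""
    n_terms = len(counts[0])
    return [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]


def tfidf(counts):
    """Weight each count by log(|D| / df(t)); terms with df(t) = 0 get weight 0."""
    n_docs = len(counts)
    df = document_frequency(counts)
    return [
        [tf * math.log(n_docs / df[j]) if df[j] else 0.0 for j, tf in enumerate(row)]
        for row in counts
    ]


weights = tfidf(COUNTS)
```

Because `produc` occurs in all three documents, its weight is 0 everywhere: a word that appears in every document does not help to tell documents apart.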

Page 12: Automatic Text Categorization: Integration of Background Knowledge

Background knowledge is represented in the form of an ontology O:
• Set of concepts C
• Concept hierarchy ≤C
• Lexicon Lex

AGROVOC as ontology

[Diagram labels: Root; Plant products; Cereals; Rice (lexicon entries EN: Rice, FR: Riz, ES: Arroz; EN: paddy); Rice flour; Asia; India; China; edge label: related]

Page 13: Automatic Text Categorization: Integration of Background Knowledge

Word vector with ontology integration

Integration strategy: Add
Parameter: Maximum Integration Depth: 1

Word vector          Word vector after integration
Rice      2          Rice         2
Produc    3          Produc       3
                     Rice         2   (concept)
                     Cereals      2   (concept)
                     Rice flour   2   (concept)

Other strategies:
• Replace
• Only (the document is represented only by its concepts: language independent!)
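The three integration strategies can be sketched as follows. This is an interpretation, not the authors' code: the tiny `LEXICON` and `PARENT` tables, the `C:` prefix for concept identifiers, and the function names are hypothetical, "replace" is read as "swap each matched word for its concepts and keep unmatched words", and the depth parameter is read as the number of hierarchy levels to climb. The slide's extra "Rice flour" entry is not reproduced, because the slide does not say how it is derived.

```python
from collections import Counter

# Hypothetical mini-ontology built from the labels on the slides.
LEXICON = {"rice": "C:rice", "paddy": "C:rice", "riz": "C:rice", "arroz": "C:rice"}
PARENT = {"C:rice": "C:cereals", "C:cereals": "C:plant products"}


def concept_counts(vector, max_depth=1):
    """Count each matched concept and its superconcepts up to max_depth levels up."""
    concepts = Counter()
    for term, count in vector.items():
        concept = LEXICON.get(term)
        depth = 0
        while concept is not None and depth <= max_depth:
            concepts[concept] += count
            concept = PARENT.get(concept)
            depth += 1
    return concepts


def integrate(vector, strategy="add", max_depth=1):
    """Combine a word vector with its concepts using add, replace, or only."""
    concepts = concept_counts(vector, max_depth)
    if strategy == "add":
        return Counter(vector) + concepts
    if strategy == "replace":
        unmatched = Counter({t: c for t, c in vector.items() if t not in LEXICON})
        return unmatched + concepts
    if strategy == "only":
        return concepts
    raise ValueError(f"unknown strategy: {strategy}")


words = Counter({"rice": 2, "produc": 3})
added = integrate(words, "add")
replaced = integrate(words, "replace")
only = integrate(words, "only")
```

With depth 1, `added` keeps both words and gains the concepts for rice and cereals, while `only` contains nothing but those two concepts.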

Page 14: Automatic Text Categorization: Binary Support Vector Machines

[Diagram labels: Document word vectors; Class c; Class ĉ; Maximum Margin Hyperplane]


Page 16: Evaluation

[Diagram labels: Training documents; Bag of words representation, training of SVM; Support Vector Machines; Test documents]

Goal: to achieve the best possible approximation!

Page 17: Evaluation: Performance measures

Class c_i                       Expert judgements
                                YES      NO
Classifier judgements    YES    TP_i     FP_i
                         NO     FN_i     TN_i

$$\mathrm{Precision}_{\mathrm{micro}} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)}$$

$$\mathrm{Recall}_{\mathrm{micro}} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)}$$
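Micro-averaging pools the contingency counts of all classes before dividing. The sketch below shows this with two made-up classes; the counts are hypothetical illustration values, not results from the evaluation, and the function names are assumptions.

```python
def micro_precision(tables):
    """Sum TP and FP over all classes, then compute TP / (TP + FP)."""
    tp = sum(t["TP"] for t in tables)
    fp = sum(t["FP"] for t in tables)
    return tp / (tp + fp) if tp + fp else 0.0


def micro_recall(tables):
    """Sum TP and FN over all classes, then compute TP / (TP + FN)."""
    tp = sum(t["TP"] for t in tables)
    fn = sum(t["FN"] for t in tables)
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical per-class contingency counts (one dict per class c_i).
example_tables = [
    {"TP": 8, "FP": 2, "FN": 4},
    {"TP": 1, "FP": 3, "FN": 1},
]
precision = micro_precision(example_tables)  # 9 / 14
recall = micro_recall(example_tables)        # 9 / 14
```

Because the counts are pooled, classes with many documents dominate the micro-averaged scores; the small second class barely moves the result here.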

Page 18: The test document set

FAO library catalogue
• Journals
• Proceedings
• Articles
• Many other resources

In 3 languages
• English
• French
• Spanish

Indexed with keywords from AGROVOC, a multilingual thesaurus (> 16000 classes)

Requirement for test set: > 50 documents per class

Page 19: Evaluation

3 evaluation settings
• Single-label vs. multi-label classification
• Language recognition (single-label case; the only label is the language of the document)
• Integration of background knowledge for the single-label case

Page 20: Evaluation: Results: Single-label vs. multi-label classification

[Figure: breakeven point (y-axis, 0.4 to 0.7) against number of training examples (x-axis: 5, 10, 20, 30, 40, 50); series: single_en, Multi_en, Multi_fr, Multi_es]

Page 21: Evaluation: Results: Integration of background knowledge

[Figure: precision (y-axis, 0.0 to 0.8) against ontology integration depth (x-axis: 0, 1, 2) and strategy (add, replace, only, at depths 1 and 2); series: 10 Training Examples, 50 Training Examples; depth 0 is the reference value (no integration)]

• English document set
• Single-label case only

Page 22: Evaluation: Conclusion

• Support vector machines behave robustly across different languages
• Results are comparatively good, considering the inconsistency among human indexers
• Ontology integration opens up promising future possibilities


Page 24: Outlook

Representing a document’s word vector only with its concepts found in the ontology

Language-independent document representation!

Language-independent text classifier

Possibility to
• train the SVM in one language only
• classify documents in any language (provided by the multilingual ontology)
• classify multilingual documents

Further investigation necessary on
• performance loss in case of total concept representation
• performance with other document sets
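The language independence of a concept-only representation can be illustrated with the lexicon entries from the AGROVOC slide (EN: Rice, paddy; FR: Riz; ES: Arroz). The sketch is an assumption-laden illustration: the concept identifier `C:rice`, the function name, and the token lists are hypothetical.

```python
from collections import Counter

# Lexicon entries taken from the AGROVOC slide; the concept ID is hypothetical.
MULTILINGUAL_LEXICON = {
    "rice": "C:rice",
    "paddy": "C:rice",
    "riz": "C:rice",
    "arroz": "C:rice",
}


def concept_vector(tokens):
    """Represent a document only by the ontology concepts its tokens map to."""
    return Counter(
        MULTILINGUAL_LEXICON[t] for t in tokens if t in MULTILINGUAL_LEXICON
    )


# Hypothetical token lists for the same content in English and Spanish.
english_doc = concept_vector(["rice", "production", "rice"])
spanish_doc = concept_vector(["arroz", "producción", "arroz"])
```

Both token lists map to the same vector, so a classifier trained on the English vectors could, in principle, be applied to the Spanish ones. Words without a lexicon entry, such as "production" here, are lost, which is the performance question raised in the last bullets of the slide.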


Page 26: References

• More on automatic classification: http://www.aifb.uni-karlsruhe.de/WBS/aho/
• More on knowledge management: http://www.fzi.de/wim/index.html
• More on ontologies and ontology engineering: http://kaon.semanticweb.org
• More on FAO
  – AGROVOC online: http://www.fao.org/agrovoc
  – Waicent Portal: http://www.fao.org/waicent/index_en.asp