
Building a Multilingual Lexicon

by Robert Baud

SemanticMining WP20

Freiburg, 29 March 2004


2 Natural Language Processing Group

Building a multilingual lexicon

Should we start from a model of medicine, or from a pragmatic observation of languages?

What representation of knowledge should be added to a lexicon? The question is what makes a lexicon multilingual.

From signals to understanding: the different levels of granularity of language information.

Defining the Lexicon Ontology (LO) in order to start on a sound basis.


Modeling or not?

In the last decade, the idea of a model of medicine was prevalent, as in Snomed, Galen, UMLS, etc.

NLP was necessary as a way to help communicate the content of the model.

The principle of guidance by the model was accepted.

But a general model of medicine is far from being a reality, and this will certainly remain true for a few decades.

Therefore, it is not a good idea to base NLP on the existence of a model.

Make NLP free from any model!


Modeling the medical domain

Light model (figure). Node labels: top, object, event, process, surgery, trauma, disease, path.normal, arm, hand, foot, finger, palm. Link types: has_parent, has_child, linked_to.


Local model

Words are at different levels of detail:
    burn of the finger and burn of the thumb
    digestive disorder and post-prandial disorder
    vertebra and atlas

Attributes or properties are generalized to classes of concepts.

Local inference between close levels in a hierarchy of concepts is necessary before chunking information.
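
Local inference between close levels can be illustrated with a minimal Python sketch. The `PARENT` table, the function names and the `max_steps` parameter are hypothetical, and the child-to-parent links are an assumed reading of the word pairs above; this is not the model used by the authors.

```python
# Hypothetical child -> parent links, limited to the word pairs above.
PARENT = {
    "thumb": "finger",
    "post-prandial disorder": "digestive disorder",
    "atlas": "vertebra",
}


def ancestors(concept, max_steps):
    """Return the concepts reached by climbing at most max_steps links."""
    found = []
    while max_steps > 0 and concept in PARENT:
        concept = PARENT[concept]
        found.append(concept)
        max_steps -= 1
    return found


def locally_subsumed(specific, general, max_steps=1):
    """True if `general` is `specific` itself or lies within max_steps levels above it."""
    return specific == general or general in ancestors(specific, max_steps)


# "burn of the thumb" may be grouped with "burn of the finger", not the reverse.
print(locally_subsumed("thumb", "finger"))
print(locally_subsumed("finger", "thumb"))
```

Keeping `max_steps` small is what makes the inference local: only neighbouring levels of the hierarchy are compared.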


Semantic lexicons

A semantic lexicon is a lexicon with attachments to existing terminologies and ontologies.

But what do we attach to what, and how?
    Grouping of words representing the same object?
    What are the semantics of this association?
    What about multilingual aspects?

Problem of coherence of multiple attachments


From words to objects

Lexical representation (lemma level, then Abstract Lexical Identifier):

French          English         German          Abstract Lexical Identifier
« paupière »    « eyelid »      « Augenlid »    _eyelid
« bléphar »     « blephar »     « blephar »     _blephar
« blépharo »    « blepharo »    « blepharo »    _blepharo
« palpébral »   « palpebral »   « ? »           _palpebral

Ontological representation (ontological level, Universal Object Identifier): cl_Eyelid

Linked resources: Galen, UMLS, Semantic net, MeSH, Snomed, ICD10, other


Dealing with proximity of words

Lexical representation (lemma level, then Abstract Lexical Identifier) and ontological representation (Universal Object Identifier):

French               English            German            Abstract Lexical Identifier   Universal Object Identifier
« corps »            « body »           « Körper »        _BodyAsWhole                  cl_Body
« corps »            « body »           « Körper »        _BodyAsTrunck                 cl_Trunck
« corps »            « body »           « Körper »        _BodyAsDeadPerson             cl_DeadBody
« corps étranger »   « foreign body »   « Fremdkörper »   _BodyAsForeign                cl_ForeignBody

Linked resources: MeSH, Semantic net, etc.
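
The two-step lookup suggested by this slide, from a lemma to its candidate lexical identifiers and from each identifier to an object, might be sketched as follows in Python. The identifiers are those shown on the slide; the dictionary layout, the language codes and the function name `objects_for` are assumptions made for illustration only.

```python
# (language, lemma) -> candidate Abstract Lexical Identifiers (LIDs).
# Identifiers come from the slide; the data layout is an assumption.
BODY_SENSES = ["_BodyAsWhole", "_BodyAsTrunck", "_BodyAsDeadPerson"]
LEMMA_TO_LIDS = {
    ("fr", "corps"): BODY_SENSES,
    ("en", "body"): BODY_SENSES,
    ("de", "Körper"): BODY_SENSES,
    ("fr", "corps étranger"): ["_BodyAsForeign"],
    ("en", "foreign body"): ["_BodyAsForeign"],
    ("de", "Fremdkörper"): ["_BodyAsForeign"],
}

# LID -> Universal Object Identifier (CID).
LID_TO_CID = {
    "_BodyAsWhole": "cl_Body",
    "_BodyAsTrunck": "cl_Trunck",
    "_BodyAsDeadPerson": "cl_DeadBody",
    "_BodyAsForeign": "cl_ForeignBody",
}


def objects_for(language, lemma):
    """Return every object identifier a lemma may denote (empty if unknown)."""
    return [LID_TO_CID[lid] for lid in LEMMA_TO_LIDS.get((language, lemma), [])]


print(objects_for("en", "body"))          # three candidates: the word is ambiguous
print(objects_for("de", "Fremdkörper"))   # one candidate: the term is not
```

Listing the expression "foreign body" as its own entry keeps it from being read as an ordinary composition of "foreign" and "body".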


From signals to understanding

utterances

lexicon entries

language words

abstract lexical identifier

universal object identifier

object

link between objects


Utterances

A speech, a sentence, a sign, a signal, generally issued by a human being

An expression of something to be communicated

Well-formed or ill-formed

It is difficult to delimit what is a unit of communication or a kind of atomic message

Utterances are expected to be converted to written sentences for subsequent processing.


Lexicon entries

All three kinds of lexicon entries point to well-defined objects of the world:

Single-word entries, without blank characters, not decomposable

Word components or morphosemantems, which are the parts obtained by decomposing compound words

Expressions or short terms, made of 2 to 5 words, representing single objects, like idiomatic expressions and language idiosyncrasies, which cannot be represented by ordinary composition of their parts.


Language words

In most natural languages, words present morphological variations, which have to be resolved

Rule-based systems are able to solve this problem

Given a sentence, a lemmatizer is a program that produces the list of the lemmas of all its words, that is, their basic forms: generally singular, masculine, nominative and infinitive, whichever applies.

A multilingual lexicon should include the definitions of the rules and should flag the regular words
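
As a rough illustration of rule-based lemmatization with a regular/irregular flag, here is a toy Python sketch for English plural nouns. The suffix rules, the `IRREGULAR` table and the function names are invented for this example and are far cruder than what a real lexicon would store.

```python
# Hypothetical suffix rules: (suffix to strip, replacement), tried in order.
RULES = [("ies", "y"), ("ses", "s"), ("s", "")]

# Irregular forms are listed explicitly; all other words are flagged regular.
IRREGULAR = {"vertebrae": "vertebra", "feet": "foot"}


def lemmatize_word(word):
    """Return (lemma, is_regular) for one lower-cased word."""
    if word in IRREGULAR:
        return IRREGULAR[word], False
    if word.endswith("ss"):  # e.g. "abscess" is already singular
        return word, True
    for suffix, replacement in RULES:
        # Require a stem of at least two letters so that "is" is left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: len(word) - len(suffix)] + replacement, True
    return word, True


def lemmatize(sentence):
    """Return the list of lemmas of all words of a sentence."""
    return [lemmatize_word(word)[0] for word in sentence.lower().split()]


print(lemmatize("Burns of the fingers and feet"))
```

Words that the rules handle need only a flag in the lexicon, while irregular forms such as "feet" must be stored individually.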


Abstract lexical identifier (LID)

The same word generally exists in different languages

The same word may have different lemmas in a given language

The information about these facts has to be explicitly collected

The recipient of the collection of all forms is called an abstract lexical identifier

It is represented by a unique set of characters, based on the English lemma, with extension when necessary.


Universal object identifier (CID)

Physical objects and abstract objects are parts of the world

A unique object identifier has to be defined for the representation of each object of the domain under scrutiny

One and only one link has to be defined between an abstract lexical identifier and an object identifier

Multiple links may converge to the same object identifier.
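
The two constraints above (exactly one object identifier per lexical identifier, while several lexical identifiers may share one object identifier) could be enforced as in the following Python sketch. The `Lexicon` class and its methods are hypothetical, and linking all four eyelid identifiers to `cl_Eyelid` is an assumed reading of the eyelid diagram.

```python
class Lexicon:
    """Toy store allowing one object identifier (CID) per lexical identifier (LID)."""

    def __init__(self):
        self._cid_of = {}  # LID -> CID

    def link(self, lid, cid):
        """Attach lid to cid; refuse a second, different link for the same lid."""
        existing = self._cid_of.get(lid)
        if existing is not None and existing != cid:
            raise ValueError(f"{lid} is already linked to {existing}")
        self._cid_of[lid] = cid

    def cid(self, lid):
        """Return the single object identifier linked to lid."""
        return self._cid_of[lid]

    def lids(self, cid):
        """Return all lexical identifiers converging on cid, sorted."""
        return sorted(lid for lid, c in self._cid_of.items() if c == cid)


lexicon = Lexicon()
for lid in ("_eyelid", "_blephar", "_blepharo", "_palpebral"):
    lexicon.link(lid, "cl_Eyelid")

print(lexicon.cid("_blephar"))
print(lexicon.lids("cl_Eyelid"))
```

A plain dictionary keyed by LID gives the "one and only one" property almost for free; the explicit check only guards against silently overwriting an earlier link.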


Abdomen and its context


Hypertension and its context


Insect and its context


Abandonment and its context


Abscess and its context


Fœtus and its context


Actual implementation


The Lexicon Ontology (LO)

Answers the need for a formal definition of all objects involved in building a multilingual lexicon

Based on sound recommendations regarding modern ontologies

Ensures proper communication of the design between the implementers and the users

Frame-based implementation using Protégé

May be used for a knowledge-driven implementation of the lexicon.


LO Implementation


PermanentObject

The Lexicon Ontology, Sunday, 28 March 2004

LexiconObject: hasAuthor String, hasDate DateTime, hasSession String

Subclasses of LexiconObject:
    PermanentObject: is endurant, is independent
    DependantObject: is endurant, is dependant
    OccurentObject: is occurrent

Classes under PermanentObject:
    LexiconEntry: hasWID Integer, hasLID LID, hasLg Language, IsValid YesNo, IsReg YesNo, hasType Type
    BasicEntry: hasGender Gender, hasNumber Number
    CompoundEntry: hasParts Collection
    TermEntry: hasWords Collection, hasHead BasicEntry
    Attribute

Annotations on the diagram:
    Defines range of values
    Holds LexicalFunction
    Asserts existence of objects
    Asserts attribute values
    Is sequence of entries
    Represents entity of the domain
    Is made of BasicEntry
    Has composed significance
    Has role of one of its parts
    Is not decomposable
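
The slots shown on this slide can be mirrored with Python dataclasses, as in the sketch below. It is only an approximation of the frame-based design: the Python types standing in for LID, Language, YesNo, Type, Gender and Number are simplifications, the default values are invented, placing LexiconEntry under PermanentObject is an assumed reading of the diagram, and the `_foreign` identifier and the sample metadata are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class LexiconObject:
    hasAuthor: str
    hasDate: datetime
    hasSession: str


@dataclass
class PermanentObject(LexiconObject):
    pass


@dataclass
class LexiconEntry(PermanentObject):
    hasWID: int
    hasLID: str            # LID on the slide, simplified to a string
    hasLg: str             # Language on the slide
    IsValid: bool = True   # YesNo on the slide
    IsReg: bool = True     # YesNo on the slide
    hasType: str = ""      # Type on the slide


@dataclass
class BasicEntry(LexiconEntry):
    hasGender: str = ""
    hasNumber: str = ""


@dataclass
class CompoundEntry(LexiconEntry):
    hasParts: List[BasicEntry] = field(default_factory=list)


@dataclass
class TermEntry(LexiconEntry):
    hasWords: List[BasicEntry] = field(default_factory=list)
    hasHead: Optional[BasicEntry] = None


# Sample data: the term "foreign body" with "body" as its head word.
meta = dict(hasAuthor="demo", hasDate=datetime(2004, 3, 28), hasSession="s1")
foreign = BasicEntry(hasWID=1, hasLID="_foreign", hasLg="en", **meta)
body = BasicEntry(hasWID=2, hasLID="_BodyAsWhole", hasLg="en", **meta)
term = TermEntry(hasWID=3, hasLID="_BodyAsForeign", hasLg="en",
                 hasWords=[foreign, body], hasHead=body, **meta)
print(term.hasHead.hasLID)
```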


Dependant Objects

The Lexicon Ontology

LexiconObject: hasAuthor String, hasDate DateTime, hasSession String

Subclasses of LexiconObject:
    PermanentObject: is endurant, is independent
    DependantObject: is endurant, is dependant
    OccurentObject: is occurrent

Classes under DependantObject: Collection, Role, LexicalRole, SyntacticRole, Recipient, LexicalPot, ConceptPot, Rule, MorphoRule, VerbRule

Annotations on the diagram:
    Specifies the lexicon functions
    Associates LexiconObjects
    Is collection of LexiconEntry
    Specifies the syntactic categories
    Specifies the lexical categories
    Is collection of LID
    Provides unique labels
    Defines applicable rules
    Is rule for morphology
    Is rule for verbs


FullWord

The Lexicon Ontology, BasicEntry: FullWord

BasicEntry: hasGender Gender, hasNumber Number
    FullWord: is used stand alone
    PartWord: is a component of CompoundEntry, is not stand alone

Classes under FullWord:
    SignificantWord: asserts existence of objects, asserts attribute value
    StopWord: hasPOS POSValue; is glue for sentences
    ProperWord: is label for unique named objects
    ReferenceWord: hasRef LexiconEntry; is reference to another LexiconEntry
    NounWord: represents an object or its attributes
    AdjectiveWord: qualifies or modifies a NounWord
    VerbWord: hasPerson Person, hasTime VerbTime, hasMode VerbMode, hasGroup VerbGroup; represents actions, states or processes


PartWord

The Lexicon Ontology, BasicEntry: PartWord

BasicEntry: hasGender Gender, hasNumber Number
    FullWord: is used stand alone
    PartWord: is a component of CompoundEntry, is not stand alone

Classes under PartWord:
    PrefixWord: is initial part
    SuffixWord: is final part
    RealPrefix: in place of SignificantWord
    ModalPrefix: is modifier of BasicEntry
    NounSuffix: has NounRole
    AdjectiveSuffix: has AdjectiveRole


Definition by genus and differentia

Definitions are composed automatically by the schema of inheritance through the isa links

A Noun is a LexiconObject which:
    represents a physical or abstract object or any of their attributes,
    is a building block of a sentence,
    is used stand alone in a text,
    is an undecomposable atom,
    is an object embodied in the construction of a multilingual lexicon of the medical domain,
    is necessary for processing of written medical text.
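
How a definition can be composed by following isa links may be sketched in a few lines of Python. The chain below is deliberately shortened, and attaching each differentia to a particular class is an assumption made for illustration; the slide does not say which class contributes which clause.

```python
# Shortened, assumed isa chain: class -> parent class.
ISA = {
    "NounWord": "FullWord",
    "FullWord": "BasicEntry",
    "BasicEntry": "LexiconObject",
}

# Assumed differentia contributed by each class.
DIFFERENTIA = {
    "NounWord": "represents a physical or abstract object or any of their attributes",
    "FullWord": "is used stand alone in a text",
    "BasicEntry": "is an undecomposable atom",
}


def define(cls):
    """Compose 'A <cls> is a <root> which: ...' by climbing the isa links."""
    differentiae = []
    current = cls
    while current in ISA:
        if current in DIFFERENTIA:
            differentiae.append(DIFFERENTIA[current])
        current = ISA[current]
    # `current` is now the root of the chain, used as the genus.
    return f"A {cls} is a {current} which: " + ", ".join(differentiae) + "."


print(define("NounWord"))
```

Each class states only what distinguishes it from its parent, so adding a class to the hierarchy automatically yields its full definition.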


Available resources

Multilingual lexicon:
    French: > 35000
    English: > 138000
    German: > 23000
    Latin: > 6500 (+ 9000)
    Proper names: > 3000

Tools (achievement may depend on the language):
    Word decomposition
    Tokenizer
    Error correction
    Several utilities: Semantic Net, MeSH, TA, etc.

Web server for lexicon access


Recommendations

Define the lexicon on a strong formal basis

Make explicit the multilingual aspects

Take care of inflectional morphology

Favour the proper treatment of compound words

Be open to the evolution of languages and the arrival of other European languages

Make available links to well-known terminologies and ontologies


Thank you for your attention

[email protected]