Building a Multilingual Lexicon
by Robert Baud
SemanticMining WP20
Freiburg, 29 March 2004
2 Natural Language Processing Group
Building a multilingual lexicon
Start from a model of medicine, or from a pragmatic observation of languages?
What representation of knowledge should be added to a lexicon? The question is what makes a lexicon multilingual.
From signals to understanding: the different levels of granularity of language information.
Defining the Lexicon Ontology (LO) in order to start on a sound basis.
Modeling or not?
In the last decade, the idea of a model of medicine was prevalent, as in Snomed, Galen, UMLS, etc.
NLP was necessary as a way to help communicate the content of the model.
The principle of guidance by the model was accepted.
But a general model of medicine is far from being a reality, and this will certainly remain true for a few decades.
Therefore, it is not a good idea to base the NLP on the existence of a model.
Make the NLP free from any model!
Modeling the medical domain
[Figure: light model. Concepts are connected by has_parent, has_child and linked_to relations. Labels shown: top, object, process, event, surgery, path.normal, trauma, disease, arm, hand, finger, palm, foot.]
Local model
Words are at different levels of detail:
  burn of the finger and burn of the thumb
  digestive disorder and post-prandial disorder
  vertebra and atlas
Attributes or properties are generalized to classes of concepts.
Local inference between close levels in a hierarchy of concepts is necessary before chunking information.
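A minimal sketch of what local inference between close levels could look like, assuming a simple has_parent table. The table entries, the function names and the max_steps limit are hypothetical illustrations, not part of the lexicon described in these slides.

```python
# Hypothetical has_parent links; illustrative only, not taken from a real terminology.
HAS_PARENT = {
    "thumb": "finger",
    "finger": "hand",
    "atlas": "vertebra",
    "post-prandial disorder": "digestive disorder",
}


def ancestors(concept, max_steps):
    """Return the ancestors reachable within max_steps has_parent links, nearest first."""
    found = []
    current = concept
    for _ in range(max_steps):
        current = HAS_PARENT.get(current)
        if current is None:
            break
        found.append(current)
    return found


def is_locally_subsumed(specific, general, max_steps=2):
    """True if `general` is `specific` itself or one of its close ancestors."""
    return specific == general or general in ancestors(specific, max_steps)
```

With this table, "burn of the thumb" could be matched against "burn of the finger" because is_locally_subsumed("thumb", "finger") holds, while the max_steps limit keeps the inference local.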
Semantic lexicons
A semantic lexicon is a lexicon with attachments to existing terminologies and ontologies.
But what do we attach to what, and how?
  Grouping of words representing the same object?
  What are the semantics of this association?
  What about multilingual aspects?
Problem of coherence of multiple attachments
From words to objects
[Figure: lexical representation on one side, ontological representation on the other]
Lemma level (French / English / German) -> Abstract Lexical Identifier:
  « paupière » / « eyelid » / « Augenlid » -> _eyelid
  « bléphar » / « blephar » / « blephar » -> _blephar
  « blépharo » / « blepharo » / « blepharo » -> _blepharo
  « palpébral » / « palpebral » / « ? » -> _palpebral
Ontological level, Universal Object Identifier: cl_Eyelid
Linked resources: Galen, UMLS, Semantic net, MeSH, Snomed, ICD10, other
Dealing with proximity of words
[Figure: lexical representation on one side, ontological representation on the other]
Lemma level (French / English / German) -> Abstract Lexical Identifier -> Universal Object Identifier:
  « corps » / « body » / « Körper » -> _BodyAsWhole -> cl_Body
  « corps » / « body » / « Körper » -> _BodyAsTrunck -> cl_Trunck
  « corps » / « body » / « Körper » -> _BodyAsDeadPerson -> cl_DeadBody
  « corps étranger » / « foreign body » / « Fremdkörper » -> _BodyAsForeign -> cl_ForeignBody
Linked resources: MeSH, Semantic net, etc.
From signals to understanding
utterances
lexicon entries
language words
abstract lexical identifier
universal object identifier
object
link between objects
Utterances
A speech, a sentence, a sign, a signal, generally issued by a human being.
An expression of something to be communicated.
Well-formed or ill-formed.
It is difficult to delimit what constitutes a unit of communication, or a kind of atomic message.
Utterances are expected to be converted to written sentences for subsequent processing.
Lexicon entries
All three kinds of lexicon entries point to well-defined objects of the world:
Single-word entries, without a blank character, not decomposable.
Word components, or morphosemantems, which are the parts obtained by decomposing compound words.
Expressions or short terms, made of 2 to 5 words, representing single objects, like idiomatic expressions and language idiosyncrasies, which cannot be represented by ordinary composition of their parts.
Language words
In most natural languages, words present morphological variations, which have to be resolved.
Rule-based systems are able to solve this problem.
A lemmatizer is a program that, from a sentence, produces the list of the lemmas of all its words, each in its basic form: generally singular, masculine, nominative or infinitive, whichever applies.
A multilingual lexicon should include the definitions of the rules and should flag the regular words.
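As an illustration of the rule-based approach, here is a minimal sketch of a lemmatizer. It assumes a few hypothetical English suffix rules, a toy lexicon whose entries carry a "regular" flag, and an explicit table of irregular forms. None of these rules, names or entries come from the actual lexicon.

```python
# Hypothetical suffix rules for English nouns: (inflected ending, lemma ending).
# Ordered from most to least specific.
SUFFIX_RULES = [("ies", "y"), ("sses", "ss"), ("s", "")]

# Hypothetical lexicon: lemma -> flag telling whether the word inflects regularly.
LEXICON = {"artery": True, "abscess": True, "finger": True, "vertebra": False}

# Irregular forms cannot be derived by rule, so they are listed explicitly.
IRREGULAR_FORMS = {"vertebrae": "vertebra"}


def lemmatize_word(word):
    """Return the lemma of a single word, or the word itself if it is unknown."""
    word = word.lower()
    if word in LEXICON:
        return word
    if word in IRREGULAR_FORMS:
        return IRREGULAR_FORMS[word]
    for ending, replacement in SUFFIX_RULES:
        if word.endswith(ending):
            candidate = word[: len(word) - len(ending)] + replacement
            # Only lemmas flagged as regular accept rule-derived forms.
            if LEXICON.get(candidate):
                return candidate
    return word


def lemmatize(sentence):
    """Split on whitespace and lemmatize each token."""
    return [lemmatize_word(token) for token in sentence.split()]
```

The regular flag is what lets the rules stay small: a form such as "vertebras" is not reduced to "vertebra", because that lemma is marked as irregular and only its listed forms are accepted.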
Abstract lexical identifier (LID)
The same word generally exists in different languages.
The same word may have different lemmas in a given language.
The information about these facts has to be explicitly collected.
The container for the collection of all forms is called an abstract lexical identifier.
It is represented by a unique string of characters, based on the English lemma, with an extension when necessary.
Universal object identifier (CID)
Physical objects and abstract objects are parts of the world.
A unique object identifier has to be defined for the representation of each object of the domain under scrutiny.
One and only one link has to be defined between an abstract lexical identifier and an object identifier.
Multiple links may converge to the same object identifier.
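The two identifier levels can be sketched as a pair of mappings. This is a minimal sketch under stated assumptions: the Lexicon class and its method names are hypothetical, and the identifiers in the usage note reuse labels from the eyelid figure, with the link from _blephar to cl_Eyelid read off that figure rather than stated in the text.

```python
class Lexicon:
    """Toy two-level index: (language, lemma) -> LIDs, and LID -> exactly one CID."""

    def __init__(self):
        self.lids_of = {}  # (language, lemma) -> set of LIDs (a word may be ambiguous)
        self.cid_of = {}   # LID -> CID

    def add_lemma(self, language, lemma, lid):
        self.lids_of.setdefault((language, lemma), set()).add(lid)

    def link(self, lid, cid):
        """Attach a LID to a CID, refusing a second, different link for the same LID."""
        existing = self.cid_of.get(lid)
        if existing is not None and existing != cid:
            raise ValueError(f"{lid} is already linked to {existing}")
        self.cid_of[lid] = cid

    def objects_for(self, language, lemma):
        """Return the sorted CIDs reachable from a lemma in a given language."""
        lids = self.lids_of.get((language, lemma), set())
        return sorted({self.cid_of[lid] for lid in lids if lid in self.cid_of})

    def lids_for(self, cid):
        """Return the sorted LIDs converging on one CID."""
        return sorted(lid for lid, linked in self.cid_of.items() if linked == cid)
```

For example, after registering « paupière » and « eyelid » under _eyelid and linking both _eyelid and _blephar to cl_Eyelid, objects_for("fr", "paupière") returns the single object cl_Eyelid, and lids_for("cl_Eyelid") lists both identifiers. A second link from _eyelid to a different object is rejected.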
Abdomen and its context
Hypertension and its context
Insect and its context
Abandonment and its context
Abscess and its context
Fœtus and its context
Actual implementation
The Lexicon Ontology (LO)
Answers the need for a formal definition of all objects involved in building a multilingual lexicon.
Based on sound recommendations regarding modern ontologies.
Ensures proper communication of the design between the implementers and the users.
Frame-based implementation using Protégé.
May be used for a knowledge-driven implementation of the lexicon.
LO Implementation
PermanentObject
[Figure: The Lexicon Ontology, PermanentObject branch, dated Sunday, 28 March 2004]
LexiconObject (hasAuthor: String, hasDate: DateTime, hasSession: String)
  PermanentObject: is endurant, is independent
    LexiconEntry (hasWID: Integer, hasLID: LID, hasLg: Language, IsValid: YesNo, IsReg: YesNo, hasType: Type)
      BasicEntry (hasGender: Gender, hasNumber: Number): is not decomposable
      CompoundEntry (hasParts: Collection): is made of BasicEntry, has composed significance, has the role of one of its parts
      TermEntry (hasWords: Collection, hasHead: BasicEntry): is a sequence of entries, represents an entity of the domain
    Attribute
  DependantObject: is endurant, is dependent
  OccurentObject: is occurrent
Other notes in the figure: defines range of values; holds LexicalFunction; asserts existence of objects; asserts attribute values.
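The entry classes named above can be sketched with Python dataclasses. This is only an illustration, not the Protégé frames themselves: the attribute names are hypothetical translations of the slots (hasWID, hasLID, hasLg, IsValid, IsReg, hasGender, hasNumber, hasParts, hasWords, hasHead), the hasType slot is left out, and the WID values and the _foreign identifier in the example are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexiconEntry:
    wid: int                   # hasWID
    lid: str                   # hasLID
    language: str              # hasLg
    is_valid: bool = True      # IsValid
    is_regular: bool = True    # IsReg


@dataclass
class BasicEntry(LexiconEntry):
    """A single word that is not decomposable."""
    gender: Optional[str] = None   # hasGender
    number: Optional[str] = None   # hasNumber


@dataclass
class CompoundEntry(LexiconEntry):
    """A compound word made of BasicEntry parts."""
    parts: List[BasicEntry] = field(default_factory=list)   # hasParts


@dataclass
class TermEntry(LexiconEntry):
    """A multi-word term with a head word."""
    words: List[BasicEntry] = field(default_factory=list)   # hasWords
    head: Optional[BasicEntry] = None                        # hasHead


# Example: the French term « corps étranger », whose head is « corps ».
corps = BasicEntry(wid=1, lid="_BodyAsWhole", language="fr",
                   gender="masculine", number="singular")
etranger = BasicEntry(wid=2, lid="_foreign", language="fr")
foreign_body = TermEntry(wid=3, lid="_BodyAsForeign", language="fr",
                         words=[corps, etranger], head=corps)
```

Keeping the term as its own entry with its own LID reflects the point that such expressions cannot be represented by ordinary composition of their parts.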
Dependant Objects
[Figure: The Lexicon Ontology, DependantObject branch]
LexiconObject (hasAuthor: String, hasDate: DateTime, hasSession: String)
  PermanentObject: is endurant, is independent
  DependantObject: is endurant, is dependent
  OccurentObject: is occurrent
Classes shown under DependantObject: Collection, LexicalPot, ConceptPot, Role, LexicalRole, SyntacticRole, Recipient, Rule, MorphoRule, VerbRule
  SyntacticRole: specifies the syntactic categories
  LexicalRole: specifies the lexical categories
  Rule: defines applicable rules
  MorphoRule: is rule for morphology
  VerbRule: is rule for verbs
Other notes in the figure: specifies the lexicon functions; associates LexiconObjects; is collection of LexiconEntry; is collection of LID; provides unique labels.
FullWord
[Figure: The Lexicon Ontology, BasicEntry: FullWord]
BasicEntry (hasGender: Gender, hasNumber: Number)
  FullWord: is used stand alone
  PartWord: is component of CompoundEntry, is not stand alone
Classes shown under FullWord:
  SignificantWord: asserts existence of objects, asserts attribute value
  StopWord (hasPOS: POSValue): is glue for sentences
  ProperWord: is label for unique named objects
  ReferenceWord (hasRef: LexiconEntry): is reference to another LexiconEntry
  NounWord: represents an object or its attributes
  AdjectiveWord: qualifies or modifies a NounWord
  VerbWord (hasPerson: Person, hasTime: VerbTime, hasMode: VerbMode, hasGroup: VerbGroup): represents actions, states or processes
PartWord
[Figure: The Lexicon Ontology, BasicEntry: PartWord]
BasicEntry (hasGender: Gender, hasNumber: Number)
  FullWord: is used stand alone
  PartWord: is component of CompoundEntry, is not stand alone
    PrefixWord: is initial part
      RealPrefix: in place of SignificantWord
      ModalPrefix: is modifier of BasicEntry
    SuffixWord: is final part
      NounSuffix: has NounRole
      AdjectiveSuffix: has AdjectiveRole
Definition by genus and differentia
Definitions are composed automatically by the schema of inheritance through the isa links.
A Noun is a LexiconObject which:
  represents a physical or abstract object or any of their attributes,
  is a building block of a sentence,
  is used stand alone in a text,
  is an undecomposable atom,
  is an object embodied in the construction of a multilingual lexicon of the medical domain,
  is necessary for processing of written medical text.
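Composing such a definition through isa links could look like the following minimal sketch. The isa chain is deliberately shortened, and the assignment of each clause to a class is an assumption made for illustration, not the actual content of the Lexicon Ontology. The chain is also assumed to be free of cycles.

```python
# Hypothetical isa links (child -> parent), shortened for the example.
ISA = {
    "NounWord": "FullWord",
    "FullWord": "BasicEntry",
    "BasicEntry": "LexiconObject",
}

# Hypothetical differentia attached to each class.
DIFFERENTIA = {
    "NounWord": "represents a physical or abstract object or any of their attributes",
    "FullWord": "is used stand alone in a text",
    "BasicEntry": "is an undecomposable atom",
}


def compose_definition(name):
    """Walk up the isa links, collecting one clause per class, most specific first."""
    clauses = []
    current = name
    while current in ISA:
        if current in DIFFERENTIA:
            clauses.append(DIFFERENTIA[current])
        current = ISA[current]
    if not clauses:
        return f"{name} is a root class."
    return f"A {name} is a {current} which: " + ", ".join(clauses) + "."
```

Each class states only what distinguishes it from its parent. The full definition of NounWord is then obtained by inheritance, so a change to the clause of BasicEntry propagates to every definition below it.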
Available resources
Multilingual lexicon:
  French: > 35000
  English: > 138000
  German: > 23000
  Latin: > 6500 (+ 9000)
  Proper names: > 3000
Tools (level of completion may depend on the language):
  Word decomposition
  Tokenizer
  Error correction
  Several utilities: Semantic Net, MeSH, TA, etc.
Web server for lexicon access
Recommendations
Define the lexicon on a strong formal basis.
Make the multilingual aspects explicit.
Take care of inflectional morphology.
Favour the proper treatment of compound words.
Be open to the evolution of languages and the arrival of other European languages.
Make available links to well-known terminologies and ontologies.
Thank you for your attention