Slides presented at Terminology and Knowledge Engineering (TKE) 2010.
Generating Lexical Information for Terminology in a Bioinformatics Ontology
Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2), Tobias Wunner (1)
(1) Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
(2) Semantic Computing Group, Center of Excellence (CITEC), Bielefeld University, Bielefeld, Germany
(3) Department of Computer Science, College of Telecommunication Engineering, National University of Sciences and Technology, Pakistan
Motivation
• Lack of linguistic expressiveness in formally specified ontologies
  – Typically developed to provide a shared view of a domain’s knowledge.
  – Do not necessarily support natural language processing (NLP) tasks.
• Solutions:
  – Terminologies that include linguistic information facilitate using ontologies for text processing, e.g. the Specialist Lexicon contains lexical variants of many terms that are used in the biomedical domain.
  – The Simple Knowledge Organization System (SKOS) format provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF).
• Limitations:
  – SKOS provides a data model for representing classification schemes such as thesauri by introducing a further typology of labels (preferred, alternative, hidden, etc.); it is not intended to associate more sophisticated lexical and linguistic information with an arbitrary ontology.
Desiderata for Ontology-Lexicon model
• Separation between linguistic and ontological level
  – Develop lexica independently of specific ontologies for the same domain
  – Allow different lexica for each ontology
• Independence between linguistic and ontological level
  – No mutual constraints
  – Ontological structures/concepts do not need to have a corresponding representation of linguistic structure, and vice versa
• Detailed information on linguistic realization
  – Part of speech, morphology (inflection, decomposition), syntactic structure (subcategorization frames), etc.
• Support for multilinguality
Towards our approach: LexInfo
Recent principled approaches to associating linguistic information with an arbitrary ontology:
LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al. 2006]
LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]
Lexical Markup Framework (LMF): ISO standardized model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]
LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]
Case Study: Lexicalizing a bioinformatics ontology
• Creating a LexInfo-based lexicon for the lexical enrichment of a bioinformatics ontology, i.e. the myGrid ontology (Wolstencroft et al., 2007).
• Lexical information is derived from semantic lexicons such as WordNet (Fellbaum, 1998) and from a domain-related corpus.
• Key points:
  – Capture of morpho-syntactic behaviour such as part of speech (POS), decomposition, lemmatization and subcategorization behaviour of lexical elements.
  – The lexicalized terms, along with their linguistic information, are added to the OWL-based lexicon based on the LexInfo model.
MyGrid Ontology
• Supports service description of bioinformatics resources through service annotation.
• Manual annotation is a slow process: e.g. in Taverna/Feta, only ~15-20% of services are functionally described, resulting in a growing backlog of unannotated services.
• Some NLP-based attempts to automate service description that use the myGrid ontology have been reported.
• Lexicalization of the myGrid ontology can improve the performance of such approaches.
LexInfo
• A principled way to enrich ontologies with linguistic information.
• Provides a framework for the automatic construction of 'lexicalized ontologies' on top of existing ontologies and lexical resources (Buitelaar et al., 2009).
• Main characteristics:
  – Two separate domains of discourse, kept apart by using different namespaces: the domain ontology and the LexInfo model.
  – The domain ontology defines the classes, properties and individuals in that domain.
  – The main entities in the lexical domain of discourse are instances of the class LexicalEntry.
  – LexInfo attaches lexical information (e.g. part of speech, morphology, subcategorization) to lexical entries.
Rest of the talk
• Methodology
  – Dual approach towards lexicalization of the myGrid ontology
  – Collection of a bioinformatics corpus
  – Lexicalization of class labels
  – Lexicalization of property labels
• Statistics, Experiments and Results
  – Semi-automatically created lexicon
  – Automatically generated lexicon
• What’s Next
Methodology - I: Dual approach towards lexicalization of the myGrid ontology
• Semi-automatically created LexInfo-based lexicon.
• Automatically created lexicon using the LexInfo ontology lexicalization service.
• Difference:
  – In the semi-automatically created lexicon, the linguistic information is mainly derived from the domain corpus and manually analyzed to verify correctness.
  – In automatic generation, a generic POS tagger and domain-independent lexical resources are used to derive morpho-syntactic behaviour from an automatic analysis of the labels of the concepts, properties and individuals in the ontology.
Methodology - II: Collection of Bioinformatics Corpus
• Domain-specific behaviour (linguistic information) of the lexical entries is derived from 2691 full-text journal articles from BMC Bioinformatics.
• The GeniaTagger is used to obtain POS information; the tags of interest are nouns, proper nouns, verbs and adjectives.
• Syntactic information is derived using the Stanford parser.
• Currently, we have worked only on the syntactic behaviour of properties (owl:ObjectProperty and owl:DatatypeProperty in particular) and not of classes.
Methodology - III: Lexicalization of Class Labels (step-wise approach)
1. A LexicalEntry is created for each Class in the domain ontology and is linked to that Class through the hasSense property.
2. The LexicalEntry is instantiated as one of its subclasses (e.g. Noun, Verb, Adjective).
3. The POS tag is derived from a semantic lexicon such as WordNet and further supported by the associated domain corpus.
4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical entry through the corresponding relation: hasLemma or hasWordForm.
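The four steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the LexInfo API: the field names mirror the LexInfo terms on the slide (LexicalEntry, hasSense, hasLemma), while the POS_LOOKUP table, the lower-casing used as a lemmatizer, the Noun default and the example URI are assumptions standing in for WordNet and the domain corpus.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the WordNet / domain-corpus lookup of step 3.
POS_LOOKUP = {"alignment": "Noun", "perform": "Verb", "multiple": "Adjective"}


@dataclass
class LexicalEntry:
    pos: str        # step 2: specialization (Noun, Verb, Adjective, ...)
    has_sense: str  # step 1: URI of the ontology class this entry denotes
    has_lemma: str  # step 4: lexical form attached to the entry


def lexicalize_class(class_uri: str, label: str) -> LexicalEntry:
    """Create one lexical entry for a single-word class label."""
    lemma = label.lower()  # naive lemmatization (assumption)
    pos = POS_LOOKUP.get(lemma, "Noun")  # falling back to Noun is an assumption
    return LexicalEntry(pos=pos, has_sense=class_uri, has_lemma=lemma)


# The URI below is a made-up example, not the real myGrid namespace.
entry = lexicalize_class("http://example.org/mygrid#Alignment", "Alignment")
```

Here `entry` is a Noun entry whose sense points at the Alignment class and whose lemma is "alignment".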
Methodology - III: Lexicalization of Class Labels (Single Word)
[Figure: linking a LexicalEntry with a domain Class, and attaching grammatical information and a lemma to the LexicalEntry.]
Methodology - III: Lexicalization of Class Labels (Multi-Word)
• LexInfo associates a LexicalEntry with a ListOfComponents, which holds an ordered list of Components and whose size is given as a DataProperty of the ListOfComponents.
• Each Component is linked with a LexicalEntry.
• A Component is accepted as a legitimate LexicalEntry if it is present in the myGrid ontology as a separate entity or occurs substantively in the domain corpus.
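The decomposition of a multi-word label can be sketched as below. This is an illustrative assumption rather than the LexInfo implementation: splitting on underscores, the two term sets and the `validated` flag are hypothetical choices. Only the ordered component list and the explicit size come from the slide.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Component:
    lemma: str       # lemma of the LexicalEntry this component is linked with
    validated: bool  # True if found in the ontology or in the domain corpus


@dataclass
class ListOfComponents:
    components: List[Component]  # order follows the class label
    size: int                    # stored explicitly, as on the slide


def decompose(label: str, ontology_terms: Set[str], corpus_terms: Set[str]) -> ListOfComponents:
    """Split a multi-word label and check each token against both sources."""
    tokens = [t.lower() for t in label.split("_") if t]
    components = [
        Component(lemma=t, validated=(t in ontology_terms or t in corpus_terms))
        for t in tokens
    ]
    return ListOfComponents(components=components, size=len(components))


# Toy term sets (assumptions); real ones would come from myGrid and the corpus.
decomposed = decompose(
    "Sequence_similarity_Search",
    ontology_terms={"sequence", "search"},
    corpus_terms={"similarity"},
)
```

Keeping a flag on unvalidated components, rather than dropping them, is a design choice of this sketch; the slides do not say how such components are handled.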
Methodology - III: Lexicalization of Class Labels (Multi-Word)
[Figure: an example of morphological decomposition of a multi-word class label from the myGrid ontology.]
Methodology - IV: Lexicalization of Property Labels (steps)
• Morphological decomposition and syntactic analysis of the property label are performed.
• The property labels are automatically tokenized, and the tokens are linked with LexicalEntries (as for classes).
• At the syntactic level, the tokens are analyzed to attach their syntactic behaviour, which is then linked with subcategorization frames.
• The LexInfo model provides various specializations of subcategorization frames, such as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.
• The syntactic arguments linked with the LexicalEntry (Subject, Object, PObject, etc.) are mapped to the semantic arguments (Domain, Range, RangeOfProperty) corresponding to the object property.
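The mapping from syntactic to semantic arguments can be illustrated with a small sketch, using the inverse property pair performs_task / task_performed_by from the myGrid ontology as an example. The data structure and the `transitive` helper are hypothetical. The direction chosen for each property (Subject to Domain for performs_task, Subject to Range for its inverse) is one plausible reading and is not taken from the LexInfo model itself.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SyntacticBehaviour:
    frame: str               # subcategorization frame, e.g. "Transitive"
    arg_map: Dict[str, str]  # syntactic argument -> semantic argument


def transitive(inverse: bool = False) -> SyntacticBehaviour:
    """Build a Transitive frame; an inverse property swaps Domain and Range."""
    if inverse:
        mapping = {"Subject": "Range", "Object": "Domain"}
    else:
        mapping = {"Subject": "Domain", "Object": "Range"}
    return SyntacticBehaviour(frame="Transitive", arg_map=mapping)


# Both properties share the verb "perform" but differ in the argument mapping.
performs_task = transitive()
task_performed_by = transitive(inverse=True)
```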
Methodology - IV: Lexicalization of Property Labels
• In automatic lexicon generation, the lexical entries are derived automatically by processing the labels in the ontology using the LILAC grammar.
• LILAC production rules state part-of-speech patterns that apply to the label. For example, a label with the structure “N Prep” gives rise to a lexicon entry of type “NounPP”.
• Currently, LexInfo uses 73 rules to generate lexicons automatically (further details on the LexInfo homepage).
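The idea of pattern-based rules can be sketched as a lookup from a POS pattern to a frame type. This does not reproduce LILAC syntax or its 73 rules: only the “N Prep” to NounPP rule is taken from the slide, and the second rule, the toy tag table and the decision to skip auxiliaries are assumptions for illustration.

```python
from typing import Dict, Optional, Tuple

# POS pattern -> lexicon entry type. Only ("N", "Prep") -> "NounPP" comes
# from the slide; the ("V",) rule is an illustrative assumption.
RULES: Dict[Tuple[str, ...], str] = {
    ("N", "Prep"): "NounPP",
    ("V",): "Transitive",
}

# Toy POS lookup standing in for a real tagger (hypothetical).
TOY_TAGS = {"is": "Aux", "part": "N", "of": "Prep", "produces": "V"}


def frame_for(label: str) -> Optional[str]:
    """Return the entry type whose POS pattern matches the label, if any."""
    tags = [TOY_TAGS.get(tok.lower()) for tok in label.split("_")]
    if None in tags:
        return None  # an untagged token means no rule can apply
    # Skipping auxiliaries such as "is" is an assumption of this sketch.
    pattern = tuple(t for t in tags if t != "Aux")
    return RULES.get(pattern)
```

For instance, `frame_for("is_part_of")` yields "NounPP" under these toy tags, and a label containing an unknown token yields `None`.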
Methodology - IV: Lexicalization of Property Labels
[Figure: lexicalization of the ObjectProperty produces.]
Statistics - I: Some statistics about the myGrid ontology

Ontology construct   Label type                           Occurrences   Total
Classes              Single-word class labels             88            475
                     Two-word class labels                200
                     Three-or-more-word class labels      187
ObjectProperties     Single-word property labels          1             8
                     Two-word property labels             4
                     Three-or-more-word property labels   3
DataProperties                                                          0
Individuals                                                             0
Statistics - II: Semi-automatically generated LexInfo-based lexicon of the myGrid ontology

LexInfo construct     Specialized construct   Example label                   Entries in ‘myGrid Lexicon’
LexicalEntries        Adjective               Multiple                        21
                      Noun                    Alignment                       752
                      Proper Noun             Medline                         253
                      Verb                    Perform                         4
                      NounPhrase              Sequence_similarity_Search      369
                      AdjectivePhrase         Tertiary_Structure_Prediction   16
                      VerbPhrase              Performs_task                   1
Written-Form                                                                  1044
List-of-Components                                                            387
Syntactic-Behaviour   Transitive              produces                        4
                      NounPP                  is_part_of                      4
Statistics - III: Automatically generated LexInfo-based lexicon of the myGrid ontology, created using the LexInfo lexicon generation service

LexInfo construct     Specialized construct   Example label                 Entries in ‘myGrid Lexicon’
LexicalEntry          Adjective               local                         131
                      Noun                    Record                        973
                      Proper Noun             Maize                         15
                      Verb                    Perform                       19
                      NounPhrase              Genotype-phenotype-database   1069
                      ProperNounPhrase        UniProt                       1
                      VerbPhrase                                            0
List-of-Components                                                          1071
Syntactic-Behaviour   Transitive              produces                      3
                      NounPP                  is_part_of                    4
                      IntransitivePP          produced_by                   1
Discussion: Semi-automatically created lexicon
Lexicalization of Classes
• Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun.
• Not many Verb occurrences.
  – Class labels are mostly named using nouns, whereas object properties are typically named using verbs.
  – The small number of ObjectProperties (8 properties) resulted in a small number of verbs in the lexicon.
• The number of Proper Nouns is 253, 32 of which are created from single-word Class names.
• 387 ListOfComponents are created from the 387 multi-word class names in the myGrid ontology; 371 of them correspond to NounPhrases and 16 to AdjectivePhrases.
Discussion: Semi-automatically created lexicon
Lexicalization of ObjectProperties
• is_identifier_of and is_part_of are lexicalized as Nouns (identifier and part).
  – SyntacticBehavior is linked to a subcategorization frame of type NounPP (Noun: identifier, Prep: of; Noun: part, Prep: of).
• performs_task and task_performed_by are lexicalized as the Verb perform.
  – SyntacticBehavior is linked to a subcategorization frame of type Transitive.
  – The two properties are inverses of each other and are lexicalized using the same verb; however, the mapping of syntactic arguments to domain and range is inverted between the two.
• produces and produced_by are lexicalized as the Verb produce.
• performs_task is recognized as a VerbPhrase, with performs as a Verb and a Transitive subcategorization frame linked with it.
• The syntactic behaviours of has_identifier and has_part are also modeled as NounPP.
Discussion: Automatically generated lexicon using the LexInfo service
Lexicalization of Classes (differences from the semi-automatically created lexicon)
• The number of Adjectives has increased significantly to 131, and the number of ProperNouns has decreased steeply to 15.
  – The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger (Stanford Tagger); e.g. DDBJ in DDBJ_Amino_Acid_Database and PIRSF in PIRSF_report are tagged as Adjectives.
  – This problem can be resolved by using domain corpora or by consulting a domain thesaurus or dictionary.
• The number of Verbs has increased to 19.
  – This is again due to a POS-tagging error: gerunds such as “manipulating” and “predicting” are incorrectly identified as Verbs.
• The identification of ProperNounPhrase is incorrect due to a tokenization error.
  – “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word, i.e. the name of a bioinformatics database.
  – This can also be resolved using a domain corpus or thesaurus.
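One way the proposed fix could look is a domain lexicon that takes precedence over the generic tagger. The sketch below is an assumption about how such an override might be wired up, not part of the LexInfo service: DOMAIN_LEXICON, `tag_label` and the deliberately naive `naive_tagger` (which mimics the reported error by tagging all-caps tokens as adjectives) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical domain dictionary; entries would come from a thesaurus or corpus.
DOMAIN_LEXICON: Dict[str, str] = {
    "ddbj": "ProperNoun",
    "pirsf": "ProperNoun",
    "uniprot": "ProperNoun",
}


def tag_label(label: str, tagger: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Tag each underscore-separated token, preferring the domain lexicon."""
    tagged = []
    for token in label.split("_"):
        # Look the whole token up first, so "UniProt" is never split further.
        pos = DOMAIN_LEXICON.get(token.lower()) or tagger(token)
        tagged.append((token, pos))
    return tagged


def naive_tagger(token: str) -> str:
    """Toy generic tagger that reproduces the all-caps-as-adjective error."""
    return "Adjective" if token.isupper() else "Noun"


tagged = tag_label("DDBJ_Amino_Acid_Database", naive_tagger)
```

With the override in place, DDBJ is tagged as a ProperNoun even though the toy tagger alone would call it an Adjective.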
Discussion: Automatically generated lexicon using the LexInfo service
Lexicalization of ObjectProperties
• ObjectProperties are mostly lexicalized correctly.
• The only error is in the lexicalization of “produced_by”, which is recognized as IntransitivePP.
  – This is caused by an error in the ontology lexicalization (LILAC) rules, which treat a past-participle verb followed by “by” as an occurrence of IntransitivePP.
Implementation
• Initial implementation of the LexInfo model as an API – Univ. Bielefeld; DERI, National Univ. of Ireland, Galway
  – https://lexinfo.googlecode.com/svn
Future Work
• Linguistically enriched ontology for improvement of service annotation
  – The linguistically enriched lexicon associated with the myGrid ontology can improve the performance of literature-based approaches for automatic annotation of bioinformatics web services.
• Optimization of the LexInfo model by including WordNet etc.
  – Generate all possible lexicalizations of given ontological constructs by utilizing synsets from WordNet, and extract semantically similar verbs from VerbNet and FrameNet.
• The LexInfo API is currently under development
  – Allows the creation, management and serialization of ontology lexica according to the LexInfo model.
  – An early prototype of a lexicon generation service based on the LexInfo model is also available at: http://code.google.com/p/lexinfo/
Acknowledgments
• Supported in part by the European Union under Grant No. 248458 for the Monnet project, as well as by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).
• Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable contributions in designing the LexInfo model and developing the LexInfo API.
References
• Afzal, H., Stevens, R. and Nenadic, G.: Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), LNCS 5554, Springer-Verlag: 535-549.
• Afzal, H., Stevens, R. and Nenadic, G.: Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008): 5-12.
• Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M., Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P.: LingInfo: Design and Applications of a Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a workshop at LREC, Genoa, Italy.
• Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M.: Towards Linguistically Grounded Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes in Computer Science, Springer 2009.
• Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P.: LexOnto: A model for ontology lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).
• Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C.: Lexical markup framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV
Resources Used
• BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
• Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
• Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
• TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/