Generating Lexical Information for Terminology in a Bioinformatics Ontology


Hammad Afzal (1,3), Paul Buitelaar (1), Philipp Cimiano (2), John McCrae (2), Tobias Wunner (1)

(1) Unit for Natural Language Processing, Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland

(2) Semantic Computing Group, Center of Excellence (CITEC), Bielefeld University, Bielefeld, Germany

(3) Department of Computer Science, College of Telecommunication Engineering, National University of Sciences and Technology, Pakistan

Motivation

Lack of linguistic expressiveness in formally specified ontologies
• Ontologies are typically developed to provide a shared view of a domain’s knowledge.
• They do not necessarily support natural language processing (NLP) tasks.

Solutions
• Terminologies can include linguistic information to facilitate using ontologies for text processing, e.g. the Specialist Lexicon contains lexical variants of many terms that are used in the biomedical domain.
• The Simple Knowledge Organization System (SKOS) format provides a standard way to represent knowledge organization systems using the Resource Description Framework (RDF).

Limitations
• SKOS provides a data model to represent classification schemes such as thesauri by introducing a further typology of labels (preferred, alternative, hidden, etc.). It is not intended to associate more sophisticated lexical and linguistic information with an arbitrary ontology.

Desiderata for Ontology-Lexicon model

Separation between linguistic and ontological level
• Develop lexica independently of specific ontologies for the same domain
• Allow different lexica for each ontology

Independence between linguistic and ontological level
• No mutual constraints
• Ontological structures/concepts do not need to have a corresponding representation of linguistic structure, and vice versa

Detailed information on linguistic realization
• Part of speech, morphology (inflection, decomposition), syntactic structure (sub-categorization frames), etc.

Support for multilinguality

Towards our approach: LexInfo

Recent principled approaches to associate linguistic information to an arbitrary ontology:

LingInfo: modeling morpho-syntactic decomposition of (complex) terms [Buitelaar et al. 2006]

LexOnto: capturing syntactic behaviour and syntax-semantics links [Cimiano et al. 2007]

Lexical Markup Framework (LMF): ISO standardized model for representing machine readable lexica (agnostic about connection with ontology) [Francopoulo et al. 2007]

LexInfo: building on LMF as a core, develop a model which “subsumes” LingInfo and LexOnto for flexibly associating linguistic information to ontologies [Buitelaar, Cimiano, Haase, Sintek 2009]

Creating a LexInfo-based lexicon for lexical enrichment of a bioinformatics ontology, i.e. the myGrid ontology (Wolstencroft et al., 2007).

Lexical information is derived from semantic lexicons such as WordNet (Fellbaum, 1998) and from a domain-related corpus.

Key points:

• The capture of morpho-syntactic behaviour of lexical elements, such as part-of-speech (POS), decomposition, lemmatization and sub-categorization behaviour.

• The lexicalized terms, along with their linguistic information, are added to the OWL-based lexicon based on the LexInfo model.

Case Study: Lexicalizing a bioinformatics ontology

MyGrid Ontology

• Supports service description of bioinformatics resources through service annotation.

• Manual annotation is a slow process: e.g. in Taverna/Feta, only ~15-20% of services are functionally described. The result is a steadily growing backlog of un-annotated services.

• Certain NLP-based attempts at automating service descriptions have been reported in which the myGrid ontology is used.

• Lexicalization of the myGrid ontology can improve the performance of such approaches.


• LexInfo
– A principled way to enrich ontologies with linguistic information.
– Provides a framework for automatic construction of 'lexicalized ontologies' on top of existing ontologies and lexical resources (Buitelaar et al., 2009).

• Main characteristics:
– Two separate domains of discourse, kept apart by using different namespaces: the domain ontology and the LexInfo model.
– The domain ontology defines the classes, properties and individuals in that domain.
– The main entities in the lexical domain of discourse are instances of the class LexicalEntry.
– LexInfo attaches lexical information (e.g. part-of-speech, morphological, sub-categorization) to lexical entries.


Rest of the talk

• Methodology
– Dual approach towards lexicalization of the myGrid ontology
– Collection of Bioinformatics Corpus
– Lexicalization of Class Labels
– Lexicalization of Property Labels

• Statistics, Experiments and Results
– Semi-automatically created lexicon
– Automatically generated lexicon

• What’s Next

Methodology - I
• Dual approach towards lexicalization of the myGrid ontology
– Semi-automatically created LexInfo-based lexicon.
– Automatically created lexicon using the LexInfo ontology lexicalization service.

Difference:
– In the semi-automatically created lexicon, the linguistic information has been mainly derived from the domain corpus and manually analyzed to verify correctness.
– In automatic generation, a generic POS tagger and domain-independent lexical resources are used to derive morpho-syntactic behaviour on the basis of an automatic analysis of the labels of the concepts, properties and individuals in the ontology.

Methodology - II
• Collection of Bioinformatics Corpus
– Domain-specific behaviour (linguistic information) of the lexical entries is derived from 2691 full-text journal articles of BMC Bioinformatics.
– The GeniaTagger is used to get POS information; the tags of interest are Nouns, Proper Nouns, Verbs and Adjectives.
– Syntactic information is derived using the Stanford parser.
– Currently, we have worked only on the syntactic behaviour of properties (owl:ObjectProperty and owl:DatatypeProperty in particular) and not of classes.

Methodology - III
• Lexicalization of Class Labels (Step-wise approach)

1. A LexicalEntry is created for each Class (in the domain ontology) and is linked to the Class through the hasSense property.

2. The LexicalEntry is initialized as one of its sub-classes (e.g. Noun, Verb, Adjective, etc.).

3. The POS tag is derived from a semantic lexicon such as WordNet and further supported by the associated domain corpus.

4. The lexical form (Lemma, WordForm, etc.) is attached to the lexical entry through the corresponding relation: hasLemma or hasWordForm.
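
The four steps above can be sketched in plain Python. This is only an illustration, not the LexInfo API: the `mygrid:` and `lexinfo:` prefixes, the `LexicalEntry` dataclass, the `lexicalize_class` helper and the `pos_lookup` dictionary (standing in for WordNet and corpus evidence) are all assumptions made for the example. Only the property names hasSense, hasLemma and hasWordForm come from the slides.

```python
from dataclasses import dataclass, field

# Hypothetical namespace prefixes (assumption): the slides only say that the
# domain ontology and the LexInfo model use different namespaces.
MYGRID = "mygrid:"
LEXINFO = "lexinfo:"


@dataclass
class LexicalEntry:
    pos_class: str   # LexInfo sub-class of LexicalEntry, e.g. "Noun"
    lemma: str       # attached through hasLemma
    sense: str       # domain class reached through hasSense
    word_forms: list = field(default_factory=list)  # attached through hasWordForm

    def triples(self):
        """Serialize the entry as simple (subject, predicate, object) tuples."""
        entry_id = f"{LEXINFO}entry_{self.lemma}"
        result = [
            (entry_id, "rdf:type", f"{LEXINFO}{self.pos_class}"),
            (entry_id, f"{LEXINFO}hasSense", f"{MYGRID}{self.sense}"),
            (entry_id, f"{LEXINFO}hasLemma", self.lemma),
        ]
        result += [(entry_id, f"{LEXINFO}hasWordForm", wf) for wf in self.word_forms]
        return result


def lexicalize_class(label, pos_lookup):
    """Apply steps 1-4 to a single-word class label.

    pos_lookup is a plain dict standing in for WordNet plus corpus evidence.
    """
    lemma = label.lower()  # assumption: lowercasing stands in for real lemmatization
    pos_class = pos_lookup.get(lemma, "Noun")  # assumption: fall back to Noun
    return LexicalEntry(pos_class, lemma, label, [label])


entry = lexicalize_class("Alignment", {"alignment": "Noun"})
for triple in entry.triples():
    print(triple)
```

Keeping the entry in its own prefix mirrors the separation of the lexical and ontological domains of discourse; a real lexicon would be serialized as OWL rather than as tuples.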

Methodology - III
• Lexicalization of Class Labels (Single Word)

[Figure: the linking of a LexicalEntry with a domain Class, and the attachment of grammatical information and lemma to the LexicalEntry]

Methodology - III
• Lexicalization of Class Labels (Multi-Word)
– LexInfo associates a ListOfComponents with a LexicalEntry. It holds an ordered list of Components, and its size is given as a DataProperty of ListOfComponents.
– Each of the Components is linked with a LexicalEntry.
– The validity of a Component as a legitimate LexicalEntry is derived from its presence in the myGrid ontology as a separate entity, or from its substantive existence in the domain corpus.
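
A minimal sketch of this decomposition follows, under stated assumptions: labels are split on underscores (as in the example label Sequence_similarity_Search), and "substantive existence" in the corpus is approximated by a hypothetical frequency threshold `min_count`. The ontology labels and corpus count used in the call are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    position: int         # 1-based order inside the label
    token: str
    is_valid_entry: bool  # may this token stand as a LexicalEntry of its own?


@dataclass
class ListOfComponents:
    label: str
    components: list

    @property
    def size(self):
        # Mirrors the size value stored on ListOfComponents.
        return len(self.components)


def decompose(label, ontology_labels, corpus_counts, min_count=1):
    """Split a multi-word label and validate each token.

    A token is accepted if it is a separate entity in the ontology or occurs
    at least min_count times in the corpus (min_count is an assumption).
    """
    known = {name.lower() for name in ontology_labels}
    components = []
    for position, token in enumerate(label.split("_"), start=1):
        key = token.lower()
        valid = key in known or corpus_counts.get(key, 0) >= min_count
        components.append(Component(position, token, valid))
    return ListOfComponents(label, components)


loc = decompose(
    "Sequence_similarity_Search",
    ontology_labels={"Sequence", "Search"},  # made-up ontology entities
    corpus_counts={"similarity": 120},       # made-up corpus frequency
)
print(loc.size, [(c.position, c.token, c.is_valid_entry) for c in loc.components])
```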

Methodology - III
• Lexicalization of Class Labels (Multi-Word)

[Figure: an example of morphological decomposition of a multi-word class label from the myGrid ontology]

Methodology - IV
• Lexicalization of Property Labels (Steps)
– Morphological decomposition as well as syntactic analysis of the property label is performed.
– The property labels are automatically tokenized, and the tokens are then linked with LexicalEntries (same as for classes).
– On the syntactic level, the tokens are analyzed to attach their respective syntactic behaviour, which is then linked with the subcategorization frames.
– The LexInfo model provides various specializations of subcategorization frames, such as Transitive, TransitivePP, IntransitivePP, AdjectiveNP, NounPP and Noun2PP.
– Syntactic arguments such as Subject, Object and PObject, linked with the LexicalEntry, are mapped to semantic arguments such as Domain, Range and RangeOfProperty corresponding to the object property.
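
The mapping from syntactic to semantic arguments can be illustrated with the inverse pair performs_task and task_performed_by from the myGrid ontology. The sketch below is an assumption-laden illustration, not the LexInfo data model: the `SyntacticBehaviour` class, the `realize` helper, the direction of each mapping and the example values `ServiceX` and `TaskY` are hypothetical. Only the frame name Transitive and the argument names Subject, Object, Domain and Range are taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class SyntacticBehaviour:
    property_label: str  # ontology property being lexicalized
    verb: str            # lemma of the lexical entry
    frame: str           # subcategorization frame, e.g. "Transitive"
    arg_map: dict        # syntactic argument -> semantic argument


# Both properties share the verb "perform" and a Transitive frame; only the
# argument mapping is inverted. Which property gets which direction is an
# assumption made for this example.
performs_task = SyntacticBehaviour(
    "performs_task", "perform", "Transitive",
    {"Subject": "Domain", "Object": "Range"},
)
task_performed_by = SyntacticBehaviour(
    "task_performed_by", "perform", "Transitive",
    {"Subject": "Range", "Object": "Domain"},
)


def realize(behaviour, domain_value, range_value):
    """Fill the slots of a simple active clause from one property assertion."""
    values = {"Domain": domain_value, "Range": range_value}
    subject = values[behaviour.arg_map["Subject"]]
    obj = values[behaviour.arg_map["Object"]]
    # Naive third-person inflection, good enough for the illustration.
    return f"{subject} {behaviour.verb}s {obj}"


print(realize(performs_task, "ServiceX", "TaskY"))
print(realize(task_performed_by, "TaskY", "ServiceX"))
```

Because the two mappings are mirror images, both assertions verbalize to the same clause, which is the practical benefit of lexicalizing inverse properties with one verb.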

Methodology - IV
• Lexicalization of Property Labels
– In automatic lexicon generation, the lexical entries are derived automatically by processing the labels in the ontology using the LILAC grammar.
– LILAC production rules state part-of-speech patterns that apply to the label. For example, a label with the structure “N Prep” gives rise to a lexicon entry of type “NounPP”.
– Currently, LexInfo uses 73 rules to generate lexicons automatically (further details on the LexInfo homepage).
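
The idea of matching part-of-speech patterns against a label can be sketched as a small rule table. This is not LILAC syntax and not the actual rule set: only the “N Prep” to NounPP rule is taken from the slide, while the other two rules, the toy tag dictionary and the decision to drop a leading "is" before matching are assumptions made for the example.

```python
# Ordered (pattern, frame) rules over coarse POS tags. Only the first rule
# comes from the slides; the other two are illustrative guesses.
RULES = [
    (("N", "Prep"), "NounPP"),
    (("V",), "Transitive"),      # hypothetical rule
    (("V", "N"), "Transitive"),  # hypothetical rule
]

# Toy tag dictionary standing in for a real POS tagger.
TOY_TAGS = {
    "part": "N", "identifier": "N", "task": "N",
    "of": "Prep", "produces": "V", "performs": "V",
}

COPULAS = {"is"}  # assumption: a leading copula is dropped before matching


def tag_label(label):
    """Tokenize an underscore-separated label and tag each token."""
    tokens = [token.lower() for token in label.split("_")]
    if tokens and tokens[0] in COPULAS:
        tokens = tokens[1:]
    # Unknown tokens default to "N" in this toy setup.
    return tokens, tuple(TOY_TAGS.get(token, "N") for token in tokens)


def frame_for(label):
    """Return the frame of the first rule whose pattern equals the tag sequence."""
    _, tags = tag_label(label)
    for pattern, frame in RULES:
        if tags == pattern:
            return frame
    return None


for label in ("is_part_of", "produces", "performs_task", "Alignment"):
    print(label, frame_for(label))
```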

Methodology - IV
• Lexicalization of Property Labels

[Figure: lexicalization of the ObjectProperty produces]

Statistics - I
• Some statistics about the myGrid ontology

Ontology Constructs                       Total Number of Occurrences
Classes                                   475
  Single-word class labels                88
  Two-word class labels                   200
  Three-or-more-word class labels         187
ObjectProperties                          8
  Single-word property labels             1
  Two-word property labels                4
  Three-or-more-word property labels      3
DataProperties                            0
Individuals                               0

Statistics - II
• Semi-automatically generated LexInfo-based lexicon of the myGrid ontology

LexInfo Constructs    Specialized Constructs   Example Labels                  Number of Entries in ‘myGrid Lexicon’
LexicalEntries        Adjective                Multiple                        21
                      Noun                     Alignment                       752
                      Proper Noun              Medline                         253
                      Verb                     Perform                         4
                      NounPhrase               Sequence_similarity_Search      369
                      AdjectivePhrase          Tertiary_Structure_Prediction   16
                      VerbPhrase               Performs_task                   1
Written-Form                                                                   1044
List-of-Components                                                             387
Syntactic-Behaviour   Transitive               produces                        4
                      NounPP                   is_part_of                      4

Statistics - III
• Statistics about the automatically generated LexInfo-based lexicon of the myGrid ontology, created using the LexInfo lexicon generation service

LexInfo Constructs    Specialized Constructs   Example Labels                  Number of Entries in ‘myGrid Lexicon’
LexicalEntry          Adjective                local                           131
                      Noun                     Record                          973
                      Proper Noun              Maize                           15
                      Verb                     Perform                         19
                      NounPhrase               Genotype-phenotype-database     1069
                      ProperNounPhrase         UniProt                         1
                      VerbPhrase               -                               0
List-of-Components                                                             1071
Syntactic-Behaviour   Transitive               produces                        3
                      NounPP                   is_part_of                      4
                      IntransitivePP           produced_by                     1

Discussion
Semi-automatically created lexicon

Lexicalization of Classes
• Most of the LexicalEntries are of type Noun, NounPhrase and ProperNoun.
• Not many Verb occurrences:
– Class labels are mostly named using nouns, whereas the object properties are typically named using verbs.
– The small number of ObjectProperties (8 properties) resulted in a small number of verbs in the lexicon.
• The number of Proper Nouns is 253, 32 of which are created from single-word Class names.
• 387 ListOfComponents are created from the 387 multi-word class names in the myGrid ontology; 371 of them correspond to NounPhrases and 16 are AdjectivePhrases.

Discussion
Semi-automatically created lexicon

Lexicalization of ObjectProperties
• is_identifier_of and is_part_of are lexicalized as Nouns (identifier and part).
– SyntacticBehavior is linked to the subcategorization frame of type NounPP (Noun: identifier, Prep: of; and Noun: part, Prep: of).
• performs_task and task_performed_by are lexicalized as Verb (perform).
– SyntacticBehavior is linked to the subcategorization frame of type Transitive.
– The two properties are inverses of each other and are lexicalized using the same verb; however, the mapping of syntactic arguments to domain and range is inverted in the two cases.
• produces and produced_by are lexicalized as Verb (produce).
• performs_task is recognized as a VerbPhrase with performs as a Verb and a Transitive subcategorization frame linked with it.
• The syntactic behaviours of has_identifier and has_part are also modeled as NounPP.

Discussion
Automatically generated lexicon using the LexInfo service

Lexicalization of Classes (differences from the semi-automatically created lexicon)
• The number of Adjectives has increased significantly to 131, and the number of ProperNouns has decreased steeply to 15.
– The reason is that ProperNouns are incorrectly identified as Adjectives by our POS tagger (Stanford Tagger), e.g. DDBJ in DDBJ_Amino_Acid_Database and PIRSF in PIRSF_report are recognized as Adjectives.
– This problem can be resolved by using domain corpora, or by considering a domain thesaurus or dictionary.
• The number of Verbs has increased to 19.
– This is again due to a POS tagging error: gerunds such as “manipulating” and “predicting” are incorrectly identified as Verbs.
• The identification of ProperNounPhrase is incorrect due to a tokenization error.
– “UniProt” is tokenized as two proper nouns, “uni” and “prot”, although it is a single word, i.e. the name of a bioinformatics database.
– This can also be resolved using a domain corpus or thesaurus.
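
One way the suggested remedy could look is a post-processing step driven by a small domain lexicon. The sketch below is hypothetical throughout: the `DOMAIN_PROPER_NOUNS` set, the `tokenize` and `correct_tags` helpers, and the guess that the “uni” + “prot” split came from camel-case splitting are assumptions, not a description of the LexInfo service.

```python
import re

# Hypothetical domain lexicon; in practice it could be compiled from a domain
# corpus, thesaurus or dictionary.
DOMAIN_PROPER_NOUNS = {"DDBJ", "PIRSF", "UniProt"}

# Naive camel-case splitter, assumed here to be the kind of rule that would
# break "UniProt" into two tokens.
CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def tokenize(label, protected=DOMAIN_PROPER_NOUNS):
    """Split on underscores and camel case, but keep protected names whole."""
    tokens = []
    for part in label.split("_"):
        if part in protected:
            tokens.append(part)
        else:
            tokens.extend(CAMEL.findall(part))
    return tokens


def correct_tags(tagged, protected=DOMAIN_PROPER_NOUNS):
    """Override the generic tagger's output for tokens in the domain lexicon."""
    return [(token, "ProperNoun" if token in protected else tag)
            for token, tag in tagged]


# Made-up tagger output reproducing the DDBJ error reported above.
generic_output = [("DDBJ", "Adjective"), ("Amino", "Noun"),
                  ("Acid", "Noun"), ("Database", "Noun")]
print(tokenize("UniProt"))
print(tokenize("UniProt", protected=set()))
print(correct_tags(generic_output))
```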

Discussion
Automatically generated lexicon using the LexInfo service

Lexicalization of ObjectProperties
• ObjectProperties are mostly lexicalized correctly.
• The only error is in the lexicalization of “produced_by”, which is recognized as IntransitivePP.
– This is because of an error in the ontology lexicalization (LILAC) rules, which treat a past-participle verb followed by “by” as an occurrence of IntransitivePP.

Implementation
• Initial implementation of the LexInfo model as an API
– Univ. Bielefeld; DERI, National Univ. of Ireland, Galway
– https://lexinfo.googlecode.com/svn

Future Work

Linguistically enriched ontology for improvement of service annotation
• The linguistically enriched lexicon associated with the myGrid ontology can improve the performance of literature-based approaches for automatic annotation of bioinformatics web services.

Optimization of the LexInfo model by including WordNet etc.
• Generate all possible lexicalizations of given ontological constructs by utilizing synsets from WordNet, and extract semantically similar verbs from VerbNet and FrameNet.

LexInfo API is currently under development
• Allows the creation, management and serialization of ontology lexica according to the LexInfo model.
• An early prototype of a lexicon generation service based on the LexInfo model is also available.
• Available at: http://code.google.com/p/lexinfo/

Acknowledgments
• Supported in part by the European Union under Grant No. 248458 for the Monnet project as well as by the Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).

• Thanks to Thomas Wangler, Michael Sintek and Matthias Mantel for their valuable contributions in designing the LexInfo model and developing the LexInfo API.

References
• Afzal, H., Stevens, R. and Nenadic, G.: Mining Semantic Descriptions of Bioinformatics Web Resources from the Literature. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), LNCS 5554, Springer-Verlag: 535-549.

• Afzal, H., Stevens, R. and Nenadic, G.: Towards Semantic Annotation of Bioinformatics Services: Building a Controlled Vocabulary. In Proceedings of the Third International Symposium on Semantic Mining in Biomedicine (SMBM 2008): 5-12.

• Buitelaar, P., Declerck, T., Frank, A., Racioppa, S., Kiesel, M., Sintek, M., Engel, R., Romanelli, M., Sonntag, D., Loos, B., Micelli, V., Porzel, R. and Cimiano, P.: LingInfo: Design and Applications of a Model for the Integration of Linguistic Information in Ontologies. In Proceedings of OntoLex06, a workshop at LREC, Genoa, Italy.

• Buitelaar, P., Cimiano, P., Haase, P. and Sintek, M.: Towards Linguistically Grounded Ontologies. In Proceedings of the 6th European Semantic Web Conference (ESWC 2009), Lecture Notes in Computer Science, Springer 2009.

• Cimiano, P., Haase, P., Herold, M., Mantel, M. and Buitelaar, P.: LexOnto: A model for ontology lexicons for ontology-based NLP. In Proceedings of the OntoLex (From Text to Knowledge: The Lexicon/Ontology Interface) workshop at ISWC07 (International Semantic Web Conference).

• Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M. and Soria, C.: Lexical markup framework: ISO standard for semantic information in NLP lexicons. In Proceedings of the Workshop of the GLDV Working Group on Lexicography at the Biennial Spring Conference of the GLDV

Resources Used
• BMC Bioinformatics: http://www.biomedcentral.com/bmcbioinformatics/
• Genia Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• Stanford Parser: http://nlp.stanford.edu/downloads/lex-parser.shtml
• Stanford Tagger: http://nlp.stanford.edu/software/tagger.shtml
• TreeTagger: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
