Infrastructural Language Resources & Standards for Multilingual Computational Lexicons Nicoletta Calzolari with many others
Istituto di Linguistica Computazionale - CNR - Pisa
Pisa, September 2004
The ENABLER Mission
Language Resources (LRs) & Evaluation: central component of the linguistic infrastructure
LRs supported by national funding in National Projects
Availability of LRs also a sensitive issue, touching the sphere of linguistic and cultural identity, but also with economical and political implications
The ENABLER Network of National initiatives aims at enabling the realisation of a cooperative framework
formulate a common agenda of medium- & long-term research priorities
contribute to the definition of an overall framework for the provision of LRs
towards ...
Only
Combining the strengths of different initiatives & communities
Exploiting at best the modus operandi of the national funding authorities in different national situations
Responding to/anticipating needs and priorities of R&D & industrial communities
Promoting the adoption of [de facto] standards, best practices
With a clear distinction of tasks & roles for different actors
We can produce the synergies, economy of scale, convergence & critical mass necessary to provide the infrastructural LRs needed to realise the full potential of a multilingual global information society
Lexicon and Corpus: a multi-faceted interaction
L → C: tagging
C → L: frequencies (of different linguistic objects)
C → L: proper nouns, acronyms
L → C: parsing, chunking
C → L: training of parsers
C → L: lexicon updating
C → L: collocational data (MWE, idioms, gram. patterns ...)
C → L: nuances of meanings & semantic clustering
C → L: acquisition of lexical (syntactic/semantic) knowledge
L → C: semantic tagging/word-sense disambiguation (e.g. in Senseval)
C → L: more semantic information on LE
C → L: corpus based computational lexicography
C → L: validation of lexical models
C → L, L → C ...
... Language as a Continuum
Interesting - and intriguing - aspects of corpus use:
impossibility of descriptions based on a clear-cut boundary between what is admitted and what is not
in actual usage, language displays a large number of properties behaving as a continuum, and not as properties of yes/no type
the same is true for the so-called rules, where we find more a tendency towards rules than precise rules in corpus evidence
difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
BUT Lexicon & Corpus as two viewpoints on the same linguistic object, even more in a multilingual context
Extraction from texts vs. formal representation in lexicons
It is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
The rigour and lack of flexibility of formal representation languages cause difficulties when mapping natural-language word meaning, ambiguous and flexible by its own nature, into them
No clear-cut boundary when analysing many phenomena: it's more a continuum
The same impression if one looks at examples of types of alternations: no clear-cut classes across languages or within one language
Correlation between different levels of linguistic description in the design of a lexical entry
To understand word-meaning:
Focus on the correlation between syntactic and semantic aspects
But other linguistic levels - such as morphology, morphosyntax, lexical cooccurrence, collocational data, etc. - are closely interrelated/involved
These relations must be captured when accounting for meaning discrimination
The complexity of these interrelationships makes semantic disambiguation such a hard task in NLP
Textual corpora as a device to discover and reveal the intricacy of these relationships
Frame/SIMPLE semantics as a device to unravel and disentangle the complex situation into elementary and computationally manageable pieces
towards Corpus-based Semantic Lexicons
at least in principle
both in the design of the model, &
in the building of the lexicon (at least partially)
with (semi-)automatic means
Design of the lexical entry with a combined approach:
theoretical: e.g. Fillmore Frame Semantics/Pustejovsky Generative Lexicon
empirical: Corpus evidence
even if there are not always sound and explicit criteria for classification according to frame elements/qualia relations/...
But they will never be complete
Infrastructure of Language Resources:
Semantic networks: Euro-/ItalWordNet
Lexicons: PAROLE/SIMPLE/CLIPS
TreeBanks
...
Infrastructure of tools:
Lexical acquisition systems (syntactic & semantic) from corpora
Robust morphosyntactic & syntactic analysers
Word-sense disambiguation systems
Sense classifiers
...
[Diagram labels: static, dynamic]
International Standards
ItalWordNet Semantic Network
[Italian module of EuroWordNet]
~ 50.000 lemmas organized in synonym groups (synsets), structured in hierarchies & linked by ~ 130.000 semantic relations:
~ 50.000 hyperonymy/hyponymy relations
~ 16.000 relations among different POS (role, cause, derivation, etc.)
~ 2.000 part-whole relations
~ 1.500 antonymy relations, etc.
Synsets linked to the InterLingual Index (ILI = Princeton WordNet)
Through the ILI, link to all the European WordNets (de-facto standard) & to the common Top Ontology
Possibility of plug-in with domain terminological lexicons (legal, maritime)
Usable in IR, CLIR, IE, QA, ...
EuroWordNet Multilingual Data Structure
[Figure: language-specific wordnets (Italian, Spanish, Dutch, English, French, German, Estonian, Czech) connected through the ILI and the Top Ontology; labels shown: hond, dog, cane, perro, Living, Animal, Human]
Synsets linked by Semantic Relations in ItalWordNet
{casa, abitazione, dimora}
Hyperonym: {edificio, ..}
Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}
Role_location: {stare, abitare, ...}
Role_target_direction: {rincasare}
Role_patient: {affitto, locazione}
Mero_part: {vestibolo} {stanza}
Holo_part: {casale} {frazione} {caseggiato}
home, domicile, ..
house
TOP Concepts: Object, Artifact, Building
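The synset above can be pictured as a record that maps relation names to target synsets. The following is only a sketch under assumptions: the `Synset` class, its field names and the `related` helper are illustrative and do not reflect the actual ItalWordNet storage format. Members elided on the slide ("..") are simply left out.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """Illustrative synset record (hypothetical layout, not ItalWordNet's)."""
    lemmas: list
    # relation name -> list of target synsets, each given as a list of lemmas
    relations: dict = field(default_factory=dict)

    def related(self, relation):
        """Return the target synsets for a relation, or an empty list."""
        return self.relations.get(relation, [])


casa = Synset(
    lemmas=["casa", "abitazione", "dimora"],
    relations={
        "hyperonym": [["edificio"]],
        "hyponym": [["villetta"], ["catapecchia", "bicocca"], ["cottage"], ["bungalow"]],
        "role_location": [["stare", "abitare"]],
        "role_target_direction": [["rincasare"]],
        "role_patient": [["affitto", "locazione"]],
        "mero_part": [["vestibolo"], ["stanza"]],
        "holo_part": [["casale"], ["frazione"], ["caseggiato"]],
    },
)
```

Asking for a relation that the synset does not carry, for example `casa.related("antonym")`, returns an empty list rather than raising an error.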
Jur-WordNet
With ITTG-CNR (Istituto di Teoria e Tecniche dell'Informazione Giuridica)
Jur-WordNet: extension of ItalWordNet for the juridical domain
Knowledge base for multilingual access to sources of legal information
Source of metadata for semantic mark-up of legal texts
To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.
Terminological Lexicon of Navigation & Sea Transportation
Nolo
Synsets: 1.614
Lemmas: 2.116
Senses: 2.232
Nouns: 1.621
Verbs: 205
Adjectives: 35
Proper Nouns: 236
PAROLE: Ital. Synt. Lex., 96-98
SIMPLE: Ital. Sem. Lex., 98-2000
CLIPS: 2000-2004
morphology: 20,000 entries
syntax: 20,000 words
semantics: 10,000 senses
phonology
morphology: 55,000 words
syntax
semantics: 55,000 senses
[Format labels: SGML, SGML, XML]
PAROLE/SIMPLE: 12 harmonised computational lexicons
http://www.ilc.cnr.it/clips/
machine language learning
development of conceptual networks
linguistic learning
adaptive classification systems
information extraction
bootstrapping of grammars
linguistic change models
language usage models
bootstrapping of lexical information
Architecture for linguistic knowledge acquisition ... LKG
towards dynamic lexicons, able to auto-enrich
[Diagram labels: lexica, unstructured text data, annotation tools, annotated data, machine learning for linguistic knowledge acquisition, lexica, terminology, lexicon model, user needs, cross-lingual information retrieval, multi-lingual information extraction, multi-lingual text mining]
Harmonisation: More & more Need of a Global View for Global Interoperability
Integration/sharing of data & software/tools
Need of compatibility among various components
An exemplary cycle:
[Diagram labels: Formalisms, Grammars, Software: Taggers, Chunkers, Parsers, Representation, Annotation, Lexicon, Corpora, Terminology, Software: Acquisition Systems, I/O Interfaces, Languages]
A short guide to ISLE/EAGLES
http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm
Multilingual Computational Lexicon Working Group
Target: the Multilingual ISLE Lexical Entry (MILE)
General methodological principles (from EAGLES):
high granularity: factor out the (maximal) set of primitive units of lexical info (basic notions) with the highest degree of inter-theoretical agreement
modular and layered: various degrees of specification possible
explicit representation of info
allow for underspecification (& hierarchical structure)
leading principle: edited union of existing lexicons/models (redundancy is not a problem)
open to different paradigms of multilinguality
oriented to the creation of large-scale & distributed lexicons
Paths to Discover the Basic Notions of MILE
clues in dictionaries to decide on target equivalent
guidelines for lexicographers
clues (to disambiguate/translate) in corpus concordances
lexical requirements from various types of transfer conditions & actions in MT systems
lexical requirements from interlingua-based systems
Designing MILE
Steps towards MILE:
Creating entries (Bertagna, Reeves, Bouillon)
Identifying the MILE Basic Notions (Bertagna, Monachini, Atkins, Bouillon)
Defining the MILE Lexical Model (Lenci, Calzolari, etc.)
Formalising MILE (Ide)
Development of the ISLE Lexical Tool (Bel)
ISLE & spoken language & multimodality (Gibbon)
Metadata for the lexicon (Peters, Wittenburg)
A case-study: MWEs in MILE (Quochi, Lenci, Calzolari)
the MILE Basic Notions
the MILE Lexical Model
The MILE Basic Notions (the EAGLES/ISLE CLWG)
Basic lexical dimensions & info-types relevant to establish multilingual links
Typology of lexical multilingual correspondences (relevant conditions & actions)
Identified by:
creating sample multilingual lexical entries (Bertagna, Reeves)
investigating the use of sense indicators in traditional bilingual dictionaries (Atkins, Bouillon).
The MILE Lexical Classes Data Categories for Content Interoperability
Francesca Bertagna*, Alessandro Lenci, Monica Monachini*, Nicoletta Calzolari*
*ILC-CNR Pisa
Pisa University
Overview
MILE Lexical Model with Lexical Objects and Data Categories
Mapping of existing lexicons onto MILE
RDF schema and DC Registry for some pre-instantiated lexical objects together with a sample entry from the PAROLE-SIMPLE lexicons in MILE
Future
The MILE Lexical Model
[Diagram labels: GENELEX Model, PAROLE-SIMPLE Lexicons, Multilingual Lexicons (EuroWordNet, etc.), MILE Lexical Model, Guidelines, syntactic, semantic, lexicons, where after?]
The MILE Main Features
A general architecture devised as a common representational layer for multilingual Computational Lexicons
both for hand-coded and corpus-driven lexical data
Key features:
Modularity
Granularity
Extensibility and openness - User-adaptability
Resource Sharing
Content Interoperability
Reusability
Semantic Web technologies & standards applied to Lexicon modelling
The MILE Lexical Model (MLM)
The MLM core is the Multilingual ISLE Lexical Entry (MILE)
a general schema for multilingual lexical resources
a lexical meta-entry as a common representational layer for multilingual lexicons
Computational lexicons can be viewed as different instances of the MILE schema
[Diagram labels: MILE Lexical Model, lexicon#1, lexicon#2, lexicon#3]
MILE: the building-block model
The MILE architecture is designed according to the building-block model:
Lexical entries are obtained by combining various types of lexical objects (atomic and complex)
Users design their lexicon by:
selecting and/or specifying the relevant lexical objects
combining the lexical objects into lexical entries
Lexical objects may be shared:
within the same lexicon (intra-lexicon reusability)
among different lexicons (inter-lexicon reusability)
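The building-block idea can be sketched as a repository of lexical objects that entries reference instead of copying. This is a minimal illustration under assumptions: the identifiers such as `Function:Subj`, the dictionary layout and the `build_entry` helper are hypothetical and are not taken from the MILE specification.

```python
# Hypothetical repository of shared lexical objects, keyed by identifier.
repository = {
    "Phrase:NP": {"type": "Phrase", "label": "NP"},
    "Function:Subj": {"type": "Function", "label": "Subj"},
    "Function:Obj": {"type": "Function", "label": "Obj"},
}


def build_entry(lemma, object_ids, repo):
    """Combine existing lexical objects into an entry by reference."""
    missing = [oid for oid in object_ids if oid not in repo]
    if missing:
        raise KeyError(f"unknown lexical objects: {missing}")
    return {"lemma": lemma, "objects": [repo[oid] for oid in object_ids]}


eat = build_entry("eat", ["Function:Subj", "Function:Obj", "Phrase:NP"], repository)
love = build_entry("love", ["Function:Subj", "Function:Obj"], repository)

# Both entries point at the same object, not at copies of it.
shared = eat["objects"][0] is love["objects"][0]
```

Passing a second repository to `build_entry` would model inter-lexicon reuse in the same way; here a single repository stands for intra-lexicon reuse.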
Modularity in MILE
multiple levels of modularity
Horizontal organization, where independent but interlinked modules allow different dimensions of lexical entries to be expressed
[Diagram labels: multi-MILE, multilingual correspondence conditions]
The Mono-MILE
Each monolingual layer within Mono-MILE identifies a basic unit of lexical description
morphological layer: MU, basic unit to describe the inflectional and derivational morphological properties of the word
syntactic layer: SynU, basic unit to describe the syntactic behaviour of the MU
semantic layer: SemU, basic unit to describe the semantic properties of the MU
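The three layers can be sketched as three small record types in which the syntactic and semantic units point back to the morphological unit they describe. The class fields below (`pos`, `frames`, `semantic_type`) are assumptions chosen for illustration; the example values for migliorare are borrowed from the SIMPLE linking slide later in the deck.

```python
from dataclasses import dataclass, field


@dataclass
class MU:
    """Morphological unit: inflectional and derivational properties."""
    lemma: str
    pos: str


@dataclass
class SynU:
    """Syntactic unit: syntactic behaviour of an MU."""
    mu: MU
    frames: list = field(default_factory=list)


@dataclass
class SemU:
    """Semantic unit: semantic properties of an MU."""
    mu: MU
    semantic_type: str


migliorare = MU(lemma="migliorare", pos="V")
syn_unit = SynU(migliorare, frames=["transitive", "intransitive"])
sem_units = [
    SemU(migliorare, "CHANGE_OF_STATE"),
    SemU(migliorare, "CAUSE_CHANGE_OF_STATE"),
]
```

One MU can thus carry several SemUs, which is how a single word form is tied to more than one sense in this sketch.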
The Mono-MILE
Within each layer, a basic linguistic information unit is identified
[Diagram label: MU]
Granularity in MILE
Concerns the vertical dimension. Within a given lexical layer, varying degrees of depth of lexical description are allowed: both shallow and deep lexical representations
Defining the MLM
The MLM is designed as an E-R model (MILE Entry Schema)
defines the lexical objects and the ways they can be combined into a lexical entry
The MLM includes 3 types of lexical objects:
MILE Lexical Classes (MLC)
MILE Lexical Data Categories (MDC)
MILE Lexical Operations (MLO)
The MILE Lexical Objects
Within each layer, basic lexical notions are represented by lexical objects:
MILE Lexical Classes (MLC)
MILE Data Categories (MDC)
Lexical operations
They are an ontology of lexical objects as an abstraction over different lexical models and architectures
The MILE E/R diagrams
The lexical objects are described with E-R diagrams which define them and the ways they can be combined into a lexical entry
MILE Lexical Objects: Syntactic Layer
[E-R diagram. Classes: MLC:SynU, MLC:SyntacticFrame, MLC:FrameSet, MLC:Composition, MLC:SemU, MLC:CorrespSynUSemU. Relations: hasSyntacticFrame, hasFrameSet, composedby, correspondTo. Cardinality labels: 1..*, *]
[E-R diagram, expanding one node. Labels: SyntacticFrame, Construction, Self, Slot, Slot, SynU, Function, Phrase]
MILE Lexical Objects: Semantic Layer
[E-R diagram. Classes: MLC:SemU, MLC:Synset, MLC:SemanticFrame, MLC:SemanticFeature, MLC:Collocation, MLC:SemanticRelation. Relations: belongsToSynset, hasSemFrame, hasSemFeature, hasCollocation, semanticRelation. Cardinality labels: *, 0..1]
MILE Lexical Objects: Synt-Sem Linking
[E-R diagram. Classes: MLC:CorrespSynUSemU, MLC:SynU, MLC:SemU, MLC:PredicativeCorresp, MLC:SlotArgCorresp. Relations: hasSourceSynu, hasTargetSemu, hasPredicativeCorresp, IncludesSlotArgCorresp. Cardinality labels: 1, 0..*]
Syntax-Semantics Linking
CorrespSynUSemU
PredCorresp
Slot0:Arg1
Slot1:Arg0
Syntax-Semantics Linking
John gave the book to Mary
John gave Mary the book
SynU#1: subj_NP, obj_NP, obl_PP_to
SynU#2: subj_NP, obj_NP, obj_NP
SemU#1: Semantic_Frame: GIVE
Arg1: Agent
Arg2: Theme
Arg3: Goal
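The linking in this example can be sketched as a table of slot-to-argument correspondences, one table per syntactic unit. This is an illustration only: the slot ordering and the pairing for SynU#2 are assumptions read off the two example sentences, and the `roles_for` helper is hypothetical.

```python
# Semantic frame GIVE with its arguments, as listed on the slide.
GIVE = {"Arg1": "Agent", "Arg2": "Theme", "Arg3": "Goal"}

# Each SynU lists its syntactic slots in an assumed order.
syn_units = {
    "SynU#1": ["subj_NP", "obj_NP", "obl_PP_to"],  # John gave the book to Mary
    "SynU#2": ["subj_NP", "obj_NP", "obj_NP"],     # John gave Mary the book
}

# Slot index -> argument. The pairing for SynU#2 is an assumption.
correspondences = {
    "SynU#1": {0: "Arg1", 1: "Arg2", 2: "Arg3"},
    "SynU#2": {0: "Arg1", 1: "Arg3", 2: "Arg2"},
}


def roles_for(syn_id):
    """Return (slot, semantic role) pairs for a syntactic unit."""
    slots = syn_units[syn_id]
    mapping = correspondences[syn_id]
    return [(slots[i], GIVE[mapping[i]]) for i in range(len(slots))]
```

Both syntactic units share one SemU; only the correspondence table differs, which is the point of keeping the linking separate from the syntactic and semantic descriptions.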
Syntax-Semantic Linking in SIMPLE
[Diagram: SynU_migliorare with a Frameset containing a transitive structure (Slot0, Slot1) and an intransitive structure (Slot0); SemU1_migliorare and SemU2_migliorare, with types CHANGE_OF_STATE and CAUSE_CHANGE_OF_STATE; PRED_migliorare with ARG0:Agent and ARG1:Patient; two CorrespSynUSemU links, labelled isomorphic and non-isomorphic, each with a SlotArgCorresp]
The Multilingual layer
[E-R diagram. Classes: MultiCorresp, MUMUCorresp, SynUSynUCorresp, SemUSemUCorresp, SynsetMultCorresp, SemanticFrameMultCorresp. Relations: hasMUMUCorr, hasSynUSynuCorr, hasSemUSemUCorr, hasSynsetMultCorr, hasSemFrameCorr. Cardinality label on each relation: 1..0]
MILE approach to multilinguality
Open to various approaches
transfer-based: monolingual descriptions are used to state correspondences (tests and actions) between source and target entries
interlingua-based: monolingual entries linked to language-independent lexical objects (e.g. semantic frames, primitive predicates, etc.)
The Multi-MILE
Multi-MILE specifies a formal environment to express multilingual correspondences between lexical items
Source and target lexical entries can be linked by exploiting (possibly combined) aspects of their monolingual descriptions
monolingual lexicons act as pivot lexical repositories, on top of which language-to-language multilingual modules can be defined
The Multi-MILE
Multi-MILE may include:
Multilingual operations to establish transfer links between source and target mono-MILE
Multilingual lexical objects
enrich the source and target lexical descriptions, but
do not belong to the monolingual lexicons
Language-independent lexical objects:
Primitive semantic frames, interlingual synsets, etc.
Relevant for interlingua approaches to multilinguality
Multi-MILE
IT_SemU_2 → En_SemU_1
IT_SynU_2 → En_SynU_1
IT_Slot_0 → EN_Slot_1
IT_Slot_1 → EN_Slot_0
AddFeature to source SemU: +HUMAN
AddSlot to target SynU: MODIF [PP_with]
Multi-MILE
IT Lexicon → EN Lexicon, with multilingual conditions:
dito → finger: modif(mano)
dito → toe: modif(piede)
entrare → to enter
entrare + PP_di_corsa → run + PP_into
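The dito example can be sketched as a transfer table in which each source lemma lists its target candidates together with the condition that selects them. The table layout and the `translate` function are hypothetical and only illustrate how a multilingual condition constrains a transfer link.

```python
# Hypothetical transfer table: source lemma -> list of (condition, target lemma).
# The condition is the lemma of the modifier that must accompany the source word.
TRANSFER = {
    "dito": [("mano", "finger"), ("piede", "toe")],
}


def translate(lemma, modifier=None):
    """Return target lemmas whose condition matches the modifier.

    With no modifier, all candidates are returned: the link stays ambiguous.
    """
    candidates = TRANSFER.get(lemma, [])
    if modifier is None:
        return [target for _, target in candidates]
    return [target for condition, target in candidates if condition == modifier]
```

For instance, `translate("dito", "piede")` selects the toe reading, while `translate("dito")` returns both candidates because no condition has been supplied.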
MILE Lexical Classes
Represent the main building blocks of lexical entries
Formalize the MILE Basic Notions
Define an ontology of lexical objects
represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
Similar to class definitions in OO languages
specify the relevant attributes
define the relations with other classes
hierarchically structured
MILE Lexical Classes: an ontology of lexical objects
MILE Lexical Data Categories
MDC are instances of the MILE Lexical Classes
Can be used off the shelf or as a departure point for the definition of new or modified categories
Enable modular specification of lexical entities using all or parts of the lexical information in the repository
Each MDC represents a resource uniquely identified by a URI
Two types of MDC:
Core MDC
belong to shared repositories (Lexical Data Category Registry)
lexical objects and linguistic notions with wide consensus
User-Defined MDC
user-specific or language-specific lexical objects
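A registry of data categories keyed by URI, with a flag separating core from user-defined entries, might look like the sketch below. Everything here is an assumption made for illustration: the `DataCategoryRegistry` class, the example.org URIs and the choice of which features count as core are not taken from the MILE registry.

```python
class DataCategoryRegistry:
    """Toy registry of lexical data categories keyed by URI (hypothetical)."""

    def __init__(self):
        self._items = {}

    def register(self, uri, lexical_class, label, core):
        """Add a category; a URI may be registered only once."""
        if uri in self._items:
            raise ValueError(f"URI already registered: {uri}")
        self._items[uri] = {"class": lexical_class, "label": label, "core": core}

    def lookup(self, uri):
        """Return the category stored under a URI."""
        return self._items[uri]

    def labels(self, core):
        """Sorted labels of core (True) or user-defined (False) categories."""
        return sorted(v["label"] for v in self._items.values() if v["core"] == core)


registry = DataCategoryRegistry()
# Placeholder URIs; the core/user-defined split below is assumed.
registry.register("http://example.org/mdc/core#HUMAN", "SemanticFeature", "HUMAN", core=True)
registry.register("http://example.org/mdc/core#ANIMAL", "SemanticFeature", "ANIMAL", core=True)
registry.register("http://example.org/mdc/user#MAMMAL", "SemanticFeature", "MAMMAL", core=False)
```

Rejecting a second registration of the same URI is one simple way to keep each category uniquely identified, as the slide requires.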
The MILE Data Categories
Instances of the MILE Lexical Classes are Data Categories
MDC can belong to a shared repository or be user-defined
[Diagram labels: User-defined MDC, Core MDC, MLC]
The MILE Data Categories: User-adaptability and extensibility
[Diagram: HUMAN, ARTIFACT, EVENT, ANIMAL, GROUP, AGE, MAMMAL shown as instance_of MLC:SemanticFeature, grouped under Core and UserDefined]
MILE Lexical Data Categories
[Listings shown for: MLM:Feature, MLM:GrammaticalFunction]
MILE Lexical Operations
They are used to state conditions and perform operations over lexical entries
Link syntactic slots and semantic arguments
Constrain the syntax-semantic link
Express tests and actions in the transfer conditions in the multi-MILE
They provide the glue to link various independent intra-lexical and inter-lexical components
Multilingual Operations
Source-to-target language transfer conditions can be expressed by combining multilingual operations
Three types of multilingual operations:
Multilingual correspondences
Link a source lexical object (MU, SemU, SynU, semantic argument, syntactic slot) and a target lexical object (MU, SemU, SynU, semantic argument, syntactic slot)
Add-operations
Add lexical information relevant for the cross-lingual link, but not present in the source or target mono-MILE
Constrain-operations
Constrain the transfer link to some portions of source and target mono-MILE
Defining the MLM
[Diagram labels: MILE Entry Schema, MILE Lexical Classes, RDF/S Descriptions]
RDF Instantiation of the MLM
[Diagram labels: Lexicon#1, Lexicon#2, Lexicon#3, Resources, Lexical Objects, Lexical Classes, Lexical Data Categories, Resources, Metadata]
MILE Lexical Model
Ideal structure for rendering in RDF:
hierarchy of lexical objects built up by combining atomic data categories via clearly defined relations
Proof of concept:
Create an RDF schema for the MILE Lexical Model version 1.2
Instantiate MILE Lexical Data Categories
User-Adaptability and Resource Sharing in MILE
Compatible with different models of lexical analysis:
Relational semantic models (e.g. WordNet)
Syntactic and semantic frames
Ontology-based lexicons
Compatible with different degrees of specification:
Deep lexical representations (e.g. PAROLE-SIMPLE)
Terminological lexicons
Compatible with different paradigms of multilinguality:
Lexicons for Transfer-Based MT
Interlingua-based lexicons
The MILE Lexical Model
[Diagram label: MILE Lexical Model]
RDF Instantiation of the MLM
Enable universal access to sophisticated linguistic info
Provide means for inferencing over lexical info
Incorporate lexical information into the Semantic Web
W3C standards:
Resource Description Framework (RDF)
Web Ontology Language (OWL)
Built on the XML web infrastructure to enable the creation of a Semantic Web
web objects are classified according to their properties
semantics of relations (links) to other web objects precisely defined
The RDF Schema
Defines classes of objects (MLC) and their relations to other objects
Like a class definition in Java, etc.
Classes and properties in the schema correspond to the E-R model
Can specify sub-classes/sub-properties and inheritance
Goals
Lexical information will form a central component of semantic information
Need a standardized, machine-processable format so that information can be used and merged with other information
Main task: get the data model right
See Semantic Web
Advantages of RDF
Modularity
Can create instances of bits of lexical information for re-use in a single lexicon or across lexicons
Instances can be stored in a central repository for use by others
Can use partial information or all of it
Building block approach to lexicon creation
Web-compatible
RDF instantiation will integrate into Semantic Web
Inferencing capabilities
Example
Three parts:
RDF Schema for lexical entries
Defines classes and properties, sub-classes, etc.
Sample repository of RDF-instantiated lexical objects
Three levels of granularity
Sample lexicon entries
Use repository information at different levels
Sample Repositories
repository of enumerated classes for lexical objects at the lowest level of granularity
definition of sets of possible values for various lexical objects
repository of phrases for common phrase types, e.g., NP, VP, etc.
repository of constructions for common syntactic constructions
Enumerated classes
Subj, Obj, Comp, Arg, Iobj
tense, gender, control, person, aux
have, be, subject_control, object_control, masculine, feminine
Sample LDCR for a Phrase Object
Sample LDCR entry for a Construction object
Full entry
John ate the cake
Entry Using Phrase
John ate the cake
Entry Using Construction
John ate the cake
Semantic Representation
The data model underlying RDF/UML, etc. is universal, abstract enough to capture all types of info
Semantic representations:
Registry of basic data categories
meta-categories: addressee, utterance, etc.
Information categories: eyebrow movement, gestures, pitch
Supporting ONTOLOGY of information categories
Interpretative procedures yield another level of meaning representation
Registry of categories
[Diagram labels: UNINTERPRETED REPRESENTATION, INTERPRETATION PROCESS, INTERPRETED REPRESENTATION]
MILE Lexical Data Category Registry (MDC)
Instantiation of pre-defined lexical objects
Extension of the shared class schema with lexicon-specific sub-classes and sub-properties
Can be used off the shelf or as a departure point for the definition of new or modified categories
Enables modular specification of lexical entities
eliminate redundancy
identify lexical entries or sub-entries with shared properties
MLC in RDF/S: features
features are properties of lexical objects
[Diagram. Nodes: mlm:LexObject, mlm:Values, mlm:feature, mlm:SemValues, mlm:SynValues, mlm:semFeature, mlm:synFeature. Edge labels: rdfs:subClassOf, rdfs:subClassOf, rdfs:subPropertyOf]
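The sub-class and sub-property links in this diagram can be sketched as plain subject-predicate-object triples with a small transitive lookup. The exact edges are an assumption, since the diagram survives only as a list of labels: here mlm:SemValues and mlm:SynValues are taken to be sub-classes of mlm:Values, and mlm:semFeature and mlm:synFeature sub-properties of mlm:feature. No RDF library is used; `ancestors` is a hypothetical helper.

```python
# Assumed reading of the diagram as (subject, predicate, object) triples.
TRIPLES = [
    ("mlm:SemValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:SynValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:semFeature", "rdfs:subPropertyOf", "mlm:feature"),
    ("mlm:synFeature", "rdfs:subPropertyOf", "mlm:feature"),
]


def ancestors(node, predicate, triples):
    """Collect everything reachable from node via predicate, transitively."""
    found = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for subject, pred, obj in triples:
            if subject == current and pred == predicate and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found
```

Under this reading, a statement made with mlm:semFeature can also be retrieved by a query on mlm:feature, which is the kind of inference the schema is meant to support.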
Synsets in RDF/S
[Diagram labels: mlm:Synset, rdfs:literal, mlm:word, mlm:Synset, mlm:synsetRelation, mlm:Values, rdfs:literal, mlm:gloss, mlm:feature]
cf. also http://www.semanticweb.org/library/wordnet/wordnet-20000620.rdfs
Synsets in RDF/S
Synset: This class formalizes the notion of synset as defined in WordNet (Fellbaum 1998).
The WordNet hypernym relation
The WordNet meronym relation
relation between synsets
different types of synset relations
Foundations of the Mapping Experiment
1. The MILE building-block model
The MILE Lexical Classes and the MILE Lexical Data Categories are the main building blocks of the MILE lexical architecture
Building blocks allow two kinds of reusability:
intra-lexicon reusability (within the same lexicon)
inter-lexicon reusability (among different lexicons)
How do building blocks work?
2. MILE: a meta-entry
MILE is
a general schema for multilingual lexical resources
a lexical meta-entry, a common representational layer for multilingual lexicons
Computational lexicons can be viewed as different instances of the MILE schema
[Diagram labels: MILE, lexicon#1, lexicon#2, lexicon#3]
MILE and Content Interoperability
This common, shared, compatible representation of lexical objects is particularly suited to:
manipulate objects available in different lexical resources
understand their deep semantics
apply the same operations to lexical objects of the same type
These are key elements of Content Interoperability
The Mapping Experiment: Why?
It is a concrete experiment aimed at testing the expressive potential and capabilities of MILE
The idea is that if the MILE atomic notions, combined in different ways, suit the different visions underlying two lexicons such as FrameNet and NOMLEX:
MILE will come out strengthened
its adoption as an interface between differently conceived lexical architectures can be pushed further
key issues for content interoperability between resources can be addressed
The mapping scenarios
High-level mapping of the objects of a lexicon into the objects of the abstract model: the native structure is maintained and no format conversion is performed
Translating instances of lexical entries directly into MILE: MILE acts as a true interchange format
FrameNet to MILE
FrameNet-MILE: Observations
The mapping is promising
Frame → Predicate (primitive)
Frame Elements → Argument (enlarge the set of possible values)
Lexical_Unit → SemU
Link SemU-Predicate (obligatory) should become underspecified
But
The lack of an inheritance mechanism in the Predicate makes it impossible to represent the hierarchical organization of Frames and Sub-frames, temporal ordering among Frames, and subsumption relations among Frames
We could add a new object, PredicateRelation, to allow for the description of relations occurring between predicates and sub-predicates
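The object-level correspondences listed above (Frame to Predicate, Frame Elements to Arguments, Lexical Unit to SemU) can be sketched as a small conversion function. The input and output dictionary layouts and the `framenet_to_mile` name are assumptions for illustration; the sample frame reuses the GIVE example from the linking slides and is not actual FrameNet data.

```python
def framenet_to_mile(frame):
    """Map a FrameNet-style frame record onto MILE-style objects.

    Field names on both sides are illustrative, not taken from either resource.
    """
    predicate = {
        "type": "Predicate",
        "name": frame["name"],
        "arguments": [
            {"type": "Argument", "role": element}
            for element in frame["frame_elements"]
        ],
    }
    sem_units = [
        {"type": "SemU", "lemma": unit, "predicate": frame["name"]}
        for unit in frame["lexical_units"]
    ]
    return predicate, sem_units


# Sample input modelled on the GIVE example; not copied from FrameNet.
sample = {
    "name": "GIVE",
    "frame_elements": ["Agent", "Theme", "Goal"],
    "lexical_units": ["give"],
}
predicate, sem_units = framenet_to_mile(sample)
```

The sketch always links each SemU to a predicate; modelling the underspecified link suggested above would mean allowing the `predicate` field to be left empty. Frame-to-frame relations have no counterpart here, which mirrors the gap that the proposed PredicateRelation object is meant to fill.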
[Diagram labels: MLC:SynU, MLC:SemU, MLC:SemanticFrame, TypeOfLink Agentnom, IncludedArg 0, MLC:Predicate, MLC:Argument, MLC:Argument, MLC:CorrespSynUSemU, :nom-type ((subject))]
NOMLEX-MILE: Observations
The mapping is promising
Notions represented in NOMLEX have a correspondent in MILE
But .. they are expressed with two opposite lexical structures
In NOMLEX, lexical information is expressed in a very compact way
no clear-cut boundaries between the levels of linguistic description
In MILE, compressed info should be decompressed and spread over different MILE lexical layers and objects: SynU, SemU, SemanticFrame with its Predicate and relevant Arguments, to account for the incorporation of the Agent.
Lessons Learned from the mapping
The results of the experiments are promising
FrameNet offers the possibility to be confronted with two similar lexical models, but not perfectly overlapping lexical objects: test the adequacy of the linguistic objects
NOMLEX gives the opportunity to work with two lexicons where linguistic notions correspond but are expressed with an opposite lexicon structure: test the adequacy of the architectural model
The high granularity and modularity of MILE
allow the compatibility with differently packaged linguistic objects
allow the addition of new objects and relations without distorting the general architecture
RDF and MILE: Why?
Some reasons (from Nancy Ide et al. 2003)
MILE, as a hierarchy of lexical objects built up by combining data categories via clearly defined relations, is an ideal structure for rendering in RDF
The RDF mechanism, with the capacity of expressing named relations between objects, offers a web-based means to represent the MILE architecture
RDF representation of linguistic information is an invaluable resource for language processing applications in the Semantic Web
RDF description and instantiation is in line with the goal of ISO TC37 SC4
RDF Representation of MILE
MILE was already supplied with
an RDF schema for the MILE Syntactic Layer
an instantiation of pre-defined syntactic objects
We increased the repository of shared lexical objects with the RDF description and (partial!) instantiations of the objects of the semantic and linking layers
This has been carried out with the intent to
be submitted within ISO TC37/SC4
foster the adoption of MILE, by offering a library of ready-to-use RDF objects
An RDF Schema for the synt-sem linking
Classes
CorrespSynUSemU: This class links a SynU to a SemU
PredicativeCorresp: This class contains the associations between the syntactic slots and semantic arguments
SlotArgCorresp: This class links a syntactic slot to a semantic argument
An RDF Schema for the synt-sem linking
Properties
hasSourceSynU
hasTargetSemU
hasPredicativeCorresp
includesSlotArgCorresp
The library of Pre-instantiated objects
Enable modular specification of lexical entities
eliminate redundancy
identify lexical entries or sub-entries with shared properties
create ready-to-use packages that can be combined in different ways
Can be used off the shelf or as a departure point for the definition of new or modified categories
MDCR for some objects
A Sample Entry in MILE
The entry is shown in two alternative forms:
the full specification of a lexical object PredicativeCorresp
an already instantiated object PredicativeCorresp
The advantage is that the object does not need to be specified in the entry and can be used and reused in other entries
explore the potential of MILE for representation of lexical data
Sample full entry for amareV
The full object PredicativeCorresp
the abbreviated entry
Instantiated object PredicativeCorresp
The RDF Schema, the DCR for MILE objects and the entries are available at www.ilc.cnr.it/clips/rdf/
... and INTERA?
INTERA Multilingual Terminological Lexica will follow and merge the two frameworks:
MILE and ISO TMF (Terminological Markup Framework)
Beyond MILE: future work
MILE Lexical Model oriented towards an Open Distributed Lexical Infrastructure:
Lexical Information Servers for multiple access to lexical information repositories
Enhance
user-adaptivity
resource sharing
cooperative creation
Develop integration and interchange tools
Broadening MILE: ... other languages
Ongoing enlargement to Asian languages (Chinese, Japanese, Korean, Thai, Hindi ...)
promote common initiatives between Asia & Europe (e.g. within the EU 6th FP)
The creation of an Open Distributed Lexical Infrastructure, also supported by Asian Institutions:
AFNLP
University of Tokyo (Dept. of Computer Science)
Korean KAIST and KORTERM
Academia Sinica (Taiwan)
To valorise results & increase visibility of LR & standardisation initiatives in a world-wide context, while concretely promoting the launching of a new common platform for multilingual LR creation & management
Using semantically tagged corpora to acquire semantic info and enhance Lexicons
evaluate the disambiguating power of the semantic types of the lexicon
assess the need of integrating lexicons with attested senses and/or phraseology
identify the inadequacy of sense distinctions in lexicons
check actual frequency of known senses in different text types
have a more precise and complete view on the semantics of a lemma
identify the most general senses
capture the most specific shifts of meaning
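Two of the checks listed above (frequency of known senses per text type, and senses attested in the corpus but missing from the lexicon) can be sketched in a few lines of Python. The corpus tokens, sense labels and lexicon below are invented toy data, and the function names are hypothetical; this is only an illustration of the kind of analysis meant, not a procedure taken from the slides.

```python
from collections import Counter, defaultdict

# Toy sense-tagged tokens: (text type, lemma, sense label). All invented.
tagged_tokens = [
    ("news", "bank", "bank_institution"),
    ("news", "bank", "bank_institution"),
    ("news", "bank", "bank_river"),
    ("fiction", "bank", "bank_river"),
    ("fiction", "bank", "bank_river"),
    ("fiction", "bank", "bank_slope"),  # attested sense absent from the lexicon
]

# Toy lexicon: the senses the lexicon knows for each lemma.
lexicon_senses = {"bank": {"bank_institution", "bank_river"}}

def sense_profile(tokens, lemma):
    """Count how often each sense of `lemma` occurs in each text type."""
    counts = defaultdict(Counter)
    for text_type, tok_lemma, sense in tokens:
        if tok_lemma == lemma:
            counts[text_type][sense] += 1
    return counts

def senses_missing_from_lexicon(tokens, lexicon):
    """Return the senses attested in the corpus but not listed in the lexicon."""
    return {
        sense
        for _, lemma, sense in tokens
        if sense not in lexicon.get(lemma, set())
    }

profile = sense_profile(tagged_tokens, "bank")
missing = senses_missing_from_lexicon(tagged_tokens, lexicon_senses)

for text_type, counter in profile.items():
    print(text_type, dict(counter))
print("missing from lexicon:", missing)
```

A skewed profile across text types is a hint for sense frequency information, while the `missing` set points to senses the lexicon may need to integrate.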
Capture just the core, basic distinctions in a core lexicon
Corpus analysis must not lead to excessive granularity of sense distinctions, but draw a distinction between:
sense discrimination to be kept under control - clustering (manually or automatically)
additional, more granular information (often of collocational nature) which can/must be acquired/encoded within the broader senses, e.g. to help translation
Dynamic lexicon
Current computational lexicons (even WordNets) are static objects, still shaped on traditional dictionaries and suffering from the limitations induced by paper support
Thinking about the complex relationships between lexicon and corpus: towards a flexible model of dynamic lexicon
extending the expressiveness of a core static lexicon
adapting to the requirements of language in use as attested in corpora
with semantic clustering techniques, etc.
Convert the extreme flexibility & multidimensionality of meaning into large-scale and exploitable (VIRTUAL?) resources
a Lexicon and Corpus together
What to annotate?
Mix of:
Word-sense annotation (implicit semantic markup)
Semantic/conceptual markup
Syntagmatic relations
Dependency relations
Semantic roles
Need for a common Encoding Policy?
Agree on common policy issues? is it feasible? desirable? to what extent?
This would imply, among others:
analysis of needs, also applicative/industrial, before any large development initiative
base semantic tagging on commonly accepted standards/guidelines?? up to which level?
Common semantic tagset: Gold Standard??
build a core set of semantically tagged corpora, encoded in a harmonised way, for a number of languages??
make annotated corpora available to the community at large
involve the community, collect and analyse existing semantically tagged corpora
devise common set of parameters for analysis
A few Issues for discussion: MILE & lexicon standards
More standardisation initiatives?
MILE - a general schema for encoding multilingual lexical info, as a meta-entry, as a common representational layer
Short & medium term requirements wrt standards for multilingual lexicons and content encoding, also industrial requirements
Relation with Spoken language community (see ELRA)
Semantic Web standards & the needs of content processing technologies: importance of reaching consensus on (linguistic & non-linguistic) content, in addition to agreement on formats & encoding issues (words convey content & knowledge)
Define further steps necessary to converge on common priorities
Broadening MILE: ... other communities
NLP, lexicons, terminologies, ontologies, Semantic Web: a continuum?
Knowledge management is critical.
For content interoperability, need to converge around agreed standards also for the semantic/conceptual level
Is the field mature enough to converge around agreed standards also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
Is the field of multilingual lexical resources ready to tackle the challenges set by the Semantic Web development?
Foster better integration with
corpus-driven data
terminology/ontology/semantic web communities
multimodal & multimedial aspects
Oriented towards open, distributed lexical resources:
Lexical Information Servers for multiple access to lexical information repositories
A few Issues for discussion: NLP, lexicons, content, ontologies, Semantic Web: a continuum?
Need for robust systems, able to acquire/tune multilingual lexical/linguistic/conceptual knowledge, to auto-enrich static basic resources
Relation betw. lexical standards & acquisition & text annotation protocols
Target.. Multilingual Knowledge Management Technical Feasibility:
Prerequisite: is a commonly agreed text/lexicon annotation protocol, also for the semantic/conceptual level, an achievable goal (to be able to automatically establish links among different languages)?
Yes, at the lexical level
More complex, for corpus annotation?
EAGLES/ISLE
To make the Semantic Web a reality ...
need to tackle the twofold challenge of content availability & multilinguality
Natural convergence with HLT:
multilingual semantic processing
ontologies
semantic-syntactic computational lexicons
enables a new role of Multilingual Lexicons: to become an essential component for the Semantic Web
Language - & lexicons - are the gateway to knowledge
Semantic Web developers need repositories of words & terms - & knowledge of their relations in language use & ontological classification
The cost of adding this structured and machine-understandable lexical information can be one of the factors that delays its full deployment
The effort of making available millions of words for dozens of languages is something that no small group is able to afford
A radical shift in the lexical paradigm - whereby many participants add linguistic content descriptions in an open distributed lexical framework - required to make the Web usable
Beyond MILE: next steps ... towards an Open Distributed Lexical Infrastructure
Create a first repository of shared lexical entries extracted from different lexical resources & mapped to MILE (choosing e.g. lexical entries in areas related to the Olympic Games)
to test mapping different lexicon models to MILE
provide a grid with all the ISLE Basic Notions, short descriptions, attributes and sub-elements, to be filled with the corresponding "notions"
Create a list (Open Lexicon Interest Group)
... Language
Enhance user-adaptivity, resource sharing, cooperative creation & management
Lexical Information Servers for multiple access to lexical information repositories
Knowledge
A new paradigm for a new generation of LR?
New Strategic Vision
towards a Distributed Open Lexical Infrastructure
Focus on cooperation,
also between different communities for distributed & cooperative creation, management, etc. of Lexical Resources MILE as a common platform
technical & organisational requirements
Beyond MILE: towards open & distributed lexicons
Semantic Lexicon: URI = http://www.xxx
Syntactic Constructions: URI = http://www.yyy
Ontology: URI = http://www.zzz
Monolingual/Multilingual Lexicon
Lex_object: semFeature, URI = http://www.xxx#HUMAN
Lex_object: syntagmaNT, URI = http://www.zzz#NP
corpora
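The idea of a lexicon whose objects are URI references into separately maintained resources can be sketched as follows. The placeholder URIs are the ones shown on the slide; everything else (the in-memory `REPOSITORIES` dictionary standing in for remote servers, the `dereference` helper, the record fields and the sample lemma) is a hypothetical illustration, not part of MILE.

```python
from urllib.parse import urldefrag

# Hypothetical in-memory stand-ins for remote repositories. The base URIs are
# the placeholders from the slide; the record contents are invented.
REPOSITORIES = {
    "http://www.xxx": {"HUMAN": {"kind": "semFeature", "label": "HUMAN"}},
    "http://www.zzz": {"NP": {"kind": "syntagmaNT", "label": "NP"}},
}

def dereference(uri):
    """Split a URI into repository and fragment, then fetch the shared object."""
    base, fragment = urldefrag(uri)
    return REPOSITORIES[base][fragment]

# A lexical entry stores only references, not copies, of the shared objects.
entry = {
    "lemma": "example",  # invented lemma
    "objects": ["http://www.xxx#HUMAN", "http://www.zzz#NP"],
}

resolved = [dereference(uri) for uri in entry["objects"]]
for obj in resolved:
    print(obj["kind"], obj["label"])
```

In a real distributed setting the lookup would go over the network; the point here is only that entries in different lexicons can share one definition by pointing to the same URI.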
A few issues for the future...
Integration betw. WLR/SLR/MMR (see e.g. LREC)
Integration betw. LRs & SemWeb
Integration of Lexicons/Terminologies/Ontologies: towards Knowledge Resources
Multilingual Resources: an open infrastructure
Integration of Lexicon/Corpus (see e.g. Framenet)
Parallel evolution of LRs & LTechnology
from Computational Lexicons to Knowledge Resources
Unified framework for lexicons, ontologies, terminologies, etc.
Towards an open, distributed infrastructure for lexical resources
Lexical Information Servers
flexible and extensible
integrated with multimodal and multimedial data
integrated with Web technology
related initiatives: INTERA, ICWLRE
with a world-wide participation looking for an appropriate call
.. pushing to launch an Open & Distributed Lexical Infrastructure
for content description and content interoperability,
to make lexical resources usable within the emerging Semantic Web scenario
for Language Resources & Semantic Web.
How to go to a framework allowing incremental creation/merging/
How to:
"organise" creation/acquisition of multilingual LRs: evaluate different models
cope with/affect maintenance
organise technology transfer among languages
support BLARK (a commonly agreed list of minimal requirements for national LRs)
launch an international initiative linking Semantic Web & LRs
bootstrap this by "opening" a few LRs
role of standards
Lexical WEB & Content Interoperability
As a critical step for semantic mark-up in the SemWeb
Diagram labels: ComLex, SIMPLE, WordNets, FrameNet, NomLex, Lex_x, Lex_y, MILE; with intelligent agents??
A new paradigm for a new generation of LRs?
Cross-lingual links
The Italian PAROLE and SIMPLE lexicons constitute the basis for the CLIPS lexicon, which is being enlarged with a set of lexical units selected from the PAROLE corpus. At the end of the project, the CLIPS lexicon will consist of 55,000 lemmas encoded at the phonological, morphological and syntactic level and of 55,000 semantic units.
Now, I would like to focus on two aspects of CLIPS:
the link between syntactic and semantic information, and
the way the information encoded in the Extended Qualia Structure can be exploited
Then, the predicate linked to the semantic units is related to the syntactic frame; more precisely, each semantic argument, with its bundle of information, is related to the corresponding frame position of the relevant syntactic structure.
Hence, through the cause change-typed semantic unit, the predicate is related to the transitive syntactic structure by means of a bivalent isomorphic relation holding between arguments and syntactic positions, while through the change-typed one it is linked to the intransitive structure through a non-isomorphic relation indicating that (ARG0:agent) does not map onto any syntactic position while (ARG1:patient) maps onto P0.
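The two kinds of correspondence described above can be made concrete with a small Python sketch. It is a toy model, not the CLIPS encoding: the dictionaries and function names are hypothetical, and the transitive mapping (ARG0 to P0, ARG1 to P1) is an assumption about what "bivalent isomorphic" means here; only the intransitive mapping (ARG0 unrealised, ARG1 on P0) is stated explicitly in the text.

```python
# Each mapping goes from a semantic argument to a syntactic position;
# None means the argument is not realised in the syntactic frame.

# Cause change-typed semantic unit, transitive frame.
# Assumption: "isomorphic" is read as ARG0 -> P0 and ARG1 -> P1.
cause_change = {"ARG0:agent": "P0", "ARG1:patient": "P1"}

# Change-typed semantic unit, intransitive frame (as stated in the text).
change = {"ARG0:agent": None, "ARG1:patient": "P0"}

def is_isomorphic(mapping):
    """True when every argument ARGn maps onto the position Pn."""
    return all(
        position == "P" + argument.split(":")[0][len("ARG"):]
        for argument, position in mapping.items()
    )

def realised_positions(mapping):
    """Return the syntactic positions actually filled by some argument."""
    return sorted(p for p in mapping.values() if p is not None)

print(is_isomorphic(cause_change), realised_positions(cause_change))
print(is_isomorphic(change), realised_positions(change))
```

The same predicate thus reaches two syntactic frames through two semantic units, and the difference between them is carried entirely by the argument-to-position mapping.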