Standards for language resources the ISO/TC 37(/SC 4) perspective Laurent Romary Directeur de Recherche INRIA ISO/TC 37/SC 4 chair


Page 1: Standards for language resources the ISO/TC 37(/SC 4) perspective

Standards for language resources the ISO/TC 37(/SC 4) perspective

Laurent Romary

Directeur de Recherche INRIA

ISO/TC 37/SC 4 chair

Page 2: Standards for language resources the ISO/TC 37(/SC 4) perspective

Context

ISO TC37 - Terminology and other language resources
  SC3 - Computer applications in terminology
    ISO 12200 - Martif
    ISO 12620 - Data categories (under revision)
    ISO 16642 - TMF (Terminological Markup Framework)
  SC4 - Language Resource Management
    www.tc37sc4.org

Page 3: Standards for language resources the ISO/TC 37(/SC 4) perspective

An example scenario: information extraction

[Diagram labels: Primary Data, POS tagging, Part-of-speech tagging, Chunk parsing, Syntactic structures, Content analysis, Semantic content]

Page 4: Standards for language resources the ISO/TC 37(/SC 4) perspective

Horizontal view (W3C perspective)

  XML
  SOAP
  RDF
  OWL

[Diagram labels: Primary Data, POS tagging, Part-of-speech tagging, Chunk parsing, Syntactic structures, Content analysis, Semantic content]

Page 5: Standards for language resources the ISO/TC 37(/SC 4) perspective

Vertical view (ISO/TC 37/SC 4 perspective)

  Evaluation
  Lexica
  Linguistic models and descriptors (Data Categories)

[Diagram labels: Primary Data, POS tagging, Part-of-speech tagging, Chunk parsing, Syntactic structures, Content analysis, Semantic content]

Page 6: Standards for language resources the ISO/TC 37(/SC 4) perspective

Linguistic information sources …and initiatives

Primary resources (text, dialogues)
  Structural mark-up
  Basic annotations
  [TEI, MPEG7, TMX, XLIFF, XHTML, etc.]

NLP structures (annotations)
  POS tagging
  Chunks (cf. Named Entities)
  Deep syntactic structures
  Co-references etc.
  [Eagles/ISLE, CES, MATE, …]

Knowledge structures
  Hierarchies of types
  Relations between concepts (subjects/topics etc.)
  Links to primary resources
  [Topic Maps, OIL, RDF]

Lexical structures (Language models)
  Terminologies
  Transfer lexica
  LTAG/HPSG/LFG lexica
  [TBX, OLIF, Eagles/ISLE (Genelex)]

Links

Meta-data
  [Dublin core, OLAC, ISLE, MPEG7, RDF]

Access protocols
  [Corba, SOAP]

Page 7: Standards for language resources the ISO/TC 37(/SC 4) perspective

SC4 Approach

Efforts geared towards defining abstract models and general frameworks for the creation and representation of language resources
  In principle, abstract enough to accommodate diverse linguistic, theoretical or practical approaches

No provision of new formats
  Situate development squarely in the framework of XML and related standards
  Ensure compatibility with established and widely accepted web-based technologies
  Ensure feasibility of transduction from legacy formats into newly defined formats

Page 8: Standards for language resources the ISO/TC 37(/SC 4) perspective

SC4 and other standardizing bodies

W3C - basic protocols and formats
  XML (Schemas), XPath, XPointer
  + RDF, SVG, SMIL, SOAP

MPEG - Multimedia, XML based
  e.g. MPEG7-4
  Word and phone lattices

ISO TC37/SC4 - language resources, NLP perspective
  e.g. linguistic annotations, lexical formats

TEI - text representation
  Reference for primary sources
  e.g.: text archives

[Diagram labels: Text, Audio/Speech, Technical background, Oscar, Contributing organizations]

Page 9: Standards for language resources the ISO/TC 37(/SC 4) perspective

ISO/TC 37/SC 4 structure

WG1: Basic descriptors and mechanisms for language resources
WG2: Representation schemes
WG3: Multilingual text representation
WG4: Lexical databases
WG5: Workflow of language Resource Management

Data categories

Page 10: Standards for language resources the ISO/TC 37(/SC 4) perspective

On-going activities

Feature structure representation (in collaboration with the TEI - Text Encoding Initiative)
  ISO DIS 24610
Morpho-syntactic annotation
  ISO NP 24611
Lexical markup framework
  ISO NP 24612 (+ ISO NP 12620-3)
Task force on Meta-data for language resources (OLAC+IMDI)
ACL/Sigsem working group on multimodal content representation
Data category registry for ISO/TC 37
  ISO CD 12620-1 on ballot (deadline Jan. 2004)

Page 11: Standards for language resources the ISO/TC 37(/SC 4) perspective

Modeling linguistic annotation structures

Page 12: Standards for language resources the ISO/TC 37(/SC 4) perspective

General framework - 1

Model for linguistic annotation that can be instantiated in a standard representational format
  GMT: Generic Mapping Tool
  Serves as a pivot format into and out of which proprietary formats may be transduced, to enable comparison, merging and manipulation via common tools

Reference: ISO 16642 - Terminological Markup Framework
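
The pivot idea can be illustrated with a minimal Python sketch. It is not GMT itself: the pivot here is simply a list of (token, tag) pairs, and the two "proprietary" formats (`slash` and `tsv`) and all function names are hypothetical, chosen only to show that each format needs one reader and one writer against the pivot.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical pivot representation: a list of (token, tag) pairs.
Pivot = List[Tuple[str, str]]


def read_slash(text: str) -> Pivot:
    """Read an assumed 'word/TAG word/TAG' format into the pivot."""
    pivot: Pivot = []
    for item in text.split():
        token, tag = item.rsplit("/", 1)
        pivot.append((token, tag))
    return pivot


def write_slash(pivot: Pivot) -> str:
    return " ".join(f"{token}/{tag}" for token, tag in pivot)


def read_tsv(text: str) -> Pivot:
    """Read an assumed one-token-per-line 'word<TAB>TAG' format."""
    pivot: Pivot = []
    for line in text.splitlines():
        if line:
            token, tag = line.split("\t", 1)
            pivot.append((token, tag))
    return pivot


def write_tsv(pivot: Pivot) -> str:
    return "\n".join(f"{token}\t{tag}" for token, tag in pivot)


READERS: Dict[str, Callable[[str], Pivot]] = {"slash": read_slash, "tsv": read_tsv}
WRITERS: Dict[str, Callable[[Pivot], str]] = {"slash": write_slash, "tsv": write_tsv}


def convert(text: str, source: str, target: str) -> str:
    """Transduce source -> pivot -> target."""
    return WRITERS[target](READERS[source](text))


converted = convert("the/DET cat/NOUN", "slash", "tsv")
```

With N formats this design needs 2N converters instead of one converter per ordered pair of formats, and any comparison or merging tool can be written once against the pivot.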

Page 13: Standards for language resources the ISO/TC 37(/SC 4) perspective

General framework - 2

A meta-model
  A general, underlying model that informs current practice

A set of data-categories
  Provides the precise semantics of the format
  Obtained:
    By sub-setting a Data Category Registry
    By providing application specific categories

Page 14: Standards for language resources the ISO/TC 37(/SC 4) perspective

ISO 16642: A family of formats

TMF
  TML1, TML2, TML3, TMLi… (TBX) (Geneter)
  GMT

Page 15: Standards for language resources the ISO/TC 37(/SC 4) perspective
Page 16: Standards for language resources the ISO/TC 37(/SC 4) perspective

Meta-model

Terminological Data Collection (TDC)
  Global Information (GI)
  Complementary Information (CI)
  Terminological Entry (TE) *
    Language Section (LS) *
      Term Section (TS) *
        Term Component Section (TCS) *
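
As a rough illustration, the nesting of the meta-model can be written down as Python dataclasses. This is only a sketch: the class layout follows the levels named on the slide, but the field names (`info`, `terms`, `languages`, `entries`) and the helper `terms_by_language` are assumptions made for the example, not part of ISO 16642.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative: data category name -> value attached to a level.
Info = Dict[str, str]


@dataclass
class TermComponentSection:  # TCS
    info: Info = field(default_factory=dict)


@dataclass
class TermSection:  # TS, holds any number of TCS
    info: Info = field(default_factory=dict)
    components: List[TermComponentSection] = field(default_factory=list)


@dataclass
class LanguageSection:  # LS, holds any number of TS
    info: Info = field(default_factory=dict)
    terms: List[TermSection] = field(default_factory=list)


@dataclass
class TerminologicalEntry:  # TE, holds any number of LS
    info: Info = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)


@dataclass
class TerminologicalDataCollection:  # TDC, holds GI, CI and any number of TE
    global_info: Info = field(default_factory=dict)
    complementary_info: Info = field(default_factory=dict)
    entries: List[TerminologicalEntry] = field(default_factory=list)


def terms_by_language(entry: TerminologicalEntry) -> Dict[str, List[str]]:
    """Collect the term strings of an entry, grouped by language code."""
    return {
        ls.info["lang"]: [ts.info["term"] for ts in ls.terms]
        for ls in entry.languages
    }


entry = TerminologicalEntry(
    info={"id": "ID67", "subjectField": "manufacturing"},
    languages=[
        LanguageSection(
            info={"lang": "en"},
            terms=[TermSection(info={"term": "alpha smoothing factor",
                                     "termType": "fullForm"})],
        )
    ],
)
collection = TerminologicalDataCollection(entries=[entry])
```

The sample values are taken from the ID67 entry used in the TMF example of this deck; only its English section is reproduced here.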

Page 17: Standards for language resources the ISO/TC 37(/SC 4) perspective

TMF: example

TE: id='ID67', subjectField='manufacturing', definition='A value…'
  LS: lang='hu'
    TS: term='…'
  LS: lang='en'
    TS: term='alpha smoothing factor', termType='fullForm'

Page 18: Standards for language resources the ISO/TC 37(/SC 4) perspective

Implementation in TBX (cf. www.lisa.org)

<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
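
To show how such an entry maps back onto the TE / LS / TS levels, here is a small Python sketch that reads the fragment above with the standard library's `xml.etree.ElementTree`. The function name `read_entry` and the shape of the returned dictionary are assumptions made for illustration; the element and attribute names are those shown on the slide.

```python
import xml.etree.ElementTree as ET

TBX_FRAGMENT = """
<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>
"""


def read_entry(xml_text: str) -> dict:
    """Turn one termEntry into a nested dict (hypothetical layout)."""
    root = ET.fromstring(xml_text.strip())
    # Entry level (TE): identifier plus typed descriptions.
    entry = {"id": root.get("id"), "languages": {}}
    for descrip in root.findall("descrip"):
        entry[descrip.get("type")] = descrip.text
    # Language level (LS): one langSet per language.
    for lang_set in root.findall("langSet"):
        terms = []
        # Term level (TS): one tig per term.
        for tig in lang_set.findall("tig"):
            term = {"term": tig.findtext("term")}
            for note in tig.findall("termNote"):
                term[note.get("type")] = note.text
            terms.append(term)
        entry["languages"][lang_set.get("lang")] = terms
    return entry


entry = read_entry(TBX_FRAGMENT)
```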

Page 19: Standards for language resources the ISO/TC 37(/SC 4) perspective

Implementing a Data Category Registry for ISO TC37

Page 20: Standards for language resources the ISO/TC 37(/SC 4) perspective

Data Category

Definition:
  Elementary descriptor used in a linguistic description or annotation scheme

Example:
  /Part of speech/, /Grammatical gender/, /Grammatical number/, /Feminine/, /Plural/, /Ablative/

Background:
  Experience gained from ISO 16642 in linguistic format specification
  Wider notion of data-categories as meta-data for tagged language resources

Page 21: Standards for language resources the ISO/TC 37(/SC 4) perspective

Multiple uses of data categories

[Diagram labels: Data category selection, Meta model, Documentation, Meta-data, XML schemas, XSL filters]

Page 22: Standards for language resources the ISO/TC 37(/SC 4) perspective

Application domains

Terminological data collection (TC 37/SC 3)
  Cf. “old” ISO 12620 set of data categories for terminology

Language codes (TC 37/SC 2)
  Cf. evolution from ISO 639-1 and ISO 639-2 to ISO 639-4

On-going and future SC4 activities (TC 37/SC 4)
  Meta-data for language resources
  Morpho-syntax/Syntax, Discourse level annotation
  NLP lexica, MT lexica
  Multilingual data representations (e.g. translation memories) and access (query languages)

Page 23: Standards for language resources the ISO/TC 37(/SC 4) perspective

Technical background

ISO 11179 (ISO JTC 1/SC 32): meta-data registry view
  Provides mechanisms for the management of data categories

ISO 16642 (ISO TC 37/SC 3): terminology view
  Provides ways of dealing with multilingual issues

OWL (W3C Sem. Web activity): ontology view
  Provides a framework for dealing with hierarchies and expressing constraints on data-categories
  E.g. a /noun/ can be described by means of /gender/ and /number/ in French

Page 24: Standards for language resources the ISO/TC 37(/SC 4) perspective

Relation to ISO 11179

Data element concept / Conceptual domain
  Complex datcat: /gender/
  Set of simple datcats: /masculine/, /feminine/, /neuter/

Data element / Value domain
  XML object: implemented as an XML attribute named 'gen' (XML schema declaration)
  List of values: m, f, n

Example: <w lemme="vert" gen="f">verte</w>
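
The link between the value domain (m, f, n) and the conceptual domain can be sketched in a few lines of Python. The mapping table `GEN_VALUES` and the function `gender_of` are hypothetical names introduced for this example; the pairing of m, f, n with /masculine/, /feminine/, /neuter/ is the natural reading of the slide.

```python
import xml.etree.ElementTree as ET

# Value domain of the 'gen' attribute -> simple data categories.
GEN_VALUES = {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}


def gender_of(word_xml: str) -> str:
    """Return the simple data category encoded by the 'gen' attribute."""
    element = ET.fromstring(word_xml)
    code = element.get("gen")
    if code not in GEN_VALUES:
        raise ValueError(f"'gen' value {code!r} is outside the value domain")
    return GEN_VALUES[code]


category = gender_of('<w lemme="vert" gen="f">verte</w>')
```

Here `category` is `/feminine/`; a `gen` value outside m, f, n (or a missing attribute) raises `ValueError`.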

Page 25: Standards for language resources the ISO/TC 37(/SC 4) perspective

The ISO 12620-1 proposal

Entry Identifier: gender
Profile: morpho-syntax
Definition (fr): Catégorie grammaticale reposant, selon les langues et les systèmes, sur la distinction naturelle entre les sexes ou sur des critères formels (Source: TLFi)
Definition (en): Grammatical category… (Source: TLFi (Trad.))
Conceptual Domain: {/feminine/, /masculine/, /neuter/}

  Object Language: fr
  Name: genre
  Conceptual Domain: {/feminine/, /masculine/}

  Object Language: en
  Name: gender

  Object Language: de
  Name: Geschlecht
  Conceptual Domain: {/feminine/, /masculine/, /neuter/}
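
A registry entry of this shape could be modelled as follows. This is a minimal sketch, not the ISO 12620-1 data model: the class and field names are invented for illustration, and falling back to the general conceptual domain when a language section states none (as for English above) is an assumption of the example.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ObjectLanguageSection:
    name: str
    # None means the section states no language-specific domain.
    conceptual_domain: Optional[FrozenSet[str]] = None


@dataclass
class DataCategoryEntry:
    identifier: str
    profile: str
    conceptual_domain: FrozenSet[str]
    languages: Dict[str, ObjectLanguageSection] = field(default_factory=dict)

    def domain_for(self, lang: str) -> FrozenSet[str]:
        """Language-specific domain if stated, else the general one (assumed rule)."""
        section = self.languages.get(lang)
        if section is None or section.conceptual_domain is None:
            return self.conceptual_domain
        return section.conceptual_domain


ALL_GENDERS = frozenset({"/feminine/", "/masculine/", "/neuter/"})

GENDER = DataCategoryEntry(
    identifier="gender",
    profile="morpho-syntax",
    conceptual_domain=ALL_GENDERS,
    languages={
        "fr": ObjectLanguageSection("genre", frozenset({"/feminine/", "/masculine/"})),
        "en": ObjectLanguageSection("gender"),
        "de": ObjectLanguageSection("Geschlecht", ALL_GENDERS),
    },
)

neuter_allowed_in_french = "/neuter/" in GENDER.domain_for("fr")
```

With the values from the slide, `neuter_allowed_in_french` is `False`, while `GENDER.domain_for("de")` still contains `/neuter/`.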

Page 26: Standards for language resources the ISO/TC 37(/SC 4) perspective

Perspectives

ISO/TC 37/SC 4 in a wider picture
  Basic building blocks to bring coherence to the representation of linguistic information in a variety of application domains
    E.g. e-documentation, e-learning, e-business (e-catalogues), multimedia, localisation…
  Provide vertical solutions for linguistically based applications
    E.g. information extraction, indexing