36
ww.isocat.org ISOcat An ISO 12620:2009 Data Category Registry Marc Kemps-Snijders a , Menzo Windhouwer a , Sue Ellen Wright b a Max Planck Institute for Psycholinguistics, b Kent State University marc.kemps-snijders@mpi.nl , [email protected] , [email protected] 13/7/2010 1 CLARA 2010 Summer School

ISOcat: an ISO 12620:2009 Data Category Registry

Embed Size (px)

Citation preview

Page 1: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 1

ISOcatAn ISO 12620:2009 Data Category Registry

Marc Kemps-Snijdersa, Menzo Windhouwera, Sue Ellen Wrightb

aMax Planck Institute for Psycholinguistics, bKent State [email protected] , [email protected], [email protected]

13/7/2010

Page 2: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 2

Outline

• ISO 12620:2009– What are Data Categories?– How can you use Data Categories?– What is a Data Category Registry?– How can you use a Data Category Registry?

• ISOcat– Demonstration/Tutorial

• Future work• Handson session13/7/2010

Page 3: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 3

ISO 12620:2009

• Terminology and other content and language resources — Specification of data categories and management of a Data Category Registry for language resources– An ISO TC 37/SC 3 standard (see [1])– Successor to ISO 12620:1999 which contained a

hardcoded list of Data Categories

13/7/2010

Page 4: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 4

What is a Data Category?

• The result of the specification of a given data field– A data category is an elementary descriptor in a linguistic

structure or an annotation scheme.

• Specification consists of 3 main parts:– Administrative part

• Administration and identification

– Descriptive part• Documentation in various working languages

– Linguistic part• Conceptual domain(s for various object languages)

13/7/2010

Page 5: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 5

Data category example

• Data category: /Grammatical gender/– Administrative part:

• Identifier: grammaticalGender• PID: http://www.isocat.org/datcat/DC-1297

– Descriptive part:• English definition: Category based on (depending on languages)

the natural distinction between sex and formal criteria.• French definition: Catégorie fondée (selon la langue) sur la

distinction naturelle entre les sexes ou d'autres critères formels.

– Linguistic part:• Morposyntax conceptual domain: /male/, /feminine/, /neuter/• French conceptual domain: /male/, /feminine/

13/7/2010

Page 6: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 6

Data Category specification – Administrative part

13/7/2010

DCR

Data CategoryGlobal Information

Administration Information Section

Administration Record

Registration Group

Submission Group

Stewardship Group

Decision Group

Change

0..1

1

0..1

1

0..1

1

0..1

1

1..*

1

0..1

1

1..*

1

1

1

1

1

Page 7: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 7

Data Category specification – Descriptive part

13/7/2010

Data Category

Description Section

Language Section

Name Section

Definition Example

Explanation

Data Element Name

0..*

1

1

1

0..*

1

0..*

1

1..*

1

0..*

1

0..*

1

Page 8: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 8

Data Category specification – Linguistic part

13/7/2010

Data Category

Complex Data Category

Simple Data Category

Closed Data Category

Open Data Category

Constrained Data Category

Linguistic Section

Closed Linguistic Section

Constrained Linguistic Section

Conceptual Domain

Value Domain

Open Conceptual Domain

Schema Specific Domain

Profile Value Domain

Example Explanation

0..*

1

0..*1

1..*1

0..*

1

1..*

1 0..*

1

0..10..*

11

0..* 1

1..*

1

1..* 1

0..1 1

0..*

1

0..*

1

1..*1

Page 9: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 9

Mandatory parts of the specification

• For each data category:– a mnemonic identifier– an English definition– an English name

• For complex data categories:– a conceptual domain

• For standardization candidates:– a profile– a justification

13/7/2010

Page 10: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 10

Guidelines for the specification

13/7/2010

• Identifier:– camel case and XML-valid element name (without a

namespace)• partOfSpeech• my:POS, 123POS

• Data Element Name:– language independent name for the data category used in

a specific application domain (specified in the source)• PoS in TBX• NN in myTagset or N in yourTagset (if widely used)

(see [2])

Page 11: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 11

More guidelines

13/7/2010

• Name Section in a Language Section– legible name

• ‘part of speech’ in the English language section• ‘partie du discours’ in the French language section

• Definition:– intentional definitions (ISO 704)– should consist of a single sentence fragment

• Source:– add a source for any quoted material

Page 12: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 12

More guidelines

13/7/2010

• Justification:– a simple statement justifying the relevance of the

data category to the field of language resources– especially needed for standardization

Page 13: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 13

Data Category types

13/7/2010

writtenForm

string

open

grammaticalGender

string

neuter

masculine

feminine

closed

simple:

email

string

constrained

Constraint: .+@.+

complex:

Page 14: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 14

Data Category relationships

13/7/2010

• Value domain membership

• Subsumption relationships between simple data categories

• Relationships between complex data categories are not stored in the DCR

partOfSpeech

string

pronoun

personalpronoun

Page 15: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 15

How can you use Data Categories?

13/7/2010

Lexicon

Lexical Entry

Form Sense

0..*

0..*1..*

1..*

partOfSpeech

writtenForm

writtenForm

grammaticalGender

lexicalType

Word Form

Lemma

Language BWO genders

grammaticalGenderwordOrder

A LMF (ISO 24613:2008) complaint(schema for a) lexicon

A (schema for a) typological database

Shar

ed se

man

tics!

Page 16: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 16

How?

13/7/2010

<lmf:lexicon xml:lang=“jp” alphabet=“ipa”><lmf:entry><lmf:lemma><lmf:writtenForm>nihongo</…>…</…>…</…>…

</…>

Page 17: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 17

Referencing Data Categories

• Each Data Category should be uniquely identifiable– Ambiguity: different domains use the same term but mean

different ‘things’– Semantic rot: even in the same domain the meaning of a

term changes over time– Persistence: for archived resources Data Category references

should still be resolvable and point to the specification as it was at/close to time of creation

• ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications

13/7/2010

Page 18: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 18

Data Categories Persistent IDentifiers

• persistent identifier (PID)– “unique Uniform Resource Identifier (URI) that

ensures permanent access for a digital object by providing access to it independently of its physical location or current ownership” (see [1])

• For Data Categories this digital object is a specific version of a Data Category specification, i.e., each version of a Data Category has its own PID

13/7/2010

Page 19: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 19

Where do you put these references?

• Preferably in a schema:

<rng:attribute name=“alphabet” dcr:datcat=“http://www.isocat.org/datcat/…”><rng:value dcr:datcat=“http://www.isocat.org/datcat/…”>

ipa</…>…

</…>

13/7/2010

Page 20: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 20

ISO TC 37 standards using Data Categories

• Terminological Markup Framework (TMF; ISO 16642)• Lexical Markup Framework (LMF; ISO 24613)• TermBase eXchange (TBX; ISO 30042)• Morpho-syntactic Annotation Framework (MAF; ISO 24611)• Linguistic Annotation Framework (LAF; ISO 24612)

• Meta models which can be instantiated into a specific model with data categories

• However, some still refer to ISO 12620:1999 Data Categories and some don’t support all types (see [3])

13/7/2010

Page 21: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 21

Other uses of Data Categories

• CLARIN Component Metadata Infrastructure (CMDI)• ISO 12620:2009 provides a small XML vocabulary, DC

Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents– Including: XML Schema, Relax NG, TEI/ISO feature

structures, …• The references can be used in URI based ‘mappings’:

– Including: ODD, RDF-based vocabularies (OWL, SKOS), …

13/7/2010

Page 22: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 22

What is a Data Category Registry?

• A (coherent) set of Data Categories, in our case for linguistic resources

• A system to manage this set:– Create and edit Data Categories– Share Data Categories, e.g., resolve PID references– Standardize Data Categories

13/7/2010

Page 23: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 23

Standardize Data Categories

13/7/2010

Submissiongroup

Data Category RegistryBoard

Validation

Thematic DomainGroup

Evaluation

Stewardshipgroup

Decision Group

rejected rejected

Publication

Page 24: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 24

Thematic Domain Groups

13/7/2010

TDG 1: MetadataTDG 2: MorphosyntaxTDG 3: Semantic Content Representation TDG 4: Syntax TDG 5: Machine Readable DictionaryTDG 6: Language Resource OntologyTDG 7: LexicographyTDG 8: Language CodesTDG 9: TerminologyTDG 11: Multilingual Information ManagementTDG 12: Lexical ResourcesTDG 13: Lexical SemanticsTDG 14: Source Identification

• TDGs are the owner and guardians of a coherent subset of the DCR

• TDGs own one or more profiles

• Each TDG has a chair• A number of judges (assigned by

SC P members)• A number of expert members (up

to 50%)

• TDGs are constituted at the TC37/SC plenary

• New TDGs need to be proposed by a SC

1. Translation2. Sign language3. Audio

Page 25: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 25

How can you use a Data Category Registry?

• You can:– Find Data Categories relevant for your resources and embed

references to them so the semantics of (parts of) your resources are made explicit

• This can be supported by tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor directly interact with ISOcat

– Interact with Data Category owners to improve (the coverage of) their Data Categories

– Create (together with others) new Data Categories needed for your resources and share those

– Submit (your) Data Categories for standardization

– Free of charge

13/7/2010

Page 26: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 26

ISOcat

• Reference implementation of ISO 12620:2009• The TC 37 Data Category Registry

13/7/2010

Page 27: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 2713/7/2010

A glimpse of ISOcat

Page 28: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 28

Data Category Interchange Format (DCIF)

• Simplified XML serialization of the data model (see [4])

13/7/2010

Data Category Selection

[1]

Global Information[0..1]

Data Category[1..*]

Administration Information Section

[1]

Description Section[1]

Administration Record

[1]

Language Section[1..*]

Name Section [0..*]

Data Element NameSection

[0..*]

Conceptual Domain [0..*]

Linguistic Section [0..*]

Conceptual Domain [0..*]

Page 29: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 29

RESTful Web Services

• read-only programming interface to the DCR (see [5])• allows tools to interact with ISOcat to help an user to

embed PIDs in their resources• mainly based on DCIF• uses authentication to access private/shared Data

Categories• currently used by:

– LEXUS: populate an LMF model– ELAN: create controlled vocabularies– CMDI Component Editor: create concept links for component

elements

13/7/2010

Page 30: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 30

Persistent IDentifiers

• ISOcat uses ‘cool URIs’ as PIDs (see [6])– these URIs will never change, but resolve to the

current location in the current implementation, e.g., in ISOcat they resolve to a RESTful Web Service call

– the isocat.org domain is bound to ISO 12620:2009 and the Registration Authority, currently the MPI, is obliged to keep the PIDs associated with this domain resolvable

13/7/2010

Page 31: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 31

Future work

• Finish first complete version of ISOcat:– Standardization process

• Cleanup of the current set of Data Categories– TDGs cleanup their profiles– Standardize first sets of Data Categories

• Interaction with other TC 37 standards:– Migration from ISO 12620:1999– Full support for all types of Data Categories

13/7/2010

Page 32: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 32

More future work

• Additional Data Categories types– Container Data Categories

• Complex and Simple only cover ‘leafs’ and their values

– Data Category Concepts• Basic building blocks for knowledge bases

• Relation Registries– Stores (your) (semantic) relationships between

Data Categories

13/7/2010

Page 33: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 33

Registry network

13/7/2010

Linguistic resources

Data category registries

Relation registries

MPIDCR

ISODCR

Typological Database SystemRRMPI RR

MPIarchive

TDSdatabaseresource

Page 34: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 34

Handson session• Register with ISOcat

– http://www.isocat.org/• Can you find Data Categories relevant to your type of resources?

– Use the search (options), the explorer, …– Create and save a Data Category Selection

• Do you miss Data Categories?– Create a new Data Category

• Do you want to share Data Categories/selections?– Create together with some students a group– Share your selection with this group

• Do you miss functionality?– Let us know

13/7/2010

Page 35: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 3513/7/2010

Thank you for your attention!

Visitwww.isocat.org

Questions?www.isocat.org/forum/

[email protected]

Page 36: ISOcat: an ISO 12620:2009 Data Category Registry

www.isocat.org

CLARA 2010 Summer School 36

References[1] ISO 12620, Terminology and other language and content resources --

Specification of data categories and management of a Data Category Registry for language resources.

[2] http://www.isocat.org/manual/DCRGuidelines.pdf[3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat

data categories. In proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.

[4] http://www.isocat.org/12620/[5] http://www.isocat.org/rest/help.html[6] Tim Berners-Lee, Cool URIs don't change, 1998.

13/7/2010